
13 May 09

libxml Extra content at the end of the document

The Ruby libxml parser that I use to process large XML files in SAX mode refused to process a file that looked perfectly valid, throwing an ‘Extra content at the end of the document’ error somewhere in the middle of the file. It turned out that it disliked the control character \x0B (vertical tab), which is not allowed in XML 1.0 according to the spec.
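Since the parser's error message does not say which character upset it, a small scan of the file can help locate the culprit. The following is only a sketch: the helper names `illegal_xml_control_byte?` and `find_illegal_control_bytes` are made up for illustration, and it assumes the file is in an ASCII-compatible encoding such as UTF-8, where control characters are always single bytes below 0x20.

```ruby
# ASCII control bytes that XML 1.0 forbids: everything below 0x20
# except tab (0x09), line feed (0x0A) and carriage return (0x0D).
def illegal_xml_control_byte?(byte)
  byte < 0x20 && ![0x09, 0x0A, 0x0D].include?(byte)
end

# Hypothetical helper: returns [line_number, byte_column, byte] for each
# offending byte. Accepts any IO-like object that responds to each_line.
def find_illegal_control_bytes(io)
  hits = []
  io.each_line.with_index(1) do |line, line_number|
    line.each_byte.with_index(1) do |byte, column|
      hits << [line_number, column, byte] if illegal_xml_control_byte?(byte)
    end
  end
  hits
end

# Example use (file name taken from the post):
#   File.open('file.xml', 'rb') do |f|
#     find_illegal_control_bytes(f).each do |line, col, byte|
#       puts format('line %d, byte %d: 0x%02X', line, col, byte)
#     end
#   end
```

Scanning byte by byte in Ruby is not fast, so on a very large file this may take a while; it is meant as a one-off diagnostic rather than part of the processing pipeline.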

To simply remove the vertical tabs from the file (or, rather, replace them with spaces), I tried using sed like this:

sed 's/\x0B/ /g' file.xml

but I found out that the \xXX syntax is not supported by the version of sed that ships with OS X, which is a shame. So I used a Ruby script instead, which, to my surprise, was quick enough to process an 800 MB file.

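# Copy file.xml to out.xml line by line, replacing each vertical tab (\x0B) with a space.
File.open('out.xml', 'w') do |output|
  File.foreach('file.xml') { |line| output.puts line.gsub(/\x0B/, ' ') }
end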
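The vertical tab is only one of several control characters that XML 1.0 rejects, so the same approach can be widened to cover them all. This is a sketch rather than the script I actually ran: `FORBIDDEN_CONTROLS` and `scrub_xml_controls` are names invented here, and it assumes an ASCII-compatible encoding such as UTF-8 (it would corrupt UTF-16 input).

```ruby
# ASCII control characters that XML 1.0 forbids, in String#tr range syntax:
# everything below 0x20 except tab, line feed and carriage return.
FORBIDDEN_CONTROLS = "\x00-\x08\x0B\x0C\x0E-\x1F"

# Hypothetical helper: copies input to output line by line, replacing each
# forbidden control character with a space. Both arguments are IO-like.
def scrub_xml_controls(input, output)
  input.each_line { |line| output.write(line.tr(FORBIDDEN_CONTROLS, ' ')) }
end

# Example use (file names taken from the post); binary mode leaves all
# other bytes untouched:
#   File.open('file.xml', 'rb') do |input|
#     File.open('out.xml', 'wb') { |output| scrub_xml_controls(input, output) }
#   end
```

Replacing with a space instead of deleting keeps byte offsets and line lengths unchanged, which makes it easier to compare the cleaned file against the original.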



