13 May 09

libxml Extra content at the end of the document

The Ruby libxml parser that I use to process large XML files in SAX mode refused to process a file that looked perfectly valid, throwing an ‘Extra content at the end of the document’ error somewhere in the middle of the file. It turned out that it disliked the control character \x0B (vertical tab), which is not allowed in XML according to the spec.
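For reference, XML 1.0 permits only three characters below 0x20: tab, line feed and carriage return. The following is a minimal sketch of a check for the forbidden ones; the constant and the helper name `xml_safe?` are made up for illustration and are not part of libxml.

```ruby
# Control characters that XML 1.0 does not allow: everything in 0x00-0x1F
# except tab (0x09), line feed (0x0A) and carriage return (0x0D).
# Simplification: the spec also excludes U+FFFE, U+FFFF and surrogates,
# which this sketch does not check.
INVALID_XML_CONTROL = /[\x00-\x08\x0B\x0C\x0E-\x1F]/

# Hypothetical helper: true if the text contains no forbidden control character.
def xml_safe?(text)
  !text.match?(INVALID_XML_CONTROL)
end

puts xml_safe?("<a>ok\tfine</a>")     # tab is allowed      => true
puts xml_safe?("<a>bad\x0Bchar</a>")  # vertical tab is not => false
```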

To simply remove the vertical tabs from the file (or, rather, replace them with spaces), I tried using sed like this:

sed 's/\x0B/ /g' file.xml

but I found out that the \xXX syntax is not supported by the OS X version of sed, which is a shame. So I used a Ruby script, which, to my surprise, was quick enough to process an 800 MB file.

# Replace every vertical tab (\x0B) with a space, streaming line by line.
File.open('out.xml', 'w') do |output|
  File.foreach('file.xml') { |line| output.write(line.gsub(/\x0B/, ' ')) }
end
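The same approach can probably be extended to every control character that XML 1.0 forbids, not just the vertical tab. Here is a rough sketch under that assumption; `scrub_xml` and `FORBIDDEN_CONTROL` are hypothetical names, and the example runs on in-memory streams rather than real files so that it is self-contained.

```ruby
require 'stringio'

# Control characters forbidden in XML 1.0: all of 0x00-0x1F except
# tab (0x09), line feed (0x0A) and carriage return (0x0D).
FORBIDDEN_CONTROL = /[\x00-\x08\x0B\x0C\x0E-\x1F]/

# Hypothetical helper: copies input to output line by line, replacing each
# forbidden control character with a space. Returns the replacement count.
def scrub_xml(input, output)
  replaced = 0
  input.each_line do |line|
    cleaned = line.gsub(FORBIDDEN_CONTROL) do
      replaced += 1
      ' '
    end
    output.write(cleaned)
  end
  replaced
end

# In-memory demo; for real files, pass File objects opened for reading/writing.
src = StringIO.new("<r>a\x0Bb</r>\n<r>c\x0Cd</r>\n")
dst = StringIO.new
puts scrub_xml(src, dst)  # number of characters replaced
puts dst.string
```

Since it reads one line at a time, memory use should stay flat even on a file of several hundred megabytes, as long as the file has line breaks.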


3 Responses to “libxml Extra content at the end of the document”


  1. 1 Ahsan Ali
    June 3, 2009 at 8:13 am

    A Big Thanks ! I was stumped as to why it was terminating with that error in the _middle_ of the file !

  2. 2 robin
    February 9, 2010 at 2:49 pm

    How did you discover the rogue character? I tried gsubbing away that character in my xml but it doesn’t appear to be the culprit :(

    • 3 Evgeny Shadchnev
      February 9, 2010 at 3:11 pm

      I was processing an XML file in SAX mode, so I knew which record threw an exception. Then I looked at the offending record in hex mode, looking for anything below 0x1F. I also found out that things like  also raise an exception inside my parser.
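      The hex inspection described above could also be automated. What follows is only a sketch, not the script that was actually used: the helper name `find_control_chars` is made up, and it treats tab, line feed and carriage return as legal, since XML allows those three.

```ruby
require 'stringio'

# Bytes below 0x20 that XML 1.0 does allow: tab, line feed, carriage return.
ALLOWED_CONTROL_BYTES = [0x09, 0x0A, 0x0D].freeze

# Hypothetical helper: scans an IO byte by byte and returns one entry per
# suspicious byte as [line_number, byte_offset_within_line, hex_code].
def find_control_chars(io)
  hits = []
  io.each_line.with_index(1) do |line, lineno|
    line.each_byte.with_index do |byte, offset|
      next unless byte < 0x20 && !ALLOWED_CONTROL_BYTES.include?(byte)
      hits << [lineno, offset, format('0x%02X', byte)]
    end
  end
  hits
end

# In-memory demo; for a real file, use File.open('file.xml', 'rb') { |f| ... }.
sample = StringIO.new("<r>fine</r>\n<r>ba\x0Bd</r>\n")
find_control_chars(sample).each do |lineno, offset, hex|
  puts "line #{lineno}, byte #{offset}: #{hex}"
end
```

      Working on raw bytes rather than characters means the scan does not depend on the file being valid in any particular encoding.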

