
Good day,

I am using BeautifulSoup to parse an XML file that contains a tag like the sample below, which holds some binary data:

<data length=1234 encoding="x-modified">
:M\ANEG9&3I6%1I8CN!68<ID(E]*%N]Y/J;:6EYM6&N:9<E9).YA*I:94*]9O.[Y
R;59Z0LEWY;74*:E!5YWM8KE[AE;48:5N"I74*:H(E#L79X57ZG1'E:85=YVE68,
:3=5=:B&FVN-Y(EU;UJ:*28FSQ#F6,ID'V:EE-JVN=APE:9X&8EYFL<67TI$DBR0
........
</data>

The tag, its attributes, and the binary data are all read incorrectly, as shown below:

<data>1234 encoding="x-modified"&gt;
:M\ANEG93I6%1I8CN!68<ID>(E]*%N]Y/J;:6EYM6<E9>).YA*I:94*]9O.[Y
R;59Z0LEWY;74*:E!5YWM8KE[AE;48:5N"I74*:H(E#L79X57ZG1'E:85=YVE68,
:3=5=:B(EU;UJ:*28FSQ#F6,ID'V:EE-JVN=APE:9X8EYFL</E9></ID></data>

Note how the data is truncated when a '<' is encountered in the data. Also note that the attribute 'length' is removed when the tag is read.

Any ideas how I can work around this are appreciated.

Thank you.

skywalker

1 Answer

You describe this as an XML file, but it isn't.

The data is a complete mess (in XML, "<" isn't allowed in text nodes without escaping), and while BeautifulSoup is doing its best to create order out of chaos, it's not magic, and it's clearly failing on this sample.
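To make the escaping point concrete, here is a small sketch using only the Python standard library (not BeautifulSoup). The payload is the start of the first sample line from the question; the `length` attribute is left out because an unquoted attribute value is itself not well-formed XML. The variable names are illustrative only.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Start of the first payload line from the question; it contains raw '&' and '<'.
payload = r':M\ANEG9&3I6%1I8CN!68<ID(E]*%N]Y'

# Embedding the payload as-is produces a document that is not well-formed XML.
raw_doc = '<data encoding="x-modified">' + payload + '</data>'
try:
    ET.fromstring(raw_doc)
    raw_ok = True
except ET.ParseError:
    raw_ok = False

# Escaping '&', '<' and '>' first yields well-formed XML that round-trips.
escaped_doc = '<data encoding="x-modified">' + escape(payload) + '</data>'
round_tripped = ET.fromstring(escaped_doc).text

print(raw_ok)                    # whether the unescaped document parsed
print(round_tripped == payload)  # whether the escaped text came back unchanged
```

A producer that escaped the payload like this (or wrapped it in Base64) would emit a file that any XML parser could read.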

My recommendation would be to use standards such as XML or JSON for data transfer, instead of ill-defined approximations with no formal definition. You can't reliably parse data files unless you have a specification of the format.

Michael Kay
  • While I agree with you, it is an inappropriate answer. When a question is asked, an answer is sought, not a judgement as to the utility of needing the question answered. Ironically I have this same question to answer, and while pessimistic that there is an answer, I wouldn't need anyone to tell me that my desire to decode this file is the problem. I need to decode the file. I didn't generate it and it's a valid XML related format (bar the insertion of BLOBs as content here and there which aren't alas). – Bernd Wechner Oct 31 '22 at 02:13
  • It's true that the questioner asked for a workaround rather than a solution. Nevertheless, as a software engineer with 50 years' experience, my advice is always to fix the problem properly rather than to patch up a quick fix. If you've got a data feed that's supplying broken data you need to look around you and work out why. Sometimes it's possible to repair it on arrival, but that's an expensive and unreliable solution. – Michael Kay Oct 31 '22 at 16:28
  • I am working on reading a format that formally is an XML-like format, that includes BLOBs (Binary Large Objects, true binary data) as the content of certain tags. This does indeed confound reading them with any standard XML library. But as I have no control over that format definition, nor desire to request such (or that it change), it is a constraint of the problem. In general, someone asking a question may well have similar constraints. In fact, it would be my a priori assumption as it strikes me as fairly self-evident that if I could avoid the situation I'm in I would. For example. – Bernd Wechner Nov 01 '22 at 10:47
  • If you want to read a non-XML data format, then you will need to write a non-XML parser, and the first step in writing a parser for any data format is to define its grammar. Feel free to ask a new question asking for help in writing a parser for your proprietary data format, but don't expect help unless you can write a precise grammar for the format, preferably one that isn't ambiguous. – Michael Kay Nov 02 '22 at 16:19
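As a rough illustration of what "define the grammar, then write a parser" could look like for the sample in the question, here is a minimal Python sketch. The grammar is an assumption, not a specification of the real format: an opening `<data ...>` tag whose attribute values may be quoted or bare, followed by a payload that runs to the first literal `</data>`. The function name `extract_data_blocks` and both patterns are hypothetical. If the payload can itself contain `</data>`, this grammar is ambiguous and the sketch will cut the payload short.

```python
import re

# Assumed grammar: '<data' ATTRS '>' [newline] PAYLOAD [newline] '</data>'
DATA_BLOCK = re.compile(
    r'<data\b(?P<attrs>[^>]*)>\r?\n?(?P<payload>.*?)\r?\n?</data>',
    re.DOTALL,
)
# Attribute values may be quoted ("x-modified") or bare (1234).
ATTR = re.compile(r'([\w-]+)\s*=\s*(?:"([^"]*)"|([^\s">]+))')


def extract_data_blocks(text):
    """Return a list of (attributes, payload) pairs, one per <data> block."""
    blocks = []
    for m in DATA_BLOCK.finditer(text):
        attrs = {}
        for a in ATTR.finditer(m.group('attrs')):
            quoted, bare = a.group(2), a.group(3)
            attrs[a.group(1)] = quoted if quoted is not None else bare
        blocks.append((attrs, m.group('payload')))
    return blocks


# Shortened version of the sample from the question.
sample = r'''<data length=1234 encoding="x-modified">
:M\ANEG9&3I6%1I8CN!68<ID(E]*%N]Y/J;
R;59Z0LEWY;74*:E!5YWM8KE[AE;48:5N"I74*
</data>'''

for attrs, payload in extract_data_blocks(sample):
    print(attrs)
    print(len(payload))
```

One possible use is to cut these blocks out (or replace each payload with an escaped or Base64-encoded copy) before handing the remainder to a real XML parser. For payloads that are true binary rather than text, the same approach would need the file opened in binary mode and `bytes` patterns.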