Suppose you start with an XML file, which you parse to confirm that it is in fact an XML file. Life is good.

Then someone removes a > somewhere in the file, which makes it malformed XML from the parser's standpoint. As far as the parser is concerned, the file is no longer well-formed XML.

Is there a way to confirm that the file is in fact still an XML file, albeit a malformed one?

The question extends beyond XML (obviously). How can one arrive at the conclusion that a file is "probably of a certain type", as opposed to "I can't parse it, and therefore it is certainly not of that type"?

James Raitsev
  • You mean guessing the file type? Most types have a magic number; for text types you have to use metrics, for instance the frequency of correctly formed tags in an XML file. The exact numbers will depend on how broken the files can be. –  Mar 19 '12 at 14:22
  • Ultimately I need a reliable way to estimate the type of a file, regardless of whether it is malformed. For instance, I'd like to get a statement such as "I am 93% sure it's an XML file". Is a magic number the way to go? – James Raitsev Mar 19 '12 at 16:02
  • If you're trying to categorise text files (which malformed XML is), then to some extent it depends on what you expect. If you have to differentiate between XML and HTML, for instance, it gets quite tricky, although you could look at specific tag values. I think for text files you have to use a frequency analysis of some sort. Basically, use a regex to look for open and close tag-like structures and count them up. With more than a handful, it's almost certainly an SGML variant; if the tags are all lower case, it's almost certainly XML, since convention and case sensitivity mean that people use lower case. –  Mar 20 '12 at 12:26
  • At the more extreme end, you could just count up the < and > characters and compare their frequency to some sample data to decide whether it's probably an SGML variant or not. –  Mar 20 '12 at 12:27
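
The tag-counting idea from the comments above could be sketched roughly as follows in Python. This is only an illustration: the function name `xml_likeness`, the regex, and the 0.5 / 0.3 / 0.2 weights are all assumptions made up for the example, and the result is an uncalibrated score rather than a true probability. To get something like "93% sure", the weights would have to be tuned against sample data of known types.

```python
import re

# Tag-like structures: <name>, </name>, <name attr="x">, <name/>.
# Hypothetical pattern for illustration; it is not a full XML grammar.
TAG_RE = re.compile(r"</?[A-Za-z_][\w:.-]*(?:\s[^<>]*)?/?>")


def xml_likeness(text: str) -> float:
    """Return a rough score in [0, 1]; higher means more XML-like."""
    opens = text.count("<")
    closes = text.count(">")
    if opens == 0 and closes == 0:
        return 0.0

    # Fraction of '<' characters that start a complete tag-like structure.
    tags = len(TAG_RE.findall(text))
    tag_ratio = min(tags / max(opens, 1), 1.0)

    # How evenly '<' and '>' are balanced (1.0 means equal counts).
    balance = min(opens, closes) / max(opens, closes)

    # An XML declaration at the start plays the role of a magic number.
    has_decl = text.lstrip("\ufeff \t\r\n").startswith("<?xml")

    # Arbitrary weights (assumption); calibrate against real samples.
    score = 0.5 * tag_ratio + 0.3 * balance + (0.2 if has_decl else 0.0)
    return min(score, 1.0)


if __name__ == "__main__":
    good = '<?xml version="1.0"?><root><item id="1">a</item></root>'
    broken = good.replace('id="1">', 'id="1"')  # drop a single '>'
    for label, sample in (("good", good), ("broken", broken)):
        print(label, round(xml_likeness(sample), 2))
```

With a single > removed, the score drops but stays well above that of plain prose, which is the "probably still XML" behaviour asked about in the question.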

0 Answers