I have two XML files: one is compressed with LZW, the other is plain text. How can I tell whether a given file is compressed or not?
-
If you know it's XML, you could search for XML grammar clues in the first few bytes, and then validate the syntax. If no clues are found, you can check for clues of an LZW-compressed string, and in the general case just try to uncompress and validate the XML again. – Dmitry Ledentsov Jun 24 '14 at 09:10
-
Does the `file` command know what you've got in those compressed files? It's usually pretty good at sniffing things out and might provide a clue you can use. What LZW compression are you using? Does it have any framing or headers like `gzip` does? – tadman Jun 24 '14 at 09:11
-
@tadman That's a good point. Most compression programs will insert a magic header in the first couple of bytes, identifying the file as a compressed file. – James Kanze Jun 24 '14 at 09:29
4 Answers
The obvious thing to do would of course be to feed the string to an LZW decompressor and see if there is an error and/or the length of the string increases by approximately 200%.
That aside, a (well-formed) LZW string or file starts with the magic value `0x1F 0x9D`. Of course it is possible to LZW compress a string and not include the magic value, but it is a start (very easy to check).
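A minimal sketch of that magic-value check in Python. The names `LZW_MAGIC` and `has_compress_magic` are made up for illustration. The two bytes are the ones written by the Unix `compress` tool (`.Z` files); a raw LZW stream produced some other way may not carry them, so a negative result proves nothing on its own.

```python
# Magic value named in the answer; it is what compress(1) writes at the start of a .Z file.
LZW_MAGIC = b"\x1f\x9d"


def has_compress_magic(data: bytes) -> bool:
    """Return True if the data begins with the two magic bytes 0x1F 0x9D."""
    return data[:2] == LZW_MAGIC
```

Reading only the first two bytes of the file is enough to feed this function.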
A (well-formed) XML document should start with an XML declaration and must start with an element, only optionally preceded by whitespace. XML declarations start with the string `<?xml` and element tags must start with a letter.
Therefore, if you see anything but whitespace before encountering the first `<`, or if the next character that follows is not either `?` or a letter (and only letters and numbers follow before encountering a `>`), then the string cannot be XML. Since you know that the string is either XML or compressed XML, it must therefore be compressed. It's probably easy enough for someone with a little regex practice to squeeze that in a 10-15 character pattern.
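One possible regex for the rule described above, as a hedged Python sketch. The names `XML_START` and `looks_like_xml` are hypothetical. It deliberately simplifies: it only checks the character right after the first `<` and does not scan up to the closing `>`. It also tolerates a leading UTF-8 byte order mark, which is an addition to the rule as stated.

```python
import re

# Optional UTF-8 BOM, optional whitespace, then "<" followed by "?" or an ASCII letter.
XML_START = re.compile(rb"\A(?:\xef\xbb\xbf)?[ \t\r\n]*<[?A-Za-z]")


def looks_like_xml(data: bytes) -> bool:
    """Heuristic: True if the first bytes match the start of an XML document."""
    return XML_START.match(data) is not None
```

Note that the rule as stated rejects documents that open with a comment or a DOCTYPE (both start with `<!`), or with an element name beginning with an underscore; widening the character class to `[?!_A-Za-z]` would cover those cases if they can occur in your data.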

-
-
@Yakk: There might be, yes. Not like that's strictly "correct", but it's "allowable" and something you might encounter if a Windows editor was involved writing that XML. A BOM would, like ``, signify a non-compressed string (and seeing how they're invalid characters, count as "whitespace" in the wider sense). – Damon Jun 24 '14 at 12:03
Stupid simple test: Is the first character a `<`?

-
Funny enough, the question comments provide some answers and this answer is more a comment. I think that unless you're sure about the header format of the compressed file (or corrupt file?), you could have a 1/256 chance the first character of a compressed file is a `<`. – stefaanv Jun 24 '14 at 09:20
-
LZW starts out with the first 256 "codepoints" representing themselves, and with no other code points, so the first character in the compressed file will always be the same as the first character in the original file. The second will almost certainly be different, however, and the more you read, the more certain you are to find differences (and even non-ASCII chars or illegal UTF-8). – James Kanze Jun 24 '14 at 09:28
-
And of course, if the file is uncompressed, it could start with a BOM. – James Kanze Jun 24 '14 at 09:30
-
This was more an "if all else fails" solution, but I like Damon's more elegant approach. – tadman Jun 24 '14 at 15:17
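The point made in the comments about LZW's initial table can be illustrated with a textbook LZW encoder. This is a sketch only: `lzw_codes` is a hypothetical helper that returns the raw code numbers, with no header and no bit packing, so it does not produce the byte layout of any real file format.

```python
def lzw_codes(data: bytes) -> list[int]:
    """Textbook LZW: return the sequence of output codes (unpacked, no header)."""
    # Codes 0..255 represent the single bytes themselves.
    table = {bytes([i]): i for i in range(256)}
    next_code = 256
    current = b""
    codes = []
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in table:
            current = candidate          # keep extending the current match
        else:
            codes.append(table[current])  # emit the longest known prefix
            table[candidate] = next_code  # learn the new sequence
            next_code += 1
            current = bytes([byte])
    if current:
        codes.append(table[current])
    return codes


codes = lzw_codes(b"<a><a><a>")
print(codes[0] == ord("<"))  # the first code is the first input byte
```

In a real stream the codes are wider than 8 bits and packed together, and a format such as `.Z` puts a header in front, so only the very first data byte can be expected to line up with the original text.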
Look for invalid or nonsense characters (like the null character). If they exist, then it's compressed.
If not, then either it's regular XML, or the file is extremely small (otherwise this would be highly unlikely).
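A rough sketch of this heuristic in Python, assuming the uncompressed XML is ASCII or UTF-8 (UTF-16 text legitimately contains null bytes and would be misclassified). The names `ALLOWED_CONTROL` and `has_nonsense_bytes` are made up for illustration.

```python
# Control bytes that may legitimately appear in XML 1.0 text: tab, LF, CR.
ALLOWED_CONTROL = {0x09, 0x0A, 0x0D}


def has_nonsense_bytes(data: bytes) -> bool:
    """Heuristic: True if data holds other control bytes or is not valid UTF-8."""
    if any(b < 0x20 and b not in ALLOWED_CONTROL for b in data):
        return True
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False
```

A `True` result suggests compressed data; a `False` result on a very short input is inconclusive, as noted above.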

This will help if you want to know whether the file is compressed so that you can decompress it, and you are willing to use libraries for the heavy lifting:
Use the compression library to always try to decompress the file, and let it decide whether the file was compressed. After that, pass the result to the XML library and let that library decide whether you have a valid and expected XML file. If possible, don't recreate the functionality of common libraries; just make sure to act properly on the information the libraries return.
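A hedged sketch of that flow in Python. The `decompress` argument is a placeholder for whatever LZW library you actually use (the Python standard library does not ship one), and `load_xml` is a hypothetical name. The XML side uses the standard `xml.etree.ElementTree` parser, which only checks well-formedness, not whether the content is what you expect.

```python
import xml.etree.ElementTree as ET
from typing import Callable


def load_xml(data: bytes, decompress: Callable[[bytes], bytes]) -> ET.Element:
    """Try the decompressed bytes first, then the raw bytes; return the XML root."""
    candidates = []
    try:
        # `decompress` is an assumed callable supplied by the caller's LZW library.
        candidates.append(decompress(data))
    except Exception:
        pass  # the library rejected the input, so it is probably not compressed
    candidates.append(data)

    last_error = None
    for payload in candidates:
        try:
            return ET.fromstring(payload)
        except ET.ParseError as exc:
            last_error = exc
    raise last_error  # neither candidate was well-formed XML
```

Trying the raw bytes as a fallback also covers a decompressor that accepts plain text without complaint and returns garbage.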
