I am currently generating an .xls file in Java using HSSFWorkbook (Apache POI), which a downstream system parses to read some data. Recently, parsing the file started failing with an exception about missing expected values, even though neither I nor the downstream team made any changes.
What's strange is that if the file is opened and re-saved, without any changes being made, it is parsed correctly. We also see the file size change from about 6 KB to 26 KB when the save occurs.
Is this change in file size expected?
I'm guessing it could have something to do with Excel adding extra blank cells or whitespace that aren't included when the file is originally built, but I'm not sure what's going on. I don't have access to the downstream parser, so I can't tell exactly what is happening there.
I've tried comparing the two .xls files on Linux using the cmp command, but haven't come up with any useful findings. I have some sample files that show the behaviour, but I can't attach them here and I can't access any file-sharing websites (they are blocked).
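In case it helps anyone reproduce the comparison without cmp, here is a minimal Java sketch of the same byte-by-byte check. The class name XlsByteDiff is made up for illustration. It prints one line per differing byte: the 1-based offset in hex, then the byte from each file. It compares only the length the two files have in common and reports a size mismatch separately.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Apache POI.
class XlsByteDiff {

    /** One line per differing byte: 1-based hex offset, byte from a, byte from b. */
    static List<String> diff(byte[] a, byte[] b) {
        List<String> out = new ArrayList<>();
        int common = Math.min(a.length, b.length);
        for (int i = 0; i < common; i++) {
            if (a[i] != b[i]) {
                // i + 1 so the offsets match cmp, which numbers bytes from 1
                out.add(String.format("%08X %02X %02X", i + 1, a[i] & 0xFF, b[i] & 0xFF));
            }
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        byte[] a = Files.readAllBytes(Paths.get(args[0]));
        byte[] b = Files.readAllBytes(Paths.get(args[1]));
        diff(a, b).forEach(System.out::println);
        if (a.length != b.length) {
            System.out.printf("sizes differ: %d vs %d bytes%n", a.length, b.length);
        }
    }
}
```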
Are there any tools I can use to run a better comparison myself, and what should I be looking for (special characters, etc.) that might cause this issue?
When I run the following command on Linux to analyze differences:
cmp -l file1.xls file2.xls | gawk '{printf "%08X %02X %02X\n", $1, strtonum(0$2), strtonum(0$3)}' > analysis.txt
This is the start of the output:
00000019 3B 3E
00000031 00 32
0000003D 09 FE
0000003E 00 FF
0000003F 00 FF
00000040 00 FF
00000041 01 00
0000004D 0A 31
00000201 52 09
00000202 00 08
00000203 6F 10
00000205 6F 00
00000206 00 06
00000207 74 05
00000209 20 54
0000020A 00 38
0000020B 45 CD
0000020C 00 07
0000020D 6E C9
0000020E 00 C0
0000020F 74 01
00000211 72 06
00000212 00 07
00000213 79 00
00000215 00 E1
00000217 00 02
00000219 00 B0
0000021A 00 04
0000021B 00 C1
0000021D 00 02
00000221 00 E2
The format is "byteLocationOfDiff byteFromFirstFile byteFromSecondFile", all in hexadecimal. The locations are 1-based, because cmp numbers bytes starting from 1. There are many more lines after these, but I think it's probably better to focus on the first differences that occur.
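Since the first differences all sit within the first 512 bytes, one way to make them readable is to decode that region as a header instead of looking at raw bytes. Below is a rough sketch of that idea. It assumes both files use the OLE2 compound file layout that .xls files normally have, with field offsets taken from the published compound file format (MS-CFB). The class name CompoundHeaderDump and the field labels are my own. Offsets in the code are 0-based, so they are one less than the cmp locations.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; assumes the input is an OLE2 compound file.
class CompoundHeaderDump {

    // Unsigned little-endian reads at a 0-based offset.
    static long u16(ByteBuffer buf, int offset) { return buf.getShort(offset) & 0xFFFFL; }
    static long u32(ByteBuffer buf, int offset) { return buf.getInt(offset) & 0xFFFFFFFFL; }

    /** Extracts a few header fields, keyed by an illustrative label. */
    static Map<String, Long> readHeader(byte[] file) {
        if (file.length < 512) {
            throw new IllegalArgumentException("too short to contain a compound file header");
        }
        ByteBuffer buf = ByteBuffer.wrap(file).order(ByteOrder.LITTLE_ENDIAN);
        Map<String, Long> fields = new LinkedHashMap<>();
        fields.put("minorVersion", u16(buf, 0x18));
        fields.put("majorVersion", u16(buf, 0x1A));
        fields.put("firstDirectorySector", u32(buf, 0x30));
        fields.put("miniStreamCutoff", u32(buf, 0x38));
        // 0xFFFFFFFE here is the format's end-of-chain marker (no mini FAT)
        fields.put("firstMiniFatSector", u32(buf, 0x3C));
        fields.put("miniFatSectorCount", u32(buf, 0x40));
        fields.put("firstFatSector", u32(buf, 0x4C)); // first DIFAT entry
        return fields;
    }

    public static void main(String[] args) throws IOException {
        for (String name : args) {
            System.out.println(name);
            readHeader(Files.readAllBytes(Paths.get(name)))
                .forEach((k, v) -> System.out.printf("  %-22s 0x%08X%n", k, v));
        }
    }
}
```

With Java 11 or later this can be run directly as `java CompoundHeaderDump.java file1.xls file2.xls`, which prints the same fields for both files so they can be compared side by side.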