I am currently generating an .xls file using Apache POI's HSSFWorkbook in Java, which a downstream system parses to read some data. Recently, parsing the file started failing with an exception about missing expected values, even though neither I nor the downstream system's owners made any changes.

What's strange is that if the file is opened and re-saved, without any changes being made, it is parsed correctly. We also see the file size change from about 6 KB to 26 KB when the save occurs.

Is this change in file size expected?

I'm guessing it could have something to do with Excel adding extra blank cells or whitespace that aren't included when the file is originally built, but I'm not sure. I don't have access to the downstream parser, so I can't tell exactly what is happening there.

I've tried comparing the two .xls files on Linux using the cmp command, but haven't come up with any useful findings. I have some sample files that show this behaviour, but I can't attach them here and I can't access any file sharing websites (blocked).

Are there any tools I can use to run a better comparison myself, and what should I be looking for (special characters, etc.) that might cause this issue?
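One structural comparison that may be more telling than raw bytes: an .xls file written by HSSFWorkbook is an OLE2 compound file, and its first 512 bytes are a fixed-layout header. The following is a minimal sketch, assuming the standard compound file header offsets, that prints a few of those fields so the two files can be compared field by field. The class name CfbHeaderFields and the field labels are hypothetical, chosen for illustration only.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

public class CfbHeaderFields {

    /** Decodes selected little-endian fields from an OLE2 compound file header. */
    public static Map<String, Long> decode(byte[] file) {
        if (file.length < 0x4C) {
            throw new IllegalArgumentException("too short to contain a compound file header");
        }
        ByteBuffer buf = ByteBuffer.wrap(file).order(ByteOrder.LITTLE_ENDIAN);
        Map<String, Long> fields = new LinkedHashMap<>();
        // 16-bit fields
        fields.put("minorVersion", (long) (buf.getShort(0x18) & 0xFFFF));
        fields.put("majorVersion", (long) (buf.getShort(0x1A) & 0xFFFF));
        fields.put("sectorShift", (long) (buf.getShort(0x1E) & 0xFFFF));
        // 32-bit fields, read as unsigned
        fields.put("fatSectorCount", buf.getInt(0x2C) & 0xFFFFFFFFL);
        fields.put("firstDirectorySector", buf.getInt(0x30) & 0xFFFFFFFFL);
        fields.put("miniStreamCutoff", buf.getInt(0x38) & 0xFFFFFFFFL);
        fields.put("firstMiniFatSector", buf.getInt(0x3C) & 0xFFFFFFFFL);
        fields.put("miniFatSectorCount", buf.getInt(0x40) & 0xFFFFFFFFL);
        return fields;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java CfbHeaderFields original.xls resaved.xls
        for (String name : args) {
            System.out.println(name);
            decode(Files.readAllBytes(Paths.get(name)))
                .forEach((k, v) -> System.out.printf("  %-22s 0x%08X%n", k, v));
        }
    }
}
```

A firstMiniFatSector value of 0xFFFFFFFE means the file has no mini FAT. Whether any difference in these fields matters to the downstream parser is not something this sketch can establish; it only makes the container-level differences readable.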

When I run the following command on Linux to analyze differences:

cmp -l file1.xls file2.xls | gawk '{printf "%08X %02X %02X\n", $1, strtonum(0$2), strtonum(0$3)}' > analysis.txt

This is the start of the output:

 00000019 3B 3E  
 00000031 00 32  
 0000003D 09 FE  
 0000003E 00 FF  
 0000003F 00 FF  
 00000040 00 FF  
 00000041 01 00  
 0000004D 0A 31  
 00000201 52 09  
 00000202 00 08  
 00000203 6F 10  
 00000205 6F 00  
 00000206 00 06  
 00000207 74 05  
 00000209 20 54  
 0000020A 00 38  
 0000020B 45 CD  
 0000020C 00 07  
 0000020D 6E C9  
 0000020E 00 C0  
 0000020F 74 01  
 00000211 72 06  
 00000212 00 07  
 00000213 79 00  
 00000215 00 E1  
 00000217 00 02  
 00000219 00 B0  
 0000021A 00 04  
 0000021B 00 C1  
 0000021D 00 02  
 00000221 00 E2 

The format is "byteLocationOfDiff byteFromFirstFile byteFromSecondFile". There are many more lines after these, but I think it's probably better to focus on the first differences that occur.
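For anyone without gawk, the same listing can be produced in plain Java. This is a minimal sketch, not the command used above; the class name ByteDiff is hypothetical. It keeps the 1-based offsets that cmp -l reports, so subtract one to get a zero-based file offset.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class ByteDiff {

    /** Lists differing bytes as "OFFSET FIRST SECOND" in hex, offsets 1-based like cmp -l. */
    public static List<String> diff(byte[] a, byte[] b) {
        List<String> lines = new ArrayList<>();
        // Only the common prefix is compared; a length difference is not reported here.
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            if (a[i] != b[i]) {
                lines.add(String.format("%08X %02X %02X", i + 1, a[i] & 0xFF, b[i] & 0xFF));
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java ByteDiff file1.xls file2.xls
        byte[] first = Files.readAllBytes(Paths.get(args[0]));
        byte[] second = Files.readAllBytes(Paths.get(args[1]));
        diff(first, second).forEach(System.out::println);
    }
}
```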

  • Do your xls files have VBA code in them? If so, it might help to open them while blocking VBA execution, and then save the file once again. – Bas Verlaat Sep 22 '15 at 15:16
  • Nope, no vba code. It's not even a complicated spreadsheet, about 6 rows and 15 columns of String values. But when I open and re-save the file size quadruples. Could it be that the cell values that are numbers (originally inputted as Strings), are being converted to number format, and therefore are being given more byte space to store? – dfader2 Sep 22 '15 at 15:35
  • If you think it may be an issue with blank spaces, you could try pressing Ctrl+End on one of the 'grown' files - that will at least let you know if the bottom right is out of the expected area. – Trum Sep 22 '15 at 15:55
  • Yep, tried that. They both have the same end cell, which is exactly at the end of the data. I edited the original post to include some data from the comparison I ran. – dfader2 Sep 22 '15 at 16:08
  • As far as I can tell, no. – dfader2 Sep 22 '15 at 18:34
