2

I have the issue that Apache POI "corrupted" a xlsm / xlsx file by just reading and writing it (e.g. with the following code)

public class Snippet {
    public static void main(String[] args) throws Exception {

        String str1 = "c:/tmp/spreadsheet.xlsm";
        String str2 = "c:/tmp/spreadsheet_poi.xlsm";

        // open file
        XSSFWorkbook wb = new XSSFWorkbook(new FileInputStream(new File(str1)));

        // save file
        FileOutputStream out = new FileOutputStream(str2);
        wb.write(out);
        wb.close();
        out.close();

    }
}

Once you open the spreadsheet_poi.xlsm in Excel you'll get an error like the following

"We found a problem with some content in xxx. Do you want us to try to recover as much as we can..."?

If you say yes you'll end up with a log which could look like this:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<recoveryLog xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
    <logFileName>error145040_01.xml</logFileName>
    <summary>Errors were detected in file 'C:\tmp\spreadsheet_poi.xlsm'</summary>
    <repairedParts>
        <repairedPart>Repaired Part: /xl/worksheets/sheet4.xml part with XML error.  Load error. Line 2, column 0.</repairedPart>
        <repairedPart>Repaired Part: /xl/worksheets/sheet5.xml part with XML error.  Load error. Line 2, column 0.</repairedPart>
        <repairedPart>Repaired Part: /xl/worksheets/sheet8.xml part with XML error.  Load error. Line 2, column 0.</repairedPart>
    </repairedParts>
</recoveryLog>

Whats the best approach to debug the issue in more detail (e.g. find out what makes poi to "corrupt" the file?

Dominique
  • 16,450
  • 15
  • 56
  • 112
Kai
  • 700
  • 7
  • 33
  • 1
    I had a similar problem and got it solved by explicitly closing the `FileInputStream` before creating a `FileOutputStream` and writing to the workbook. – deHaar Jan 07 '19 at 15:32
  • Thanks for the suggestion but that did not help in my case. I've outlined the solution below. I just wanted to raise this question in order to document the steps to debug it (as there could be various reasons why this happens). Recently we had issues because the Name Manager had references to no longer existing files. – Kai Jan 07 '19 at 16:04
  • 1
    The `apache poi` needs libraries for creating `XML`. My suspicion is that there somehow other default libraries to create XML are used because somewhat overwrites the defaults. See https://stackoverflow.com/questions/51151125/xssfworkbook-when-written-creates-a-corrupted-xlsx-document-in-spring-boot-appl for example. – Axel Richter Jan 07 '19 at 16:53

2 Answers2

3

Eventually I found how that the best approach for debugging this are two things

  1. open the affected workbook (e.g. with 7zip and format the affected sheets with an xml editor (e.g. Notepad++ > Plugins > XML Tools > Pretty print (XML only - with line breaks). After saving the files and updating the xlsm file you'll get the "real" line numbers in the Excel error log. Alternative option (which I haven't tried but should work according to the POI mailing liste: use OOXMLPrettyPrint (https://svn.apache.org/repos/asf/poi/trunk/src/ooxml/java/org/apache/poi/ooxml/dev/) to format the file and then reopen it it in excel.
  2. if the real line numbers not already help compare the sheet xml files of the original xlsx file and the one saved by poi. You'll notice that there are differences in regards to the attributes and also the order is different. In order to properly compare I used Beyond Compare with "Additional File Formats" (see https://weblogs.asp.net/lorenh/comparing-xml-files-with-beyond-compare-3-brilliant for more information). Maybe there is another diff tool that is equally good.

In my case the problem was that poi somehow changed the dimension setting from

<dimension ref="A1:XFD147"/>

to

<dimension ref="A1:XFE147"/>

(with XFE beeing a non existing column). I fixed it by removing those many empty columns in the original xlsx file.

Kai
  • 700
  • 7
  • 33
0

My professor said: "How does the mathematician find the lion in the desert?" - "First cuts the desert into two halves, finds out where is the lion, then repeats it until the lion is caught".

So, try to remove features from the Excel files, try different versions, until you find the root cause. There may be multiple causes, though.

gaborsch
  • 15,408
  • 6
  • 37
  • 48