1

I understand what the error means, that my program is consuming too much memory and for a long period of the time it is not recovering.

My program is just reading 6,2Mb xlsx file when the memory issue occures.

When I try to monitor the program, it very quickly reaches 1,2Gb in memory consumption and then it crashes. How can it reach 1,2Gb when reading 6,2Mb file?

Is there a way to open the file in chunks? So that it doesn't have to be loaded to the memory? Or any other solution?

Exactly this part causes it. But since it is a library, shouldn't it be handled somehow smartly? It is only 200 000 rows with only 3 columns. For future, I need it to work with approx. 1 mil records and more columns...

CODE:

  Workbook myWorkBook;
        Sheet mySheet;
        if (filePath.contains(".xlsx")) {
            // Finds the workbook instance for XLSX file
             myWorkBook = new XSSFWorkbook(fis);
            // Return first sheet from the XLSX workbook
             mySheet = myWorkBook.getSheetAt(0);
             myWorkBook.close(); // Should I close myWorkBook before I get data from it?
        } 
Ondrej Tokar
  • 4,898
  • 8
  • 53
  • 103
  • How much memory are you giving the program? How many rows is in the Excel file? It's a horrible format and even the actual Excel application chokes easily when you've got enough rows. The library can't outsmart the stupidity of the file format. – Kayaman Aug 07 '15 at 09:14
  • Sounds like one huge Excel file. It's not like you're loading 6.2 MB of flat text data. – Marko Topolnik Aug 07 '15 at 09:15
  • It is around 200k rows, but I need it to work with around 1mil. rows... – Ondrej Tokar Aug 07 '15 at 09:17
  • 1
    Excel was never meant to be used for those sizes. Your project has ill-conceived requirements. – Marko Topolnik Aug 07 '15 at 09:27
  • I would disregard the advice given by Kayaman. I tried to follow this advice, but it seems wrong. Especially "Since the data is XML, you can use StAX to effectively process the contents." I started with the assumption that Microsoft would put all the data into one file, sort of like what happens in a CSV file, but instead, Microsoft gives us a file with values that point at a different file that contains strings. How to match them up? That is what the Apache classes are for. How can one scan one file to find something in the other file without holding all of one file in memory? – LRK9 Dec 10 '16 at 19:35
  • I think this might be a more useful answer: http://stackoverflow.com/questions/11118703/java-lang-outofmemory-exception-while-reading-excel-file-xlsx-using-poi . – LRK9 Dec 20 '16 at 21:41
  • As the answer I linked to says, just getting rid of this one line can help a lot: fis = new FileInputStream(savedFile); – LRK9 Dec 20 '16 at 21:42
  • When I used a File instead of a FileInputStream, I no longer faced the OutOfMemoryError. – LRK9 Dec 20 '16 at 21:42

1 Answers1

4

If you wish to work with large XLSX files, you need to use the streaming XSSFReader class. Since the data is XML, you can use StAX to effectively process the contents.

Here's (one way) how to get the Inputstream from the xlsx.

OPCPackage opc = OPCPackage.open(file);
XSSFReader xssfReader = new XSSFReader(opc);
SharedStringsTable sst = xssfReader.getSharedStringsTable();
XSSFReader.SheetIterator itr = (XSSFReader.SheetIterator)xssfReader.getSheetsData();
while(itr.hasNext()) {
    InputStream sheetStream = itr.next();
    if(itr.getSheetName().equals(sheetName)) {  // Or you can keep track of sheet numbers
        in = sheetStream;
        return;
    } else {
        sheetStream.close();
    }
}

The elements are <row>, and <c> (for cell). You can create a small xlsx file, unzip it and examine the XML inside for more information.

Edit: There are some examples on processing the data with SAX, but using StAX is a lot nicer and just as efficient.

ChrisF
  • 134,786
  • 31
  • 255
  • 325
Kayaman
  • 72,141
  • 5
  • 83
  • 121
  • thank you. So when I get the InputStream, I need to create my own xlsx chunks from it? – Ondrej Tokar Aug 07 '15 at 09:46
  • I mean, what do I do with the sheetStream after that? Do I have the data which is in the file? I've never read xlsx file with an InputStream so I have no idea what to expect. – Ondrej Tokar Aug 07 '15 at 09:55
  • Ok so I get XML, I just don't understand the part with the unzip it... but ok, it's just my lack of knowledge. Thank you for your answer. – Ondrej Tokar Aug 07 '15 at 09:59
  • How could I know that XLSX is a zip file? Are all not flat files zip? – Ondrej Tokar Aug 07 '15 at 10:15
  • I meant are all non flat files zips?. – Ondrej Tokar Aug 07 '15 at 10:40
  • Thanks a lot, aren't you also providing an education lessons? – Ondrej Tokar Aug 07 '15 at 10:58
  • Apache POI has some examples for using SAX to parse XLSX files - https://svn.apache.org/repos/asf/poi/trunk/src/examples/src/org/apache/poi/xssf/eventusermodel - you might want to update your answer with pointers to that to help – Gagravarr Aug 07 '15 at 12:12
  • I tried to follow this advice, but it seems wrong. Especially "Since the data is XML, you can use StAX to effectively process the contents." Processing an XML file is easy but what do you get? I started with the assumption that Microsoft would put all the data into one file, sort of like what happens in a CSV file, but instead, Microsoft gives us a file with values that point at a different file that contains strings. How to match them up? That is what the Apache classes are for. How can one scan one file to find something in the other file without holding all of one file in memory? – LRK9 Dec 10 '16 at 19:34
  • Kayaman, what is the point of using XSSFWorkbook or XSSFReader except that we are trying to get data from an Excel spreadsheet which is in XML format? Unless I misread you, your advice amounts to "Just treat this like a generic XML file." But that doesn't help. Merely iterating over XML in a lazy way is easy -- getting any useful data out of the XML is the hard part. The POI libraries understand the structure of the Excel files, that is why we use them. Using generic Stax parsers means we lose the relationships that the POI code tracks for us. – LRK9 Dec 20 '16 at 21:40
  • Kayaman, I believe this is what causes the OutOfMemoryError: "since even with a File there will be a FileInputStream internally." The XSSFReader seems to do poorly when given a FileInputStream, exactly because it expects a File and converts that internally to a FileInputStream. The external FileInputStream, that I created, is what caused the OutOfMemoryError in my code. When I got rid of the FileInputStream and simply used a File, then the OutOfMemoryError disappeared and my code was suddenly able to build and consume very large Excel files. – LRK9 Dec 21 '16 at 22:58
  • Everyone should read this answer here: http://stackoverflow.com/questions/11118703/java-lang-outofmemory-exception-while-reading-excel-file-xlsx-using-poi – LRK9 Dec 21 '16 at 23:00
  • At the very least, people should try that solution first, before trying anything else. It has the virtue of simplifying one's code. – LRK9 Dec 21 '16 at 23:00