3

I have a list of strings read from MongoDB (~200k lines), and I want to write it to an Excel file with Java code:

import java.io.FileOutputStream;
import java.io.IOException;
import java.util.List;

import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.xssf.usermodel.XSSFSheet;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

public class OutputToExcelUtils {

    private static XSSFWorkbook workbook;
    private static final String DATA_SEPARATOR = "!";

    public static void clusterOutToExcel(List<String> data, String outputPath) {

        workbook = new XSSFWorkbook();
        FileOutputStream outputStream = null;

        writeData(data, "Data");


        try {
            outputStream = new FileOutputStream(outputPath);            
            workbook.write(outputStream);
            workbook.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static void writeData(List<String> data, String sheetName) {

        int rowNum = 0;
        XSSFSheet sheet = workbook.createSheet(sheetName);


        for (int i = 0; i < data.size(); i++) {
            System.out.println(sheetName + " Processing line: " + i);
            int colNum = 0;
            // Split the line into cell values
            String[] valuesOfLine = data.get(i).split(DATA_SEPARATOR);

            Row row = sheet.createRow(rowNum++);

            for (String valueOfCell : valuesOfLine) {
                Cell cell = row.createCell(colNum++);
                cell.setCellValue(valueOfCell);
            }
        }
    }

}

Then I get an error:

Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
    at org.apache.xmlbeans.impl.store.Cur$Locations.<init>(Cur.java:497)
    at org.apache.xmlbeans.impl.store.Locale.<init>(Locale.java:168)
    at org.apache.xmlbeans.impl.store.Locale.getLocale(Locale.java:242)
    at org.apache.xmlbeans.impl.store.Locale.newInstance(Locale.java:593)
    at org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.newInstance(SchemaTypeLoaderBase.java:198)
    at org.apache.poi.POIXMLTypeLoader.newInstance(POIXMLTypeLoader.java:132)
    at org.openxmlformats.schemas.spreadsheetml.x2006.main.CTRst$Factory.newInstance(Unknown Source)
    at org.apache.poi.xssf.usermodel.XSSFRichTextString.<init>(XSSFRichTextString.java:87)
    at org.apache.poi.xssf.usermodel.XSSFCell.setCellValue(XSSFCell.java:417)
    at ups.mongo.excelutil.OutputToExcelUtils.writeData(OutputToExcelUtils.java:80)
    at ups.mongo.excelutil.OutputToExcelUtils.clusterOutToExcel(OutputToExcelUtils.java:30)
    at ups.mongodb.App.main(App.java:74)

Could you give me some advice on this?

Thank you.

Updated solution: using SXSSFWorkbook instead of XSSFWorkbook

import java.io.FileOutputStream;
import java.io.IOException;
import java.util.List;

import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.xssf.streaming.SXSSFSheet;
import org.apache.poi.xssf.streaming.SXSSFWorkbook;

public class OutputToExcelUtils {

    private static SXSSFWorkbook workbook;
    private static final String DATA_SEPERATOR = "!";

    public static void clusterOutToExcel(ClusterOutput clusterObject, ClusterOutputTrade clusterOutputTrade,
            ClusterOutputDistance clusterOutputDistance, String outputPath) {

        workbook = new SXSSFWorkbook();
        workbook.setCompressTempFiles(true);
        FileOutputStream outputStream = null;

        writeData(clusterOutputTrade.getTrades(), "Data");

        try {
            outputStream = new FileOutputStream(outputPath);            
            workbook.write(outputStream);
            workbook.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static void writeData(List<String> data, String sheetName) {

        int rowNum = 0;
        SXSSFSheet sheet = workbook.createSheet(sheetName);
        sheet.setRandomAccessWindowSize(100); // keep 100 rows in memory; rows beyond that are flushed to a temp file once written

        for (int i = 0; i < data.size(); i++) {
            System.out.println(sheetName + " Processing line: " + i);
            int colNum = 0;
            // Split the line into cell values
            String[] valuesOfLine = data.get(i).split(DATA_SEPERATOR);

            Row row = sheet.createRow(rowNum++);

            for (String valueOfCell : valuesOfLine) {
                Cell cell = row.createCell(colNum++);
                cell.setCellValue(valueOfCell);
            }
        }
    }

}
Hana
  • If possible, don't keep all the data in memory at once. Otherwise allow your JVM to use more memory. – Henry Mar 02 '18 at 10:15
  • Do you need an Excel file, or can you generate a TSV/CSV file, which can be opened in most spreadsheets? – user902383 Mar 02 '18 at 10:19
  • Thanks Henry, your idea is the same as Alex's below. Because the Excel file has many sheets, it cannot be generated as a CSV file. – Hana Mar 02 '18 at 10:23
  • As mentioned in a comment on [this SO article](https://stackoverflow.com/questions/7274076/writing-a-large-resultset-to-an-excel-file-using-poi): "**SXSSF allocates temporary files that you must always clean up explicitly, by calling the dispose method.** Please see [SXSSF documentation.](http://poi.apache.org/components/spreadsheet/how-to.html#sxssf)" – KeaganFouche Apr 08 '19 at 12:37

3 Answers

4

Your application is spending too much time doing garbage collection. This doesn't necessarily mean that it is running out of heap space; however, it spends too much time in GC relative to performing actual work, so the Java runtime shuts it down.

Try to enable throughput collection with the following JVM option:

-XX:+UseParallelGC

While you're at it, give your application as much heap space as possible:

-Xmx????m

(where ???? stands for the maximum heap size in MB, e.g. -Xmx8192m; setting -Xms to the same value avoids heap resizing)

If this doesn't help, try to set a more lenient throughput goal with this option:

-XX:GCTimeRatio=19 

This specifies that your application should do 19 times more useful work than GC-related work, i.e. it allows the GC to consume up to 5% of the processor time. (I believe the stricter 1% default goal may be causing the above runtime error.)
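
Putting the three options together, the launch command might look like this (the jar name is a placeholder):

```shell
# Throughput (parallel) collector, 8 GiB maximum heap,
# and a GC time budget of ~5% of total processor time
java -XX:+UseParallelGC -Xmx8192m -XX:GCTimeRatio=19 -jar your-exporter.jar
```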

No guarantee that this will work. Can you check and post back so others who experience similar problems may benefit?

EDIT

Your root problem remains the fact that you need to hold the entire spreadsheet and all its related objects in memory while you are building it. Another solution would be to serialize the data, i.e. write the actual spreadsheet file instead of constructing it in memory and saving it at the end. However, this requires reading up on the XLSX format and creating a custom solution.

Another option would be looking for a less memory-intensive library (if one exists). Possible alternatives to POI are JExcelAPI (open source) and Aspose.Cells (commercial).

I used JExcelAPI years ago and had a positive experience (however, it appears to be much less actively maintained than POI, so it may no longer be the best choice).

EDIT 2

Looks like POI offers a streaming model (https://poi.apache.org/spreadsheet/how-to.html#sxssf), so this may be the best overall approach.
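
A minimal sketch of that streaming model (assumes poi-ooxml on the classpath; the window size of 100, the sheet name, and the output file name are arbitrary choices for illustration). Note the dispose() call, which deletes the temporary files SXSSF creates:

```java
import java.io.FileOutputStream;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.xssf.streaming.SXSSFWorkbook;

public class StreamingWriteSketch {
    public static void main(String[] args) throws Exception {
        // Keep only 100 rows in memory; older rows are flushed to a temp file
        SXSSFWorkbook workbook = new SXSSFWorkbook(100);
        try (FileOutputStream out = new FileOutputStream("out.xlsx")) {
            Sheet sheet = workbook.createSheet("Data");
            for (int r = 0; r < 200_000; r++) {
                Row row = sheet.createRow(r);
                row.createCell(0).setCellValue("value " + r);
            }
            workbook.write(out);
        } finally {
            workbook.dispose(); // clean up SXSSF's temporary files
            workbook.close();
        }
    }
}
```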

Tony the Pony
  • Thank you Tony, I will try this way. But it seems people say that increasing the heap space to solve this is not a good idea. – Hana Mar 02 '18 at 10:56
  • @Hana The problem is not holding the 200,000 lines of text in memory; the problem is the spreadsheet and all its related objects, which consumes anywhere from 10-100 times the amount of memory of the source data -- I'll amend my answer with some other ideas – Tony the Pony Mar 02 '18 at 11:02
  • Yes, I think we should flush data after it is written to Excel, so it is not kept in memory. You can take a look at this pic: https://i.stack.imgur.com/ZnYda.png Another version of XSSF, SXSSF, can work more efficiently. – Hana Mar 02 '18 at 11:10
  • You have +1 from me because you have nailed the actual reason for the GC overhead, which is the Excel library holding its state in memory. I think, though, that the answer is too big for the actual solution; I don't think the GC tuning options are necessary here. – Alexander Petrov Mar 02 '18 at 11:52
1

Well, try not to load all the data into memory. Even if the binary representation of 200k lines is not that big, the hydrated objects in memory may be much bigger. Just as a hint: if you have a POJO, each attribute in that POJO is a reference, and each reference takes 4 or 8 bytes depending on whether compressed oops are in use. This means that if your data is a POJO with 4 attributes, the references alone will cost 200,000 * 4 * 4 bytes (or 8 bytes each without compression).
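
A rough back-of-the-envelope sketch of that arithmetic (the 4-attribute POJO and 8-byte references are illustrative assumptions; real overhead also includes object headers and the field data itself):

```java
public class PointerOverheadEstimate {
    public static void main(String[] args) {
        long rows = 200_000;    // lines loaded from MongoDB
        long refsPerPojo = 4;   // hypothetical POJO with 4 reference fields
        long bytesPerRef = 8;   // uncompressed oops; 4 bytes with -XX:+UseCompressedOops

        // References alone, before counting headers or the referenced data
        long pointerBytes = rows * refsPerPojo * bytesPerRef;
        System.out.println("Reference overhead: " + pointerBytes + " bytes (~"
                + pointerBytes / (1024 * 1024) + " MiB)");
    }
}
```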

Theoretically you can increase the amount of memory for the JVM, but this is not a good solution; more precisely, it is not a good solution for a live system. For a non-interactive system it might be fine.

Hint: use the -Xmx and -Xms JVM arguments to control the heap size.
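
For example (the sizes are illustrative, not a recommendation):

```shell
# Start with a 2 GiB heap, allow it to grow to 4 GiB
java -Xms2g -Xmx4g -jar exporter.jar
```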

Alexander Petrov
  • Thank you Alex, I see there are 2 points in your answer. 1. I'm not using POJOs for the data; they're pure text loaded from MongoDB, so I do not control the attributes. 2. As you said, it's not a good idea to increase the JVM's memory. The system is going to be interactive for users, so it might not be a good solution. – Hana Mar 02 '18 at 10:19
  • Yeah, I am trying to explain why objects, once created on the heap, actually consume significantly more memory than they do on the hard disk, and why you should not try to solve this by just cranking up the heap. – Alexander Petrov Mar 02 '18 at 10:22
  • Thanks Alex, I get your point! I was trying to load small portions of data to avoid overloading the JVM, but it still doesn't work. – Hana Mar 02 '18 at 10:23
  • This is a good idea in general, but the OP's problem is caused by excessive GC, not necessarily insufficient memory. But -Xms is always a good idea for memory-intensive apps, so +1 – Tony the Pony Mar 02 '18 at 10:53
0

Instead of getting the entire list from the data, iterate line-wise. If that is too cumbersome, write the list to a file and reread it line by line, for instance as a Stream<String>:

  Path path = Files.createTempFile(...);
  Files.write(path, list, StandardCharsets.UTF_8);
  Files.lines(path, StandardCharsets.UTF_8)
     .forEach(line -> { ... });

On the Excel side: though xlsx uses shared strings, in case XSSF handles that carelessly, the following would make sure a single String instance is reused for repeated string values.

public class StringCache {
    private static final int MAX_LENGTH = 40;
    private Map<String, String> identityMap = new HashMap<>();

    public String cached(String s) {
         if (s == null) {
             return null;
         }
         if (s.length() > MAX_LENGTH) {
             return s;
         }
         String t = identityMap.get(s);
         if (t == null) {
             t = s;
             identityMap.put(t, t);
         }
         return t;
    }
}

StringCache strings = new StringCache();

       for (String valueOfCell : valuesOfLine) {
            Cell cell = row.createCell(colNum++);
            cell.setCellValue(strings.cached(valueOfCell));
       }
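
As a side note on why the cache helps: split() returns distinct String objects even when the values are equal, and the cache collapses them to one instance. A small demo (using computeIfAbsent in place of the explicit get/put above):

```java
import java.util.HashMap;
import java.util.Map;

public class SplitCacheDemo {
    public static void main(String[] args) {
        String[] parts = "a!a!a!a".split("!");
        // Four equal but distinct String instances come back from split()
        System.out.println(parts[0] == parts[1]);  // false

        Map<String, String> cache = new HashMap<>();
        for (int i = 0; i < parts.length; i++) {
            parts[i] = cache.computeIfAbsent(parts[i], k -> k);
        }
        // After caching, every slot references the same instance
        System.out.println(parts[0] == parts[1]);  // true
    }
}
```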
Joop Eggen
  • `new Map<>();`? – Eugene Mar 02 '18 at 11:04
  • `"a;a;a;a".split(";")` would create 4 String objects. My cache would reduce that to 1. However be warned, it does not need to work here, if the result is concatenated like `"" + a + ""`. – Joop Eggen Mar 02 '18 at 11:44