2

For a project, I convert Excel documents to JSON with in the Java application using Apache Poi. In the future, this task shall be done using AWS Lambda, because it currently it can take very long (up to 20sec) and has a high memory consumption.

Requirements:

  • Support XLS and XLSX format
  • Evaluate formulas (so streaming is not possible?)
  • up to 100.000 Rows but smaller than 1MB

With AWS Lamda I now can use, Java, Python or NodeJs. My question is: Is my Apache POI approach the way to go or are there more suitable frameworks? E.g. sheetjs seems to be a good candidate. I was not able to find a up to date performance comparison of such frameworks.

Sebastian
  • 424
  • 8
  • 25
  • 1
    How are you doing the conversion now? Only it sounds like you're doing it wrong.... Should be possible to convert quite quickly and low memory using Apache POI – Gagravarr Nov 16 '16 at 13:39
  • `for (int j = 0; j < r.getLastCellNum(); j++) { Cell cell = r.getCell(j, Row.MissingCellPolicy.RETURN_BLANK_AS_NULL); }` I iterate over all rows and cells and fill java beans on the fly. – Sebastian Nov 16 '16 at 13:45
  • 1
    That's the easy but high memory way. What happens if you use the [low level code](http://poi.apache.org/spreadsheet/how-to.html) instead? (Maybe copy the CSV examples) – Gagravarr Nov 16 '16 at 13:47
  • Do you mean the XSSF Event API? I did not use it, because I was under the impression it is for xlsx and does not support xls and also I need formula evaluation. – Sebastian Nov 16 '16 at 13:56
  • You'd need to write different code for XLS and XLSX. See the respective CSV converters for how – Gagravarr Nov 16 '16 at 14:52
  • This is an option if formulas are actually always evaluated and the result is available as you answered [here](http://stackoverflow.com/questions/5945959/how-to-evaluate-a-excel-formula-through-xssf-event-api) for xls and xlsx. I will give this a try! – Sebastian Nov 16 '16 at 15:12

1 Answers1

1

Give a shot to pyexcel_xlsx library in python. I have used this for converting xlsx to json. Sweet and simple one. And fast also as compared to other python libraries.

Sample code:

from pyexcel_xlsx import get_data;
import time;
import json;

data = get_data("RefinedProduct.xlsx")
sheetName = "Table 6b";

for i in range(0, len(data[sheetName])):
    for j in range(0, len(data[sheetName][i])):
        print("Row: " + str(i) + ", Column: " + str(j) + ", Value: "+ str(data[sheetName][i][j]));
Yash Mochi
  • 769
  • 6
  • 15