
Is PDI inefficient at writing Excel xlsx files with the Microsoft Excel Writer step?

An Excel file written by a Pentaho transformation seems to be about three times the size of the file I get when I transform the same data manually. Is this inefficiency expected, or is there a workaround for it?

A CSV file of the same transformed output is much smaller. Have I configured something wrong?

  • Can you please give specific examples? In my small test the xlsx file created with PDI was 40% of the size of a similar file created with Excel. – bolav Mar 01 '16 at 13:00
  • In a recent test case, the CSV output file was 5.7 MB, but the xlsx file from the Excel Writer was 8.9 MB. Normally an xlsx file should be considerably smaller than the CSV file. Is there any configuration we should check for the Microsoft Excel Writer in Pentaho? – user2033024 Mar 01 '16 at 13:36
  • Can you list the contents of the zip file, with file sizes and compression? – bolav Mar 01 '16 at 13:44
  • Folders: docProps, xl, _rels. File: [Content_Types.xml]. In the test you performed, did your file end up 40% larger or smaller? – user2033024 Mar 07 '16 at 12:12
  • Excel created a file that was 40% bigger than Pentaho. – bolav Mar 07 '16 at 13:03

1 Answer


xlsx files should normally be smaller than the equivalent CSV, since they consist of XML data compressed in a ZIP archive. Pentaho's Microsoft Excel Writer uses org.apache.poi.xssf.streaming.SXSSFWorkbook and org.apache.poi.xssf.usermodel.XSSFWorkbook to write xlsx files, and both create compressed files, so this should not be your issue.

To check, open the files with a zip utility and look at the size and compression ratio of each entry to see whether there is a bug. You could also open the file in Excel and re-save it; if that gives a smaller file, it would indicate an inefficiency in how the file was written.
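If no zip utility is at hand, the same per-entry check can be scripted. The following is only a minimal sketch in Python using the standard `zipfile` module; the function name `zip_entry_report` and the demo entry name are made up for illustration, and the in-memory archive merely stands in for a real xlsx file written by PDI.

```python
import io
import zipfile


def zip_entry_report(source):
    """Return (name, uncompressed_bytes, compressed_bytes, method) per entry.

    `source` may be a path or a file-like object. An .xlsx file is a ZIP
    archive, so it can be opened directly.
    """
    methods = {zipfile.ZIP_STORED: "stored", zipfile.ZIP_DEFLATED: "deflated"}
    with zipfile.ZipFile(source) as archive:
        return [
            (
                info.filename,
                info.file_size,        # size after decompression
                info.compress_size,    # size as stored in the archive
                methods.get(info.compress_type, str(info.compress_type)),
            )
            for info in archive.infolist()
        ]


def print_report(rows):
    """Print one line per entry with its compression ratio."""
    for name, raw, packed, method in rows:
        ratio = packed / raw if raw else 1.0
        print(f"{name:40s} {raw:>12d} {packed:>12d} {ratio:7.1%} {method}")


# Demo on a small in-memory archive (hypothetical content, not a valid workbook).
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("xl/worksheets/sheet1.xml", "<c><v>1</v></c>" * 1000)
print_report(zip_entry_report(demo))
```

For a real file, pass its path instead, for example `print_report(zip_entry_report("output.xlsx"))`, where `output.xlsx` is a placeholder name. Entries reported as "stored", or a sheet XML whose compressed size is close to its uncompressed size, would be the thing to look into.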

bolav