In my project I have a requirement to show the number of pages in Word documents (.doc, .docx) and the number of sheets in Excel documents (.xls, .xlsx). I tried reading the .docx file with Docx4j, but the performance is very poor and I only need the count. I then tried Apache POI, but I get an error like:

"trouble writing output: Too many methods: 94086; max is 65536. By package:" 

I want to know whether there is any paid or open source library available for Android that can do this.

Tiborg
  • When you first use docx4j after starting the VM, the JAXB context has to load. This one-off operation takes more or less time, depending on underlying hardware. On Android tablets, it generally takes a while. – JasonPlutext Nov 20 '12 at 20:46
  • In the 5 years since this question was asked, Plutext has introduced a (commercial) PDF Converter, which can efficiently calculate the number of pages for you. See https://stackoverflow.com/a/49201664/1031689 – JasonPlutext Mar 10 '18 at 23:45

1 Answer

There is no way to show the exact number of pages in an MS Word file, because it can differ from user to user. The exact number depends on printer settings, paper settings, fonts, available images, etc.

Still, you can do the following for binary (.doc) files:

  • open the file using POIFSFileSystem or NPOIFSFileSystem
  • extract only the FileInformationBlock, as is done in the HWPFDocumentCore constructor
  • create DocumentProperties from the FileInformationBlock, as is done in the HWPFDocument constructor
  • get the value of the cPg property of the DOP: DocumentProperties::getCPg()

The description of this field is: "A signed integer value that specifies the last calculated or estimated count of pages in the main document, depending on the values of fExactCWords and fIncludeSubdocsInStats."

For DOCX/XLSX documents you will need to read the equivalent property (I assume there is one), but using a SAX or StAX parser rather than loading the whole document.
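As a rough sketch of the SAX route, not a definitive implementation: a .docx or .xlsx file is a zip package, and by convention the extended properties part sits at docProps/app.xml (with a Pages element holding the page count last saved by the application that wrote the file) while the workbook part sits at xl/workbook.xml (with one sheet element per worksheet inside sheets). The class and method names below (OoxmlCounts, pageCount, sheetCount, parseEntry) are made up for illustration, and the code assumes those conventional part names; a producer may omit the Pages value or leave it stale. It uses only java.util.zip and the SAX classes, so it pulls in no large library.

```java
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: reads counts from an OOXML package without a document library.
final class OoxmlCounts {

    // Streams through the zip and parses the entry called `name`; returns false if it is missing.
    private static boolean parseEntry(InputStream pkg, String name, DefaultHandler handler)
            throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // so localName is reliable even with prefixes
        try (ZipInputStream zip = new ZipInputStream(pkg)) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                if (name.equals(e.getName())) {
                    factory.newSAXParser().parse(zip, handler);
                    return true;
                }
            }
        }
        return false;
    }

    // Stored page count of a .docx (assumed at docProps/app.xml), or -1 if not recorded.
    static int pageCount(InputStream docx) throws Exception {
        final StringBuilder text = new StringBuilder();
        parseEntry(docx, "docProps/app.xml", new DefaultHandler() {
            private boolean inPages;

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                inPages = "Pages".equals(local);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inPages) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                inPages = false;
            }
        });
        String value = text.toString().trim();
        return value.isEmpty() ? -1 : Integer.parseInt(value);
    }

    // Number of sheet elements inside sheets in xl/workbook.xml, or -1 if the part is missing.
    static int sheetCount(InputStream xlsx) throws Exception {
        final int[] count = {0};
        boolean found = parseEntry(xlsx, "xl/workbook.xml", new DefaultHandler() {
            private boolean inSheets;

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if ("sheets".equals(local)) {
                    inSheets = true;
                } else if (inSheets && "sheet".equals(local)) {
                    count[0]++;
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if ("sheets".equals(local)) {
                    inSheets = false;
                }
            }
        });
        return found ? count[0] : -1;
    }
}
```

A caller would pass any InputStream over the file, for example `OoxmlCounts.pageCount(new FileInputStream(file))`. As with cPg in the binary format, the Pages value is whatever the last editing application stored, not a freshly rendered count.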

vlsergey
  • Whilst there might be some minor variation, it is reasonable to want to get the actual number of pages rendered (as opposed to the value in the document's metadata, which may only reflect the last time it was edited in Word). – JasonPlutext Mar 10 '18 at 23:47