I want to find out the size of the content inside a docx, pptx, etc. Is there a package that can be used for this? I googled and found that POI is widely used to read and write MS file types, but I could not find the right API for getting the size of the file content. I want the actual content size, not the compressed file size shown in the file properties.

I finally found a way, but it throws an out-of-memory (OOM) exception if the file is too large.

    OPCPackage opcPackage = OPCPackage.open(file.getAbsolutePath());
    XWPFDocument doc = new XWPFDocument(opcPackage);
    XWPFWordExtractor we = new XWPFWordExtractor(doc);
    String paragraphs = we.getText();
    System.out.println("Total Paragraphs: " + paragraphs.length() / 1024);

Please help me if there is a better way to do this.

Cool
  • So the sum of lengths of each part in the package? In the case of XML parts, do you want the length to include element names, e.g. w:p, or just the content of text nodes? Note that the length of an XML document can vary, depending on what namespace prefixes are used, where the namespaces are declared, etc. Also, in Open XML, an attribute value might be true, 1 or on. – JasonPlutext Nov 27 '13 at 09:09
  • @JasonPlutext i want to find the size of text content alone. – Cool Nov 27 '13 at 09:22
  • Use POI to extract the text, then call `textString.length()` on it? – Gagravarr Nov 27 '13 at 12:34
  • Could you please give me some sample code or link? – Cool Nov 28 '13 at 11:19
  • I'm fairly sure that most paragraphs won't be exactly 1024 characters long... – Gagravarr Nov 28 '13 at 15:59
  • Very old question but this link might be useful for someone who is looking for answer - https://stackoverflow.com/a/58540126/341117 – Ravindra Gullapalli Aug 07 '20 at 12:57

1 Answer

OK, this was asked a long time ago and has had no answer so far. I have not used OPCPackage, so my answer is not based on it.

DOCX files (and PPTX and XLSX files as well) are all zip files with a particular structure. We can therefore use the java.util.zip package to enumerate the entries of the zip file and add up the sizes of the entries under the content folder: xl for xlsx files, word for docx files and ppt for pptx files. A more generic method is probably to ignore the following top-level zip entries, i.e. zip entries whose names start with:

  1. docProps
  2. _rels
  3. [Content_Types].xml

The summed uncompressed size of the remaining zip entries (do not skip any folder nested inside them) tells you the size of the content. Note that this counts XML markup and embedded media as well as text. This method is also very efficient: you only read the entry metadata of the zip file and never decompress the data itself, so obtaining the size information takes negligible time and memory. As a quick check, I was able to get the size of a 4 MB docx file in a fraction of a second.

A "good-enough" but not thoroughly polished piece of code using this approach is pasted below. Please feel free to use it as a starting point and fix bugs if you find any. It would be great if you could post back your modifications or corrections so that others can benefit.

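    import java.io.File;
    import java.io.IOException;
    import java.util.Enumeration;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipFile;

    private static void printUnzippedContentLength() throws IOException
    {
        long sumBytes = 0L;

        // try-with-resources closes the ZipFile even if an exception is thrown
        try (ZipFile zf = new ZipFile(new File("/home/chaitra/verybigfile.docx")))
        {
            Enumeration<? extends ZipEntry> entries = zf.entries();
            while (entries.hasMoreElements())
            {
                ZipEntry ze = entries.nextElement();
                String name = ze.getName();

                // Skip package metadata; every other entry counts as content
                if (name.startsWith("docProps") || name.startsWith("_rels") || name.startsWith("[Content_Types].xml"))
                {
                    continue;
                }

                // getSize() returns -1 when the uncompressed size is not known
                long size = ze.getSize();
                if (size > 0)
                {
                    sumBytes += size;
                }
            }
        }

        System.out.println("Uncompressed content has size " + (sumBytes / 1024) + " KB");
    }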
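
If only the text matters (as the asker clarified in the comments), the uncompressed size overcounts, because it includes XML markup and media. Below is a minimal sketch of one possible alternative, assuming a docx whose body text sits in `<w:t>` elements of `word/document.xml`: it streams that single part with the JDK's StAX parser and adds up character counts, so the full text is never held in memory. The names `DocxTextSize` and `countTextChars` are made up for illustration. Headers, footers, footnotes and other parts are not counted, and the result is in Java `char` units, like `String.length()`.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class, not part of POI or the JDK.
final class DocxTextSize {
    // Main WordprocessingML namespace; text runs are stored in <w:t> elements.
    private static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Sums the lengths of all character data found inside <w:t> elements. */
    static long countTextChars(Path docx) throws IOException, XMLStreamException {
        try (ZipFile zip = new ZipFile(docx.toFile())) {
            ZipEntry body = zip.getEntry("word/document.xml");
            if (body == null) {
                throw new IOException("No word/document.xml in " + docx);
            }
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // DTDs and external entities are not needed here; switch them off.
            factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
            factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, Boolean.FALSE);

            try (InputStream in = zip.getInputStream(body)) {
                XMLStreamReader xml = factory.createXMLStreamReader(in);
                long chars = 0;
                boolean inText = false;
                while (xml.hasNext()) {
                    int event = xml.next();
                    if (event == XMLStreamConstants.START_ELEMENT) {
                        inText = "t".equals(xml.getLocalName())
                                && W_NS.equals(xml.getNamespaceURI());
                    } else if (event == XMLStreamConstants.END_ELEMENT) {
                        inText = false; // <w:t> has no child elements
                    } else if (inText && (event == XMLStreamConstants.CHARACTERS
                            || event == XMLStreamConstants.CDATA)) {
                        // Text may arrive in several chunks; add each one.
                        chars += xml.getTextLength();
                    }
                }
                xml.close();
                return chars;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Usage: java DocxTextSize path/to/file.docx
        System.out.println(countTextChars(Paths.get(args[0])) + " characters");
    }
}
```

Because the parser hands over the text in chunks and only their lengths are kept, memory use should stay roughly constant regardless of document size, which is the point of avoiding `getText()` here.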
Prahalad Deshpande