Count characters in DOC and DOCX with PHP on Linux

Question

ADDITION: The closest method I have found for counting lines is the Linux command "antiword" for DOC files, which returns a text version of the DOC. For DOCX, I use a call that retrieves the content from the DOCX and pushes the data through the same text function as the antiword output.

The problem now is that when the file contains tables, antiword adds a lot of whitespace.
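One way to keep the table padding from inflating the count is to collapse every run of whitespace into a single space before counting. This is only a minimal sketch: the function name `count_normalized_chars` is made up for illustration, and it assumes the text is valid UTF-8 and that the mbstring extension is installed.

```php
<?php
// Collapse each run of whitespace (spaces, tabs, line breaks) into one space,
// trim the ends, then count the remaining characters.
// Assumption: $text is valid UTF-8 and mbstring is available.
function count_normalized_chars($text)
{
    $normalized = trim(preg_replace('/\s+/u', ' ', $text));
    return mb_strlen($normalized, 'UTF-8');
}

// Hypothetical antiword-style output with padded table cells.
$sample = "Name      |  Qty   \n\n  Apple     |  3     ";
echo count_normalized_chars($sample), "\n";
```

Whether single spaces should be counted at all depends on the counting rule you need; stripping them entirely would be a one-line change to the replacement string.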

===

I have a script that works out character count within DOCX files:

$striped_content = '';
$content = '';

if (!$filename || !file_exists($filename)) return false;

// zip_open() returns a resource on success and an integer error code on failure
$zip = zip_open($filename);

if (!$zip || is_numeric($zip)) return false;

while ($zip_entry = zip_read($zip)) {

    // skip every entry except the main document body
    if (zip_entry_name($zip_entry) != "word/document.xml") continue;

    if (zip_entry_open($zip, $zip_entry) == FALSE) continue;

    $content .= zip_entry_read($zip_entry, zip_entry_filesize($zip_entry));

    zip_entry_close($zip_entry);
}// end while

// close the archive handle, not the last entry
zip_close($zip);

// a table cell boundary becomes a space, a paragraph end becomes a line break
$content = str_replace('</w:r></w:p></w:tc><w:tc>', " ", $content);
$content = str_replace('</w:r></w:p>', "\r\n", $content);
$striped_content = trim(strip_tags($content));

If I have a DOC file, I convert it to DOCX using the LibreOffice command line and then run the script above.

The problem is that I am unable to find out how many words the file has within the "HEADER" and "FOOTER" areas. How can this be accomplished?

My server runs: PHP 5.3, LibreOffice, CentOS 6.5

I am not sure what other information I need to supply. Thank you in advance for your answers.

P.S.

I have tried converting DOC and DOCX to TXT, but the "HEADER" and "FOOTER" areas were not kept in the resulting TXT document.

Also, the closest solution that I have found is: https://github.com/nagilum/DOCx

The library breaks up the whole DOCX file so that you have the header, content and footer in plain text, and I can try to work out the word count from there. However, LibreOffice sometimes seems to convert files to DOCX badly: a DOC file with 1 page may have 2 pages in DOCX after conversion.

score 0 · Answer 1 · answered Feb 10 '15 at 21:32

Any *.docx file is a zip archive. It contains the file docProps/app.xml, where you can find a node like:

<Characters>8657</Characters>

and extract the value with a regular expression.
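A rough sketch of that lookup using ZipArchive could look like the following. The function name `docx_stored_char_count` is made up for illustration, and the value is whatever the application that last saved the file wrote into app.xml, so it may be stale or missing.

```php
<?php
// Read the stored character count from docProps/app.xml inside a DOCX.
// Returns an int, or false if the archive, the part, or the node is missing.
function docx_stored_char_count($filename)
{
    $zip = new ZipArchive();
    // open() returns true on success and an integer error code otherwise
    if ($zip->open($filename) !== true) {
        return false;
    }
    $xml = $zip->getFromName('docProps/app.xml');
    $zip->close();
    if ($xml === false) {
        return false;
    }
    // The node holds a plain integer, so a simple pattern is enough here.
    if (preg_match('~<Characters>(\d+)</Characters>~', $xml, $m)) {
        return (int) $m[1];
    }
    return false;
}
```

Calling `docx_stored_char_count('/path/to/file.docx')` (a placeholder path) returns the number from the node, or `false` when it cannot be read.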

answered Feb 10 '15 at 21:32

Ruben Kazumov


Hey, thanks for your reply. In my situation (it might be different for others), the "Characters" and "CharactersWithSpaces" tags include only the "Content" area of the file and exclude the "header" and "footer". With the header I should have 700 chars, but what I see in "CharactersWithSpaces" is 500, which is the count without it. – Vlad Vladimir Hercules Feb 10 '15 at 21:41
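Since the stored count leaves out the header and footer, one possible workaround is to read those parts from the archive directly. The sketch below assumes the parts follow Word's usual naming (word/header1.xml, word/footer1.xml, and so on); the function names are made up for illustration, and mbstring is assumed to be installed. A file can carry several header or footer parts (first page, odd, even), and field codes such as page numbers are not filtered out, so treat the totals as an approximation.

```php
<?php
// Turn one WordprocessingML part (document, header or footer) into plain text.
function docx_part_text($xml)
{
    // Add a space after each paragraph so adjacent paragraphs do not merge.
    $xml = str_replace('</w:p>', '</w:p> ', $xml);
    // Drop the markup, then decode entities such as &amp; back to characters.
    $text = html_entity_decode(strip_tags($xml), ENT_QUOTES, 'UTF-8');
    // Collapse whitespace runs so layout padding is not counted.
    return trim(preg_replace('/\s+/u', ' ', $text));
}

// Return character counts per area, or false if the archive cannot be opened.
function docx_char_counts($filename)
{
    $counts = array('document' => 0, 'header' => 0, 'footer' => 0);
    $zip = new ZipArchive();
    if ($zip->open($filename) !== true) {
        return false;
    }
    for ($i = 0; $i < $zip->numFiles; $i++) {
        $name = $zip->getNameIndex($i);
        // Assumed naming: word/document.xml, word/headerN.xml, word/footerN.xml
        if (!preg_match('~^word/(document|header|footer)\d*\.xml$~', $name, $m)) {
            continue;
        }
        $text = docx_part_text($zip->getFromIndex($i));
        $counts[$m[1]] += mb_strlen($text, 'UTF-8');
    }
    $zip->close();
    return $counts;
}
```

The result is an array with the keys `document`, `header` and `footer`, so the three areas can be reported separately or summed.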
