I'm trying to count the number of words in a file. The following code works fine with .txt files, but when I try to read .doc, .docx, or .xls files it gives the wrong output. Can anyone suggest a third-party library for this? Thanks.

$str = file_get_contents($path);

function count_words($string)
{
    $string = htmlspecialchars_decode(strip_tags($string));
    if (strlen($string)==0)
        return 0;
    $t = array(' '=>1, '_'=>1, "\x20"=>1, "\xA0"=>1, "\x0A"=>1, "\x0D"=>1, "\x09"=>1, "\x0B"=>1, "\x2E"=>1, "\t"=>1, '='=>1, '+'=>1, '-'=>1, '*'=>1, '/'=>1, '\\'=>1, ','=>1, '.'=>1, ';'=>1, ':'=>1, '"'=>1, '\''=>1, '['=>1, ']'=>1, '{'=>1, '}'=>1, '('=>1, ')'=>1, '<'=>1, '>'=>1, '&'=>1, '%'=>1, '$'=>1, '@'=>1, '#'=>1, '^'=>1, '!'=>1, '?'=>1); // separators
    $count= isset($t[$string[0]])? 0:1;
    if (strlen($string)==1)
        return $count;
    for ($i=1;$i<strlen($string);$i++)
        if (isset($t[$string[$i-1]]) && !isset($t[$string[$i]])) // if new word starts
            $count++;
    return $count;
}
echo count_words($str);
no_freedom
  • Office formats are much more complex than text files. They do not contain words in any kind of clear text. Extracting text from those formats is a non-trivial task. I'll look for a duplicate... – Pekka Sep 07 '11 at 08:07
  • Here's some advice: [PHP Read and Write in MS WORD](http://stackoverflow.com/q/5052292) – Pekka Sep 07 '11 at 08:09
  • @pekka Is it possible to store all the words in an array and then count the number of items in the array? – no_freedom Sep 07 '11 at 08:19
  • yes, I suppose that is possible. – Pekka Sep 07 '11 at 08:19
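The idea from the comments above, collecting the words into an array and counting its elements, might look like the sketch below. It assumes the text has already been extracted as a plain UTF-8 string; `split_words` is a hypothetical helper name, and treating any run of letters, digits, or apostrophes as a word is an assumption, not the asker's exact separator list.

```php
<?php
// Hypothetical helper: splits plain text into an array of words.
// Assumption: anything that is not a letter, digit, or apostrophe separates words.
function split_words(string $text): array
{
    $words = preg_split("/[^\\p{L}\\p{N}']+/u", $text, -1, PREG_SPLIT_NO_EMPTY);
    // preg_split() returns false on failure, e.g. for invalid UTF-8 input.
    return $words === false ? [] : $words;
}

$words = split_words("Hello, world! It's a test.");
echo count($words), "\n"; // prints 5
```

This only helps once the text is available as a string; for Office formats the extraction step still has to happen first.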

2 Answers

If you are running on Linux, try this:

system("wc -w " . escapeshellarg($filename));
Haim Evgi
  • The specification at http://pubs.opengroup.org/onlinepubs/9699919799/utilities/wc.html#tag_20_154_10 says "The input files may be of any type", but you will need to try it – Haim Evgi Sep 07 '11 at 08:31
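To use the count inside PHP rather than just print it, the output of `wc` can be captured. This is a minimal sketch, assuming a Unix-like system with `wc` on the PATH and a plain-text input; `wc_word_count` is a hypothetical helper name. For binary or compressed formats such as .doc, .docx, or .xls the command still runs, but the number it reports does not reflect the document's visible text.

```php
<?php
// Hypothetical helper: returns the word count reported by `wc -w`,
// or null if the command produces no output (e.g. the file is missing).
// Assumes a Unix-like system where `wc` is available.
function wc_word_count(string $filename): ?int
{
    // Feeding the file through stdin makes wc print only the number.
    $output = shell_exec('wc -w < ' . escapeshellarg($filename));
    if (!is_string($output) || trim($output) === '') {
        return null;
    }
    return (int) trim($output);
}

// Usage with a temporary plain-text file.
$tmp = tempnam(sys_get_temp_dir(), 'wc');
file_put_contents($tmp, "one two three\n");
echo wc_word_count($tmp), "\n"; // prints 3
unlink($tmp);
```

`escapeshellarg()` is used so that a filename containing spaces or shell metacharacters cannot alter the command.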

I am working on the same problem. All you need to do is parse the .doc, .docx, or .xls file in the right way to extract its text, and then pass the result to count_words(). For example, for .docx:

private function read_docx(){

    $content = '';

    $zip = zip_open($this->filename);

    // zip_open() returns an integer error code on failure
    if (!$zip || is_numeric($zip)) return false;

    while ($zip_entry = zip_read($zip)) {

        // Only the main document part holds the body text;
        // check the name first so skipped entries are never left open
        if (zip_entry_name($zip_entry) != "word/document.xml") continue;

        if (zip_entry_open($zip, $zip_entry) == FALSE) continue;

        $content .= zip_entry_read($zip_entry, zip_entry_filesize($zip_entry));

        zip_entry_close($zip_entry);
    }// end while

    zip_close($zip);

    // Separate table cells with a space and paragraphs with a line break
    $content = str_replace('</w:r></w:p></w:tc><w:tc>', " ", $content);
    $content = str_replace('</w:r></w:p>', "\r\n", $content);
    $stripped_content = strip_tags($content);

    return $stripped_content;
}
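The `zip_open()` family used in this answer is deprecated as of PHP 8.0. A rough equivalent using the `ZipArchive` class is sketched below, assuming the zip extension is enabled. `docx_to_text` is a hypothetical standalone function, not part of the answer's class, and replacing every `</w:p>` with a newline is a simplification that will not cover every document layout.

```php
<?php
// Hypothetical helper: extracts the visible text from a .docx file.
// Returns the text as a string, or false if the archive or its
// main document part cannot be read.
function docx_to_text(string $filename)
{
    $zip = new ZipArchive();
    // open() returns true on success and an integer error code otherwise.
    if ($zip->open($filename) !== true) {
        return false;
    }
    // The main body of a .docx lives in word/document.xml.
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    if ($xml === false) {
        return false;
    }
    // Turn paragraph ends into newlines so adjacent paragraphs do not fuse.
    $xml = str_replace('</w:p>', "\n", $xml);
    // Remove the markup, then decode XML entities such as &amp;.
    return html_entity_decode(strip_tags($xml), ENT_QUOTES | ENT_XML1, 'UTF-8');
}
```

The returned string can then be passed to `count_words()` from the question. The legacy .doc and .xls formats are binary rather than zipped XML, so this approach does not apply to them.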