4

I have a chunk of code that I'm using to read MS Office Word documents.

It reads only the text, not the rest of the contents.

<?php
/**
 * Extracts the plain text of a .docx file.
 * Only word/document.xml is read, so images, charts and formatting are lost.
 */
function read_file_docx($filename) {
    if (!$filename || !file_exists($filename)) return false;

    // zip_open() returns an integer error code on failure
    $zip = zip_open($filename);
    if (!$zip || is_numeric($zip)) return false;

    $content = '';
    while ($zip_entry = zip_read($zip)) {
        // Skip everything except the main document part
        if (zip_entry_name($zip_entry) != "word/document.xml") continue;
        if (zip_entry_open($zip, $zip_entry) == FALSE) continue;
        $content .= zip_entry_read($zip_entry, zip_entry_filesize($zip_entry));
        zip_entry_close($zip_entry);
    }
    zip_close($zip);

    // Turn table-cell boundaries into spaces and paragraph ends into line breaks
    $content = str_replace('</w:r></w:p></w:tc><w:tc>', " ", $content);
    $content = str_replace('</w:r></w:p>', "\r\n", $content);

    return strip_tags($content);
}

$filename = "customers.docx";

$content = read_file_docx($filename);
if ($content !== false) {
    echo nl2br($content);
} else {
    echo 'Couldn\'t read the file. Please check that file.';
}

?>

I want to read the images, graphs and all other contents as well, and display them in a web page.

Kara
Mahendra Jella
  • When you say read, what exactly do you mean? Are you trying to get the pieces of the document, or are you intending to present the document in its original format and structure? – Brad Nov 22 '13 at 18:47
  • I intend to present the whole document with all the contents in it. – Mahendra Jella Nov 22 '13 at 18:49
  • The only way to reliably do that is to fire up a copy of Word and use its API. You could spend years working on this otherwise. Just ask the folks contributing to OpenOffice about that. :-D – Brad Nov 22 '13 at 18:52
  • Actually, I have looked, but I didn't find any APIs for reading Word documents. Can you suggest one, if you know of any? – Mahendra Jella Nov 22 '13 at 18:55
  • Have you looked into converting the docs into a PDF and displaying the PDF on your site? – user2537383 Nov 22 '13 at 19:50
  • @user That is nice, but I want to display it in a web page, not in Adobe Reader. – Mahendra Jella Nov 23 '13 at 08:56
  • What do you expect the return type of `read_file_docx()` to be? HTML? Perhaps you're better off using a DOC to HTML library: http://www.phplivedocx.org/ – Bailey Parker Nov 28 '13 at 08:57

3 Answers

2

I think you should first convert your .doc documents to PDF with command-line OpenOffice or LibreOffice.

With LibreOffice it would be:

libreoffice --headless --convert-to pdf your_file_name.doc

Then use pdf.js ( https://github.com/mozilla/pdf.js/ ) to display the documents on your site (you don't need Adobe Reader).

Here is another minimal example: https://github.com/vivin/pdfjs-text-selection-demo (read the minimal.js file to understand how the PDF is inserted).
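
The LibreOffice conversion step above could be driven from PHP roughly as follows. This is only a sketch: it assumes a `libreoffice` binary is on the server's PATH and that `exec()` is enabled, and `build_pdf_convert_command` and `convert_to_pdf` are hypothetical helper names, not part of any library.

```php
<?php
// Builds the shell command for a headless LibreOffice PDF conversion.
// Both arguments are escaped so file names with spaces are passed safely.
function build_pdf_convert_command(string $inputFile, string $outputDir): string
{
    return sprintf(
        'libreoffice --headless --convert-to pdf --outdir %s %s',
        escapeshellarg($outputDir),
        escapeshellarg($inputFile)
    );
}

// Runs the conversion and returns the path of the PDF, or null on failure.
// Assumption: LibreOffice writes <input base name>.pdf into $outputDir.
function convert_to_pdf(string $inputFile, string $outputDir): ?string
{
    exec(build_pdf_convert_command($inputFile, $outputDir), $output, $exitCode);
    $pdf = rtrim($outputDir, '/') . '/' . pathinfo($inputFile, PATHINFO_FILENAME) . '.pdf';
    return ($exitCode === 0 && is_file($pdf)) ? $pdf : null;
}
```

The resulting PDF path can then be handed to pdf.js on the page.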

A second option is to convert the doc into docx and use https://github.com/stephen-hardy/DOCX.js

Pawel Dubiel
1

If you are trying to extract all of the document content and convert it into a matching web display entirely by yourself, I suggest reading Microsoft's format specifications.
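
As a starting point for the do-it-yourself route: a .docx file is a ZIP package, and Word normally stores embedded pictures under `word/media/`. The following is a minimal sketch, assuming PHP's zip extension (`ZipArchive`) is available; `extract_docx_media` and `media_to_data_uri` are hypothetical helper names. Charts are typically stored as separate XML parts rather than images, so this sketch does not cover them.

```php
<?php
// Returns the embedded media of a .docx as [file name => raw bytes].
// An empty array is returned when the file cannot be opened as a ZIP archive.
function extract_docx_media(string $filename): array
{
    $zip = new ZipArchive();
    if ($zip->open($filename) !== true) {
        return [];
    }

    $media = [];
    for ($i = 0; $i < $zip->numFiles; $i++) {
        $name = $zip->getNameIndex($i);
        // Keep only files below word/media/ and skip directory entries
        if ($name !== false && strpos($name, 'word/media/') === 0 && substr($name, -1) !== '/') {
            $media[basename($name)] = $zip->getFromIndex($i);
        }
    }
    $zip->close();

    return $media;
}

// Encodes raw bytes as a data URI usable in an <img src="..."> attribute.
// The caller supplies the MIME type (for example 'image/png').
function media_to_data_uri(string $bytes, string $mimeType): string
{
    return 'data:' . $mimeType . ';base64,' . base64_encode($bytes);
}
```

Each returned entry could then be printed as an `<img>` tag next to the extracted text. Placing the images at their original positions would additionally require resolving the relationship IDs referenced in `word/document.xml`.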


If you're just looking for a convenient way of extracting the contents from an MS Word document, I would strongly suggest looking into a library that already handles the document processing and extraction.

There are 2 projects that I know of that are working on the processing of MS Office documents in PHP.

  • PHPOffice / PHPWord (I'm not sure how far the Word branch of the project has developed. The project started on a smaller scale, supporting only MS Excel, but they are now working on Word and PowerPoint as well.)

  • PHPDocX (This is a split project. You can get an LGPL-licensed version with a basic feature set, or a commercial paid-for version which should handle most things you find in common Word documents.)

HTH

Mastacheata
  • @Mastacheata I tried this; it supports documents only up to MS Word 2007, not higher versions. Anyhow, thanks for the suggestions. – Mahendra Jella Nov 29 '13 at 11:37
  • @Mahendra is there anything specific you need that's only available in Office 2010 or 2013? I've always thought the file formats were both forward and backward compatible. – Mastacheata Nov 29 '13 at 17:46
1

You should check out Aspose Cloud. It's a service that allows you to convert docx to HTML.

There is a PHP SDK for it on GitHub.

There is a free option if you are converting fewer than 100 documents per month.

Good luck.

Rafal