file_get_contents() returning invalid characters for uploaded word document

Question

I'm trying to get the first 1,000 characters from an uploaded text file. I'm doing:

if ($file->simpletype == "document") {
    // Read the first 1000 bytes. Offset 0 is used because a negative
    // offset seeks from the end of the file on PHP 7.1 and later.
    $snippet = file_get_contents($_FILES['upload']['tmp_name'], false, null, 0, 1000);
    file_put_contents('/var/www/my_logs/log.log', $snippet);
    $file->snippet = $snippet;
}

This works fine for a .txt file, and I can open and read the log.log file with gedit. However, for .doc, .docx, .odt and .pdf files, file_get_contents() returns gibberish such as: PK\00\00\00\

I have tried another solution I found on Stack Overflow:

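function file_get_contents_utf8() {
    // Offset 0 reads from the start of the file (a negative offset seeks from the end on PHP 7.1+)
    $content = file_get_contents($_FILES['upload']['tmp_name'], false, null, 0, 1000);
    // Convert from the detected encoding (UTF-8 or ISO-8859-1) to UTF-8
    return mb_convert_encoding($content, 'UTF-8',
             mb_detect_encoding($content, 'UTF-8, ISO-8859-1', true));
}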

But I get the same results. Any ideas? Thanks!

If you open the full file in a text editor, you will see the same problem. These are NOT text files to begin with, so they will not suddenly look like text just because you take only the first 1,000 characters. - Anigel, May 23 '13 at 11:49

score 2 · Accepted Answer · edited May 23 '17 at 10:25

You are trying to read text from files that don't use plain text formatting.
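As a rough illustration of why the bytes look like gibberish, the sketch below inspects the leading "magic bytes" of a buffer. The function name sniff_container() is hypothetical (it is not part of PHP or of any library mentioned here), and it only checks three well-known signatures: the ZIP local file header used by .docx and .odt, the PDF header, and the OLE2 compound file header used by legacy .doc files.

```php
<?php
// Hypothetical helper: guess the container format from the first bytes of a file.
function sniff_container(string $bytes): string
{
    // ZIP local file header: .docx and .odt are ZIP archives, hence the "PK" prefix
    if (strncmp($bytes, "PK\x03\x04", 4) === 0) {
        return 'zip';
    }
    // PDF files start with "%PDF-" followed by a version number
    if (strncmp($bytes, "%PDF-", 5) === 0) {
        return 'pdf';
    }
    // OLE2 compound file signature, used by legacy binary .doc files
    if (strncmp($bytes, "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1", 8) === 0) {
        return 'ole2';
    }
    // No known binary signature matched; the data may be plain text
    return 'unknown';
}
```

With the code from the question, this could be called as sniff_container($snippet) before deciding whether the snippet is usable as text. A result of 'zip' would explain the PK\00\00\00\ output.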

To read doc/docx files, you will need to use a tool like PHPDocx or http://phpword.codeplex.com.

For parsing PDFs, refer to the answer to this question.

answered May 23 '13 at 11:49

Jon Cairns


score 1 · Answer 2 · answered May 23 '13 at 11:48

This will never work with files that are not plain text. You need to extract the plain text from doc/pdf/odt documents first, and then you can manipulate that text. Simply open any of these documents in a simple text editor like Notepad and look at their contents.

For Word documents you may start with http://phpword.codeplex.com/. Also look for other libraries which you can use to extract contents from these files.
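For .docx specifically, one possible approach without a third-party library is sketched below. It assumes PHP's zip and mbstring extensions are available and relies on the fact that a .docx file is a ZIP archive whose main body is stored in word/document.xml. The function name docx_snippet() is hypothetical, and stripping tags is a crude way to get text (it ignores headers, footnotes, and formatting), so treat this as a starting point rather than a full parser. It does not handle .doc, .odt, or .pdf.

```php
<?php
// Hypothetical helper: return the first $length characters of a .docx file's body text,
// or null if the file cannot be read as a .docx archive.
function docx_snippet(string $path, int $length = 1000): ?string
{
    $zip = new ZipArchive();
    // open() returns true on success and an integer error code otherwise
    if ($zip->open($path) !== true) {
        return null;
    }
    // The main document body lives in word/document.xml inside the archive
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    if ($xml === false) {
        return null;
    }
    // Add a space after each paragraph so words from adjacent paragraphs do not run together
    $xml = str_replace('</w:p>', '</w:p> ', $xml);
    // Drop the XML markup and decode entities such as &amp;
    $text = html_entity_decode(strip_tags($xml), ENT_QUOTES | ENT_XML1, 'UTF-8');
    // Collapse runs of whitespace and trim the ends
    $text = trim(preg_replace('/\s+/u', ' ', $text));
    // Cut by characters, not bytes, so multibyte characters are not split
    return mb_substr($text, 0, $length, 'UTF-8');
}
```

In the question's code, this could replace the file_get_contents() call for .docx uploads, for example $file->snippet = docx_snippet($_FILES['upload']['tmp_name']);, with a fallback for the case where null is returned.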
