
I have a large number of files whose original file names have been replaced by IDs from my database. For example, what was once named word_document.doc is now 12345. Through a process I have lost the original name.

I am now trying to offer these files for download. A user should be able to download a file and open it in its original application. The files are all in one of the following formats:

  • .txt (text)
  • .doc (word document)
  • .docx (word document)
  • .wpd (word perfect)
  • .pdf (PDF)
  • .rtf (rich text)
  • .sxw (star office)
  • .odt (open office)

I'm using

$fhandle = finfo_open(FILEINFO_MIME);
$file_mime_type = finfo_file($fhandle, $filepath);

to get the mime type and then mapping the mime type to an extension.

The problem I am running into is that some of the files come back with a MIME type of application/octet-stream. From what I've read online, this is a catch-all type for arbitrary binary data, so I can't easily tell what the extension should be. In some cases setting it to .wpd works and in others it doesn't. The same goes for .sxw.

Caleb Doucet
  • Lol, I think the key phrase in your post is 'Through a process I have lost the original name'. You already keep some info in the database, so why not keep the filenames there too? – degr Jul 14 '15 at 14:59
  • Maybe this will help you? http://tika.apache.org/ – sanderbee Jul 14 '15 at 15:00
  • @degr I do keep filenames in the database, but users are allowed to "delete" their files. "Deleting" is simply removing the row in the database that holds information such as the filename. As part of the website we need to keep the files and have them still accessible as the files are now owned by others. – Caleb Doucet Jul 14 '15 at 15:05
  • @Caleb Doucet You need to delete the file together with the row in the database. If you need to keep the files, you can keep the row too and just add one more 'bit' field named deleted. – degr Jul 14 '15 at 15:08
  • @degr I understand the solution would be to just keep the database record but that would require a lot of rework. (it is a big system) The budget won't allow for what you are proposing. – Caleb Doucet Jul 14 '15 at 15:11
  • Also, when you generate the new file name that matches the row in the database, you can keep some metadata next to the file. For example, if you have file 12341 and a database row with id 12341, you can do this: file_put_contents('12341.metadata', serialize($database->getRowById(12341))). This is ugly, but it works perfectly. – degr Jul 14 '15 at 15:11
  • Unfortunately I don't know any other solutions – degr Jul 14 '15 at 15:12

1 Answer


Symfony2 does this in three steps:

1) mime_content_type

$type = mime_content_type($path);

// remove charset (added as of PHP 5.3)
if (false !== $pos = strpos($type, ';')) {
    $type = substr($type, 0, $pos);
}

return $type;

2) file -b --mime

ob_start();
passthru(sprintf('file -b --mime %s 2>/dev/null', escapeshellarg($path)), $return);
if ($return > 0) {
    ob_end_clean();

    return;
}

$type = trim(ob_get_clean());
if (!preg_match('#^([a-z0-9\-]+/[a-z0-9\-\.]+)#i', $type, $match)) {
    // it's not a type, but an error message
    return;
}

return $match[1];

3) finfo

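// the second constructor argument is a magic database path, not the file to inspect, so it is omitted here
if (!$finfo = new \finfo(FILEINFO_MIME_TYPE)) {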
    return;
}

return $finfo->file($path);

Once you have the MIME type, you can look up the extension in a predefined map, for example from here or here:

$map = array(
    'application/msword' => 'doc',
    'application/x-msword' => 'doc',
    'application/vnd.openxmlformats-officedocument.wordprocessingml.document' => 'docx',
    'application/pdf' => 'pdf',
    'application/x-pdf' => 'pdf',
    'application/rtf' => 'rtf',
    'text/rtf' => 'rtf',
    'application/vnd.sun.xml.writer' => 'sxw',
    'application/vnd.oasis.opendocument.text' => 'odt',
    'text/plain' => 'txt',
);
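
The three steps and the map can be chained together. What follows is only a minimal sketch, not Symfony's actual code: the function names (guessWithMimeContentType, guessWithFileCommand, guessWithFinfo, normalizeMimeType, guessExtension) are made up for illustration, the map is shortened, and it assumes PHP 7.1+ on a Unix-like host where the `file` command is available. Each guesser is tried in turn and the first MIME type that appears in the map wins, so a file reported as application/octet-stream by one method still gets a chance with the next one.

```php
<?php
// Shortened copy of the map above; extend it as needed.
$map = array(
    'application/msword' => 'doc',
    'application/pdf'    => 'pdf',
    'application/rtf'    => 'rtf',
    'text/rtf'           => 'rtf',
    'text/plain'         => 'txt',
);

// Step 1: mime_content_type(), if the function is available.
function guessWithMimeContentType(string $path): ?string
{
    if (!function_exists('mime_content_type')) {
        return null;
    }
    $type = @mime_content_type($path);
    return is_string($type) ? $type : null;
}

// Step 2: the `file` command (assumes a Unix-like system).
function guessWithFileCommand(string $path): ?string
{
    if (!function_exists('shell_exec')) {
        return null;
    }
    $out = @shell_exec(sprintf('file -b --mime %s 2>/dev/null', escapeshellarg($path)));
    return is_string($out) ? trim($out) : null;
}

// Step 3: the finfo extension with its default magic database.
function guessWithFinfo(string $path): ?string
{
    if (!class_exists('finfo')) {
        return null;
    }
    $finfo = new finfo(FILEINFO_MIME_TYPE);
    $type = @$finfo->file($path);
    return is_string($type) ? $type : null;
}

// Strip "; charset=..." and reject error messages, as in step 2 above.
function normalizeMimeType(?string $raw): ?string
{
    if ($raw === null || !preg_match('#^([a-z0-9\-]+/[a-z0-9\-\.]+)#i', $raw, $match)) {
        return null;
    }
    return strtolower($match[1]);
}

// Return the extension for the first guess found in the map, or null.
function guessExtension(string $path, array $map): ?string
{
    $guessers = array('guessWithMimeContentType', 'guessWithFileCommand', 'guessWithFinfo');
    foreach ($guessers as $guesser) {
        $type = normalizeMimeType($guesser($path));
        if ($type !== null && isset($map[$type])) {
            return $map[$type];
        }
    }
    return null; // unknown type, e.g. application/octet-stream
}
```

A null result means none of the methods produced a type the map knows about, so those files still need separate handling.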
Nikita U.
  • These are great ways to get the mime type from a file path but I am already retrieving the mime type. I need to know what to do to parse octet-stream mime types to the appropriate extension. – Caleb Doucet Jul 14 '15 at 16:18
  • Well, I don't think there is a 100% reliable way to determine the extension, but combining these 3 methods should do a great job. Sometimes automating 95% is better than nothing. The other 5% you can handle manually. There is a good chance that they all have the same extension :) – Nikita U. Jul 14 '15 at 16:53
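
For the files that still come back as application/octet-stream, one option to try before handling them manually is to look at the leading bytes of each file. The sketch below is an assumption-heavy illustration rather than a complete detector: sniffExtension is a made-up name, the signatures are the commonly documented ones for the formats listed in the question, and they should be checked against a sample of the real files. In particular, the OLE2 signature is shared by other legacy Office formats (.xls, .ppt), and the .sxw/.odt check relies on the package storing its "mimetype" entry first and uncompressed, which is the usual layout but not guaranteed.

```php
<?php
// Guess an extension from a file's leading bytes; returns null when unsure.
function sniffExtension(string $path): ?string
{
    $head = @file_get_contents($path, false, null, 0, 512);
    if ($head === false || $head === '') {
        return null;
    }

    // WordPerfect documents start with 0xFF followed by "WPC".
    if (strncmp($head, "\xFFWPC", 4) === 0) {
        return 'wpd';
    }
    if (strncmp($head, '%PDF-', 5) === 0) {
        return 'pdf';
    }
    if (strncmp($head, '{\\rtf', 5) === 0) {
        return 'rtf';
    }
    // OLE2 compound file; assumed to be .doc because of the known file set.
    if (strncmp($head, "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1", 8) === 0) {
        return 'doc';
    }

    // ZIP-based formats: .sxw, .odt and .docx.
    if (strncmp($head, "PK\x03\x04", 4) === 0) {
        if (strpos($head, 'application/vnd.sun.xml.writer') !== false) {
            return 'sxw';
        }
        if (strpos($head, 'application/vnd.oasis.opendocument.text') !== false) {
            return 'odt';
        }
        // .docx: look for the main document part (needs the zip extension).
        if (class_exists('ZipArchive')) {
            $zip = new ZipArchive();
            if ($zip->open($path) === true) {
                $isDocx = $zip->locateName('word/document.xml') !== false;
                $zip->close();
                if ($isDocx) {
                    return 'docx';
                }
            }
        }
    }

    return null; // leave for manual handling
}
```

Anything that returns null here (plain text without a recognizable signature, damaged files, other formats) still falls into the part that has to be handled manually.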