How to reliably tell the uploaded file type (text or binary)?

Question

I have an application where users should be able to upload a wide variety of files, but I need to know for each file, if I can safely display its textual representation as plain text.

Using python-magic like

m = Magic(mime=True).from_buffer(cgi.FieldStorage.file.read())

gives me the correct MIME type.

But sometimes, the MIME type for scripts is application/*, so simply looking for m.startswith('text/') is not enough.

Another site suggested using

m = Magic().from_buffer(cgi.FieldStorage.file.read())

and checking for 'text' in m.

Would the second approach be reliable enough for a collection of arbitrary file uploads or could someone give me another idea?

Thanks a lot.

If you have a reasonably well-defined set of criteria, I would steer away from `file` / magic, as its heuristics sometimes misfire in peculiar ways. How about checking that there are no long runs of unprintable characters, checking that line lengths are sane, and replacing anything that looks like HTML with entities before displaying? — tripleee, Aug 14 '12 at 07:30
@InbarRose, I wouldn't trust the user's filenames in this case... — moschlar, Aug 14 '12 at 07:33
@tripleee Checking for unprintable characters seems to be another question of faith. Do you have a tip for that? — moschlar, Aug 14 '12 at 07:36
@moschlar: check the line length and search for ASCII codes below 0x20 in the first lines of the file - and never trust user input. — Paulo Scardine, Aug 14 '12 at 07:39
@PauloScardine: What, no newlines (0x0a/0x0d) or tabs (0x09) allowed? — Aaron Digulla, Aug 14 '12 at 07:48
Look for `string.printable`. http://docs.python.org/library/string.html#string.printable — tripleee, Aug 14 '12 at 07:55
Oh, if you need this for Unicode, have a look (with some prejudice) at http://stackoverflow.com/questions/92438/stripping-non-printable-characters-from-a-string-in-python — tripleee, Aug 14 '12 at 08:00
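
The heuristics suggested in the comments above could be sketched roughly as follows. The function names and both thresholds are made up for illustration, and `string.printable` only covers ASCII, so a long run of non-ASCII letters would be treated the same as a run of control characters.

```python
import html
import string

PRINTABLE = set(string.printable)  # ASCII letters, digits, punctuation, whitespace


def looks_displayable(text, max_line_length=1000, max_unprintable_run=4):
    """Heuristic only: reject long runs of non-printable characters and overlong lines.

    Both thresholds are arbitrary assumptions, not values from the thread.
    """
    run = 0
    for ch in text:
        run = 0 if ch in PRINTABLE else run + 1
        if run > max_unprintable_run:
            return False
    return all(len(line) <= max_line_length for line in text.splitlines())


def escape_for_display(text):
    """Replace HTML-significant characters with entities before rendering."""
    return html.escape(text)
```

A caller would run `looks_displayable` first and pass the text through `escape_for_display` only when rendering it.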

Aaron Digulla · Answer 1 · 2012-08-14T08:18:11.417

What is your goal? Do you want the real mime type? Is that important for security reasons? Or is it "nice to have"?

The problem is that the same file can have different mime types. When a script file has a proper #! header, python-magic can determine the script type and tell you. If the header is missing, text/plain might be the best you can get.

This means there is no general "will always work" magic solution (despite the name of the module). You will have to sit down and think what information you can get, what it means and how you want to treat it.

The secure solution would be to create a list of mime types that you accept and check them with:

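allowed_mime_types = [ ... ]  # the exact MIME types you are willing to accept
if m in allowed_mime_types:
    pass  # accept the upload here; reject everything else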

That means only perfect matches are accepted. It also means that your server will reject valid files which don't have the correct mime type for some reason (missing header, magic failed to recognize the file, you forgot to mention the mime type in your list).
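
A slightly fuller sketch of such an exact-match check is shown here. The entries in `ALLOWED_MIME_TYPES` and the function name `is_allowed` are placeholders invented for this example, not a recommended list.

```python
# Hypothetical allowlist; the entries are examples only.
ALLOWED_MIME_TYPES = frozenset({"text/plain", "text/x-python", "application/pdf"})


def is_allowed(mime_type, allowed=ALLOWED_MIME_TYPES):
    """Accept only exact matches, ignoring parameters such as '; charset=utf-8'."""
    if not mime_type:
        return False  # detection failed or returned nothing
    base = mime_type.split(";", 1)[0].strip().lower()
    return base in allowed
```

Anything the detector fails to recognize falls through to `False`, which is the strict behaviour described above.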

Or to put it another way: Why do you check the mime type of the file if you don't really care?

[EDIT] When you say

I need to know for each file, if I can safely display its textual representation as plain text.

then this isn't as easy as it sounds. First of all, "text" files have no encoding stored in them, so you will need to know the encoding that the user used when they created the file. This isn't a trivial task. There are heuristics to do so but things get hairy when encodings like ISO 8859-1 and 8859-15 are used (the latter has the Euro symbol).

To fix this, you will need to force your users to either save the text files in a specific encoding (UTF-8 is currently the best choice) or you need to supply a form into which users will have to paste the text.

When using a form, the user can see whether the text is encoded correctly (they see it on the screen), they can fix any problems and you can make sure that the browser sends you the text encoded with UTF-8.
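
If you do require UTF-8, the server can at least verify that an upload decodes cleanly. This is a minimal sketch; the function name is made up, and passing this check does not prove the content is meaningful text, since NUL bytes are valid UTF-8 as well.

```python
def decode_utf8_or_none(data):
    """Return the decoded text if `data` is valid UTF-8, otherwise None."""
    try:
        return data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return None
```

A file saved as ISO 8859-1 that contains umlauts will normally fail here, which gives you a chance to ask the user to re-save it.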

If you can't do that, your only choice is to check for any bytes below 0x20 in the input with the exception of \r, \n and \t. That is a pretty good check for "is this a text document".
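
A rough sketch of that byte check follows. The function name and the `sample_size` limit are assumptions made for this example.

```python
ALLOWED_CONTROL = {0x09, 0x0A, 0x0D}  # \t, \n, \r


def has_no_control_bytes(data, sample_size=8192):
    """Return True if the first `sample_size` bytes contain no byte below 0x20
    other than tab, line feed and carriage return."""
    sample = data[:sample_size]
    return not any(b < 0x20 and b not in ALLOWED_CONTROL for b in sample)
```

Bytes of 0x80 and above pass unchecked, so the result says nothing about which encoding was used. UTF-16 text contains NUL bytes and would be rejected.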

But when users use umlauts (like when you write an application that is being used world wide), this approach will eventually fail unless you can enforce a specific encoding on the user's side (which you probably can't since you don't trust the user).

[EDIT2] Since you need this to check actual source code: If you want to make sure the source code is "safe", then parse it. Most languages let you parse the code without actually executing it. That would give you some real information (because the parsers know what to look for) and you wouldn't need to make wild guesses :-)
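
For Python submissions, for example, the standard library's `ast` module can parse source without running it. This is only a sketch for one language, and the function name is invented here. It tells you that the text is syntactically valid Python, not that the program is harmless to execute.

```python
import ast


def parses_as_python(source):
    """Return True if `source` parses as Python; the code is never executed."""
    try:
        ast.parse(source)
    except (SyntaxError, ValueError):
        # ValueError covers inputs such as embedded NUL bytes on some versions.
        return False
    return True
```

Binary data pasted into a source field will almost always fail this check, so it also works as a text-versus-binary test for that one language.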

The primary reason I need that information is really just to display the file's contents in a html textarea... — moschlar, Aug 14 '12 at 08:03
See my edits. You can check for binary characters but that will only get you half-way there. — Aaron Digulla, Aug 14 '12 at 08:05
I would think that the detection for `binary` as mime_encoding like I wrote in my answer does something incredibly similar! — moschlar, Aug 14 '12 at 08:06
You will laugh if I tell you the scope of my application: It's about uploading (or pasting - I want to support both methods) source code for automated testing. You wouldn't expect umlauts there, right? - Turns out that german students love them... -.- Anyway, if there was really only source code, I could simply hang on to that and ignore binary data. But the course teachers want to support PDF and picture submissions too... — moschlar, Aug 14 '12 at 08:08
If you want to make sure the source code is "safe", then parse it. Most languages let you parse the code without actually executing it. That would give you some real information (because the parsers know what to look for) and you wouldn't need to make wild guesses :-) — Aaron Digulla, Aug 14 '12 at 08:13

moschlar · Answer 2 · 2012-08-14T08:52:46.607

After playing around a bit, I discovered that I can probably use the Magic(mime_encoding=True) results!

I ran a simple script on my Dropbox folder and grouped the results both by encoding and by extension to check for irregularities.

But it does seem pretty usable by looking for 'binary' in encoding.

I think I will hang on to that, but thank you all.

edited Aug 14 '12 at 08:52

answered Aug 14 '12 at 08:01

This might work as long as your code is only used in the USA. – Aaron Digulla Aug 14 '12 at 08:06
Look at the output: `.tmp` is `unknown-8bit`. `.version` is `None`. Looking for `binary` isn't enough. Trust me, I've written code for international clients; there is no simple solution. 50 years of "simple solutions" created a mess which makes sure of that. :-) – Aaron Digulla Aug 14 '12 at 08:11
You should really group that on the second column to see what different types you get and print the file extensions as a list. Also: Check what your code prints when one file extension produces two different mime types. – Aaron Digulla Aug 14 '12 at 08:15
I hacked something together and ran it on my Dropbox folder, here's the results: https://gist.github.com/3347601#file_types.txt This looks quite good to me. There are some bad boys, e.g. the ``.pdf``, but it's the best, simplest and universal solution I got so far... – moschlar Aug 14 '12 at 08:49
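
Taking the objections in these comments into account, the check on the encoding label could treat more than just `binary` as "not text". The sketch below works on whatever label the detector returned. The function name is made up, and the set of labels comes only from the values mentioned in this thread, so it is probably incomplete.

```python
# Labels reported in the comments above as not safely displayable.
NON_TEXT_ENCODINGS = {"binary", "unknown-8bit"}


def is_text_encoding(encoding):
    """Return True only for labels that are not known to indicate non-text data."""
    if encoding is None:
        return False  # the detector returned no label at all
    return encoding.strip().lower() not in NON_TEXT_ENCODINGS
```

Treating unknown labels as non-text, rather than the other way round, would be the more cautious design if displaying garbage is worse than refusing to display a file.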
