I'm writing some code that relies on the file utility to determine the type of arbitrary files, typically audio files. For the most part it works great. An ogg file, for example, might give the following output:

Ogg data, Vorbis audio, mono, 44100 Hz, ~80000 bps, created by: Xiph.Org libVorbis I (1.0.1)

A simple regexp can classify this as ogg vorbis. But for some other file types, file tries to get clever. An nsf (NES sound format) file, for example, can yield this output:

NES Sound File ("The Legend of Zelda" by Konchano, copyright 1987 Nintendo), version 1, 8 tracks, NTSC

"NES Sound File" is clear enough, but it is followed by a string of unstructured data that is clearly just copied from the file itself. A malicious user could create an nsf file where this string is replaced by something like "Ogg data, Vorbis audio", making classification a lot harder.
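To make the problem concrete, here is a minimal Python sketch of the kind of naive classifier I mean. The function name `classify_unanchored` and the `forged` description are my own illustration (the forged string is hypothetical, not real output from file); only the `genuine` string is the actual output quoted above.

```python
import re

# Naive pattern: matches the Ogg Vorbis signature anywhere in the description.
OGG_PATTERN = re.compile(r"Ogg data, Vorbis audio")


def classify_unanchored(description: str) -> str:
    """Classify a description string from file by searching it for known phrases."""
    if OGG_PATTERN.search(description):
        return "ogg-vorbis"
    if "NES Sound File" in description:
        return "nsf"
    return "unknown"


# Real output quoted earlier in this question.
genuine = (
    "Ogg data, Vorbis audio, mono, 44100 Hz, ~80000 bps, "
    "created by: Xiph.Org libVorbis I (1.0.1)"
)

# Hypothetical description for a crafted nsf file whose title field
# contains the Ogg signature (an assumption, for illustration only).
forged = (
    'NES Sound File ("Ogg data, Vorbis audio" by Konchano, '
    "copyright 1987 Nintendo), version 1, 8 tracks, NTSC"
)

print(classify_unanchored(genuine))
print(classify_unanchored(forged))
```

Both calls return the same label, so the search-anywhere approach cannot tell the crafted nsf file apart from a real ogg file.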

Now let's say I fix this by discarding anything within parentheses (ignoring the fact that the title of the track could itself contain parentheses). Then along comes a Protracker module:

4-channel Protracker module sound data Title: "space_debris"

Again, untrusted data straight from the file, in a different position, now with the prefix "Title:". I can attempt to filter it out, but really this is becoming a hassle.
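For reference, the parenthesis filter I have in mind looks roughly like the sketch below. The helper name `strip_parenthesized` and the `tricky` description are hypothetical (the latter assumes a crafted title containing a closing parenthesis); the other two strings are the outputs quoted above.

```python
import re


def strip_parenthesized(description: str) -> str:
    """Remove every innermost '(...)' group from a description string."""
    return re.sub(r"\([^()]*\)", "", description)


nsf = (
    'NES Sound File ("The Legend of Zelda" by Konchano, '
    "copyright 1987 Nintendo), version 1, 8 tracks, NTSC"
)
protracker = '4-channel Protracker module sound data Title: "space_debris"'

# Hypothetical crafted title with a stray ')' that ends the group early.
tricky = (
    'NES Sound File ("x) Ogg data, Vorbis audio" by nobody, '
    "copyright none), version 1, 8 tracks, NTSC"
)

print(strip_parenthesized(nsf))         # untrusted part removed
print(strip_parenthesized(protracker))  # title survives untouched
print(strip_parenthesized(tricky))      # injected text survives
```

The filter handles the plain nsf case, but it leaves the Protracker title in place, and a title with its own closing parenthesis lets the injected text escape the filter.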

I'm not finding any help in the man page. Is there really no way to tell file not to mix these unsafe strings into its output? Or is file simply not the right tool for this job?