I'm currently working on a project that involves reading `file`'s magic files (without bindings). I'd like to know how to read the file tests directly from the compiled binary `magic.mgc` in another language (like Go), as I'm unsure how its contents should be interpreted.
David Castillo
-
In principle, you can: the source of `file` is [online](https://github.com/file/file), and doesn't do anything you can't reimplement, and `man magic` describes what it's trying to do. The [text source](https://github.com/file/file/tree/master/magic/Magdir) used to generate `magic.mgc` is also online and may be easier to parse. This is a long way from really helping you do that, though. – twotwotwo Dec 15 '15 at 06:38
-
Thanks for your comment, @twotwotwo. Starting out, I'd like to find a way to use the existing `.mgc` file, rather than compiling it myself. If it turns out to be impossible, though, I guess I'd have to haha. – David Castillo Dec 15 '15 at 12:31
-
It looks like this is the actual .mgc compiler and parser, about 70kb and 3000 lines: https://github.com/file/file/blob/master/src/apprentice.c -- not totally impossible to reverse-engineer but it looks like a pain (like, `file` is really doing lots of match types internally). There is, incidentally, a very simple sniffer in https://golang.org/src/net/http/sniff.go – twotwotwo Dec 15 '15 at 19:14
-
I hadn't seen `sniff`! The `file` source code is complex indeed. See my answer, which I got from Christos Zoulas himself in the File mailing list. Thanks for the help! – David Castillo Dec 15 '15 at 20:14
1 Answer
According to Christos Zoulas, the main contributor to `file`:
If you want to use them directly you need to understand the binary format (which changes over time) and load it in your own data structures. [...] The code that parses the file is in apprentice.c. See check_buffer() for the reader and apprentice_compile() for the writer. There is a 4 byte magic number, followed by a 4 byte version number followed by MAGIC_SETS (2) number of 4 byte counts one for each set (ascii, binary) followed by an array of 'struct magic' entries, in native byte format.
So that's the format to expect. Even so, the compiled file still has to be parsed yourself, just like the raw text magic files.

A J

David Castillo