1

I am using boost::filesystem to search and process files in a directory. But instead of processing every file (checked by using boost::filesystem::is_regular_file()) I want to only process text files (or at least ignore binary files).

Is there a way I can achieve that even if files do not have an extension?

I would highly appreciate a platform-independent solution.

Nemo
Paddre
  • Check the first 100 bytes or so of each file for non-textual bytes. Every binary file has some. Or, simply make that a part of your check during processing, and abandon files when you encounter binary bytes. – Robert Harvey Jan 05 '15 at 16:37
  • @RobertHarvey Yep, and so do many text files. E.g. the Unicode BOM, or just random non-ASCII characters in UTF-8 or ISO-8859 or some other encoding. At the very least you need some threshold, say 90% of "textual" (<127?) bytes. – Thomas Jan 05 '15 at 16:39
  • Well, the other way to do it is to identify every possible file type that is *not* a text file. Most binary files and document formats have some sort of magic string or other signature. Personally, I think it's easier just to identify the file as either text or something else. – Robert Harvey Jan 05 '15 at 16:42
  • @Robert Harvey: Since I don't care which type exactly a file has (as long as it is a text file) I think I should look at the first few bytes as you proposed. For the rest (that is if any further distinction is necessary) I would assume that files are distinguishable by their extensions. Assuming the desired files must only contain UTF-8 Characters: How can I make a good guess whether the file is a text file or not? – Paddre Jan 05 '15 at 17:25

4 Answers

4

Use libmagic.

Libmagic is available on all major platforms (and many minor ones).

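#include <boost/filesystem.hpp>
#include <boost/range.hpp>
#include <iostream>
#include <magic.h>

using namespace boost;
namespace fs = filesystem;

int main() {
    // MAGIC_COMPRESS also looks inside compressed files
    auto handle = ::magic_open(MAGIC_NONE|MAGIC_COMPRESS);
    if (!handle) {
        std::cerr << "magic_open failed\n";
        return 1;
    }
    // NULL loads the default magic database
    if (::magic_load(handle, NULL) != 0) {
        std::cerr << "magic_load failed: " << ::magic_error(handle) << "\n";
        ::magic_close(handle);
        return 1;
    }

    for (fs::directory_entry const& x : make_iterator_range(fs::directory_iterator("."), {})) {
        // string() yields a narrow path on every platform; magic_file returns NULL on error
        auto type = ::magic_file(handle, x.path().string().c_str());
        std::cout << x.path() << "\t" << (type? type : "UNKNOWN") << "\n";
    }

    ::magic_close(handle);
}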

Prints, e.g.

sehe@desktop:~/custom/boost/status$ /tmp/test 
"./Jamfile.v2"  ASCII text
"./explicit-failures.xsd"   XML document text
"./expected_results.xml"    XML document text
"./explicit-failures-markup.xml"    XML document text

You can use the flags to control the detail of classification, e.g. MAGIC_MIME:

sehe@desktop:~/custom/boost/status$ /tmp/test 
"./Jamfile.v2"  text/plain; charset=us-ascii
"./explicit-failures.xsd"   application/xml; charset=us-ascii
"./expected_results.xml"    application/xml; charset=us-ascii
"./explicit-failures-markup.xml"    application/xml; charset=utf-8

Or loading just /etc/magic:

sehe@desktop:~/custom/boost/status$ /tmp/test 
"./Jamfile.v2"  ASCII text
"./explicit-failures.xsd"   ASCII text
"./expected_results.xml"    ASCII text, with very long lines
"./explicit-failures-markup.xml"    UTF-8 Unicode text
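
Building on the MAGIC_MIME output above, a text/binary decision can be sketched as a small string check. This is only a sketch: `looks_like_text` is a hypothetical helper name, and it assumes the description has the `type/subtype; charset=...` shape shown above and that libmagic reports `charset=binary` for files in which it detects no text encoding.

```cpp
#include <string>

// Hypothetical helper (not part of libmagic): decide from a MIME description
// such as "text/plain; charset=us-ascii" whether the file looks like text.
bool looks_like_text(const std::string& mime) {
    // A "text/..." media type is treated as text outright.
    if (mime.compare(0, 5, "text/") == 0) return true;

    const std::string key = "charset=";
    const auto pos = mime.find(key);
    if (pos == std::string::npos) return false;  // no charset given: be conservative

    // Assumption: "charset=binary" marks content with no recognised text encoding.
    const std::string charset = mime.substr(pos + key.size());
    return charset != "binary";
}
```

It would be fed the result of `magic_file` on a handle opened with `MAGIC_MIME`, after checking that result for NULL.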
sehe
  • Looks promising. I'll give it a try – Paddre Jan 05 '15 at 20:17
  • I think this is exactly what I was looking for :-) Thanks! I modified it by adding the `MAGIC_NO_CHECK_ASCII` flag to `magic_open()` and checking whether the variable `type` equals "data". – Paddre Jan 05 '15 at 21:22
  • I have to add that this method is very expensive in terms of performance. I ran `callgrind` and found out that 60% of my program's cost is caused by `magic_file`. Since I have to compare very many files, I'll try the "guessing" approach by parsing just a few lines of each file and see whether it runs faster (however, I don't expect it to be faster ;-) ) – Paddre Jan 29 '15 at 19:56
  • Also I have to add that the check `type=="data"` works for most (almost all) files I've tested. There are however files that are recognized as type `data` but are in fact binary. – Paddre Feb 18 '15 at 20:33
  • "data" implies binary. Of course when a "binary" file just contains `0x00 0x20` pairs, it will be recognized as text... (and realistically, it _is_ text, even though it can be interpreted as binary) – sehe Feb 18 '15 at 21:26
2

There is no perfect solution.

You can make an educated guess by inspecting the content of the file. Text files often contain just printable ASCII text, which gives you some hint, but they might contain misleading UTF-8 sequences if, for example, the text is written in hieroglyphs. Many file formats contain magic numbers in their headers, but there is no common convention about where the magic number is to be found, so you can easily construct a file containing the magic numbers of 5 different formats, all in their right place.

Sometimes it's really hard to decide what type of a file you see:

cat =13 /*/ >/dev/null 2>&1; echo "Hello, world!"; exit
*
*  This program works under cc, f77, and /bin/sh.
*
*/; main() {
      write(
cat-~-cat
     /*,'(
*/
     ,"Hello, world!"
     ,
cat); putchar(~-~-~-cat); } /*
     ,)')
      end
*/

Is that a sh-script, C source code or f77 source code?

I suggest you take a close look at the source of the `file` command, which makes a best effort at what you are trying to do.
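
As an illustration of the educated guess described above, here is a minimal sketch for the case where the interesting files are expected to be UTF-8. The name `looks_like_utf8_text` is hypothetical, and the check is deliberately simplified: it rejects NUL bytes and malformed lead or continuation bytes, but it does not reject every overlong or surrogate encoding, and it tolerates a multi-byte character cut off at the end of the sample.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true if the sample has no NUL byte and is
// structurally valid UTF-8 (simplified check, see the caveats above).
bool looks_like_utf8_text(const std::string& buf) {
    std::size_t i = 0;
    while (i < buf.size()) {
        const unsigned char c = static_cast<unsigned char>(buf[i]);
        std::size_t extra = 0;                        // continuation bytes expected
        if (c == 0x00) return false;                  // NUL: treat as binary
        else if (c < 0x80) extra = 0;                 // plain ASCII
        else if (c >= 0xC2 && c <= 0xDF) extra = 1;   // 2-byte sequence
        else if (c >= 0xE0 && c <= 0xEF) extra = 2;   // 3-byte sequence
        else if (c >= 0xF0 && c <= 0xF4) extra = 3;   // 4-byte sequence
        else return false;                            // invalid lead byte

        for (std::size_t k = 1; k <= extra; ++k) {
            if (i + k >= buf.size()) return true;     // sample ended mid-character
            const unsigned char cc = static_cast<unsigned char>(buf[i + k]);
            if ((cc & 0xC0) != 0x80) return false;    // not a continuation byte
        }
        i += extra + 1;
    }
    return true;
}
```

Running it on only the first few kilobytes of each file keeps the cost low, at the price of missing binary content that appears later in the file.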

Hans Klünder
  • In answer to your question "is that a sh-script, C source code or f77 source code?"... It is most definitely *text.* – Robert Harvey Jan 05 '15 at 16:59
  • Robert, you are completely right in both your comments, and especially command files are quite hard to classify, as everybody can make up their own language to be used in a command file. – Hans Klünder Jan 05 '15 at 17:02
  • I think Robert Harvey was confused by the expression "command file" before you highlighted the "file" ;-) I got your point. Since most of the files I will process are source code files, I can assume that the files that are interesting for me contain solely UTF-8 characters. The specific type of file is not important. More important is that my program doesn't begin to process binary files. – Paddre Jan 05 '15 at 17:32
  • @HansKlünder I suggest you just take a look at my answer then :) libmagic has been a separate thing since forever (like one doesn't suggest to look at the source of bash for input line editing; you'd look at `libreadline`) – sehe Jan 05 '15 at 18:25
  • By the way, here's what my little test program says for your contrived input: http://paste.ubuntu.com/9678058/ – sehe Jan 05 '15 at 18:35
1

You could steal from less. less considers a file a binary file if more than 5 characters in the first 256 bytes are !isprint(c) && !iscntrl(c) in the current locale.

This too, is a heuristic (which is why less always says "this may be a binary file"), but it is a common one that usually works, and you can adjust the threshold if you're having trouble with some files.
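
The rule described above can be sketched as follows. This takes the description literally and has not been checked against the less source; `may_be_binary` and its parameter names are hypothetical, and the window and threshold are exposed so they can be tuned.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: applies the rule as stated in this answer to the
// start of a file that the caller has already read into `head`.
bool may_be_binary(const std::string& head,
                   std::size_t window = 256,
                   std::size_t threshold = 5) {
    const std::size_t n = head.size() < window ? head.size() : window;
    std::size_t suspicious = 0;
    for (std::size_t i = 0; i < n; ++i) {
        // Cast first: passing a negative char to <cctype> functions is undefined.
        const unsigned char c = static_cast<unsigned char>(head[i]);
        if (!std::isprint(c) && !std::iscntrl(c)) ++suspicious;
    }
    return suspicious > threshold;
}
```

Note that, taken literally, control characters (including NUL) are not counted, so a stricter variant might also count NUL bytes.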

Wintermute
0

Using libmagic, you can find the type of a file. man libmagic gives the detailed info.

For example:

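 ` /* dir_name and filename are assumed to be defined by the caller */
  char fullfilename[4096];                   /* assumed buffer size */
  magic_t myt = magic_open(MAGIC_NONE);
  snprintf(fullfilename, sizeof fullfilename, "%s/%s", dir_name, filename);
  magic_load(myt, NULL);                     /* NULL: default magic database */
  const char *type = magic_file(myt, fullfilename);  /* NULL on error */
  printf("file type is %s\n", type ? type : "unknown");
  magic_close(myt);
 `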