0

One of the projects I'm working on is a small text editor.

At one point, you can load all the sub directories and files in a given directory. The program will add each as a node in a TreeView.

What I want is to add only the files that are readable by a normal text editor.

This code currently adds each file to the tree:

TreeNode navNode = new TreeNode();
navNode.Text = file.Name;
navNode.Tag = file.FullName;

directoryNode.Nodes.Add(navNode);

I know I could easily create an if statement with something like:

if (file.Extension.Equals(".txt", StringComparison.OrdinalIgnoreCase))

but I would have to expand that statement to contain every single extension that it could possibly be.

Is there an easier way to do this? I'm thinking it may have something to do with the mime types or file encoding.

Zeratas
  • It depends on what you mean by "readable by a normal text editor". Once you settle on that, your path will be more clear. For instance: *contains only ASCII characters* or *is a correctly-encoded UTF-8 file that contains only printable characters*. – Michael Petrotta Jan 22 '13 at 01:10
  • I'd say contained only ASCII characters and I can move from there. – Zeratas Jan 22 '13 at 01:12
  • There's no 100% way.. the best you'll get is a combination of extension and sampling the first ~1024 bytes of data to see if it meets your needs. – Simon Whitehead Jan 22 '13 at 01:13
  • What Simon said, sort of. To verify ASCII encoding, you could filter out bytes > 0x7F and those bytes representing control characters, but that'd be slow on large files (you'd be walking through every byte). Better to use some heuristics, like only looking at the first few kilobytes of the file. Be careful about limiting yourself to ASCII - Unicode is pretty prevalent now, and you'll find non-ASCII stuff out there where you least expect it. Don't roll your own equivalent for UTF-8 - it's too hard. Use .NET's built-in stuff - start in `System.Char`. – Michael Petrotta Jan 22 '13 at 01:22
  • I'll try to fool around a bit with some stuff. Thanks! – Zeratas Jan 22 '13 at 01:22
  • An alternative might be to display the first 32 characters or so in the tool tip when the cursor hovers over a node. The user can then make a SWAG (Scientific Wild Ass Guess) as to the readability. – HABO Jan 22 '13 at 01:24
  • What's wrong with a text file that contains text in Chinese? I don't think it will be using UTF-8 encoding. – John Saunders Jan 22 '13 at 01:57

4 Answers

4

There is no general way to figure out what type of information is stored in a file.

Even if you know in advance that it is some sort of text, you may not be able to load it properly unless you also know which encoding was used to create the file.

Note that HTTP gives you some hints about the type of a file through the Content-Type header, but the file system carries no such information.
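Although the encoding cannot be determined in general, one narrower question can be answered with the framework's own decoder: does a sample of the file decode as strict UTF-8? The following is a minimal sketch under that assumption. The class name `Utf8Probe`, its method names, and the 4096-byte sample size are made up for illustration. A positive result is only one signal, since a binary file whose sampled bytes all happen to be below 0x80 will also pass.

```csharp
using System;
using System.IO;
using System.Text;

public static class Utf8Probe
{
    // Hypothetical helper: true if the first `count` bytes decode as strict UTF-8.
    public static bool LooksLikeUtf8(byte[] sample, int count)
    {
        // throwOnInvalidBytes: true makes the decoder throw on malformed
        // sequences instead of silently substituting U+FFFD.
        var strictUtf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false,
                                          throwOnInvalidBytes: true);
        var decoder = strictUtf8.GetDecoder();
        var chars = new char[strictUtf8.GetMaxCharCount(count)];
        try
        {
            // flush: false keeps a multi-byte sequence that is cut off at the
            // end of the sample from being reported as an error.
            decoder.GetChars(sample, 0, count, chars, 0, flush: false);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }

    // Hypothetical wrapper: reads up to sampleSize bytes from the start of the file.
    public static bool FileLooksLikeUtf8(string path, int sampleSize = 4096)
    {
        using (var stream = File.OpenRead(path))
        {
            var buffer = new byte[sampleSize];
            int total = 0, read;
            while (total < buffer.Length &&
                   (read = stream.Read(buffer, total, buffer.Length - total)) > 0)
            {
                total += read;
            }
            return LooksLikeUtf8(buffer, total);
        }
    }
}
```

Files in other encodings (for example UTF-16 or a legacy code page) will usually fail this check even though they are text, so it is best combined with other hints such as the file extension.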

Alexei Levenkov
1

There are a few methods you could use to "best guess" whether or not the file is a text file. Of course, the more encodings you support, the harder this becomes, especially if you plan to support CJK (Chinese, Japanese, Korean) scripts. Let's just start with Encoding.ASCII and Encoding.UTF8 for now.

Fortunately, most non-text files (executables, images, and the like) have a lot of non-parsable characters in their first couple of kilobytes.

What you could do is take a file and scan the first 1-4KB (up to you) and see if any "non-printable" characters come up. This operation shouldn't take much time and will at least give you some certainty of the contents of the file.

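// Requires: using System; using System.IO; using System.Text; using System.Threading.Tasks;
public static async Task<bool> IsValidTextFileAsync(string path,
                                                    int scanLength = 4096)
{
  using(var stream = File.Open(path, FileMode.Open, FileAccess.Read, FileShare.Read))
  using(var reader = new StreamReader(stream, Encoding.UTF8))
  {
    // stream.Length is measured in bytes, so it is only an upper bound
    // on the number of characters that can be read.
    var bufferLength = (int)Math.Min(scanLength, stream.Length);
    var buffer = new char[bufferLength];

    // The number of chars read can be smaller than bufferLength
    // (byte order mark, multi-byte UTF-8 sequences); that is not an error.
    var charsRead = await reader.ReadBlockAsync(buffer, 0, bufferLength);

    for(int i = 0; i < charsRead; i++)
    {
      var c = buffer[i];

      // Tab, line feed and carriage return are control characters too,
      // but they are expected in ordinary text files.
      if(char.IsControl(c) && c != '\t' && c != '\n' && c != '\r')
        return false;
    }

    return true;
  }
}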
Erik
  • No async and no scanLength for small files: `public static bool IsValidTextFile(string path) { using (var stream = System.IO.File.Open(path, System.IO.FileMode.Open, System.IO.FileAccess.Read, System.IO.FileShare.Read)) using (var reader = new System.IO.StreamReader(stream, System.Text.Encoding.UTF8)) { var bytesRead = reader.ReadToEnd(); reader.Close(); return bytesRead.All(c => !char.IsControl(c)); } }` – Rubenisme Jul 07 '16 at 15:16
1

My approach based on @Rubenisme's comment and @Erik's answer.

    // Requires: using System.Linq;
    public static bool IsValidTextFile(string path)
    {
        using (var stream = System.IO.File.Open(path, System.IO.FileMode.Open, System.IO.FileAccess.Read, System.IO.FileShare.Read))
        using (var reader = new System.IO.StreamReader(stream, System.Text.Encoding.UTF8))
        {
            var text = reader.ReadToEnd();
            return text.All(c => // Are all the characters either a:
                c == (char)10  // Line feed (new line)
                || c == (char)13 // Carriage return
                || c == (char)9 // Horizontal tab
                || !char.IsControl(c) // Non-control (regular) character
                );
        }
    }
Qjimbo
0

A hacky way to do it would be to see if the file contains any of the lower control characters (0-31) that aren't forms of white space (carriage return, tab, vertical tab, line feed, and just to be safe null and end of text). If it does, then it is probably binary. If it does not, it is probably text. I haven't done any testing to see what happens when applying this rule to non-ASCII encodings, so you'd have to investigate further yourself :)
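The rule above could be sketched roughly as follows. This is only an illustration: the class name `BinarySniffer`, its method names, and the choice to sample just the first 4096 bytes are assumptions, not something stated in the answer. It works on raw bytes, so no encoding has to be guessed first.

```csharp
using System;
using System.IO;

public static class BinarySniffer
{
    // Low bytes tolerated by the rule: null (0), end of text (3), tab (9),
    // line feed (10), vertical tab (11) and carriage return (13).
    private static bool IsAllowedLowByte(byte b)
    {
        return b == 0 || b == 3 || b == 9 || b == 10 || b == 11 || b == 13;
    }

    // Hypothetical helper: true if any of the first `count` bytes is a
    // control byte (0-31) that is not on the allowed list.
    public static bool LooksBinary(byte[] data, int count)
    {
        for (int i = 0; i < count; i++)
        {
            if (data[i] < 32 && !IsAllowedLowByte(data[i]))
                return true;
        }
        return false;
    }

    // Hypothetical wrapper: samples up to sampleSize bytes from the start of the file.
    public static bool FileLooksBinary(string path, int sampleSize = 4096)
    {
        using (var stream = File.OpenRead(path))
        {
            var buffer = new byte[sampleSize];
            int total = 0, read;
            while (total < buffer.Length &&
                   (read = stream.Read(buffer, total, buffer.Length - total)) > 0)
            {
                total += read;
            }
            return LooksBinary(buffer, total);
        }
    }
}
```

In the question's loop, a node would then be added only when `BinarySniffer.FileLooksBinary(file.FullName)` returns false. As the answer notes, the rule is untested against non-ASCII encodings; UTF-16 text, for instance, can contain low byte values that this check would flag.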

Patashu