I'm trying to create an application that searches through files, much like the search feature in Windows XP. I'm using 4 threads that walk the specified directories and open every file to search it for a string. Each thread does this by calling a static method of a static class. The method looks at the file extension and passes the file to a private method chosen by that extension. So far the class can only read plain text files. Here is the code:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;

namespace Searcher
{
    static public class Searching 
    {
        static public bool Query(string file, string q)
        {
            file = file.ToLower();

            if (file.EndsWith(".txt")) // plain textfiles
            {
                return txt(file, q);
            } // #####################################
            else if (file.EndsWith(".doc"))
            {
                return false;
            } // #####################################
            else if (file.EndsWith(".dll")) // Ignore these
            {
                return false;
            }
            else if (file.EndsWith(".exe")) // Ignore these
            {
                return false;
            }
            else // will try reading as a textfile
            {
                return txt(file, q);
            }
        }

        static private bool txt(string file, string q)
        {
            string contents;
            TextReader read = new StreamReader(file);
            contents = read.ReadToEnd();
            read.Dispose();
            read.Close();

            return contents.ToLower().Contains(q);
        }

        static private bool docx(string file, string q)
        {
            return false;
        }
    }
}

Query reads the extension and then forwards the file for processing. As I have only implemented plain text files so far, there is not much to choose from. Before the search begins I also tell my program to read every file it can.

Now here is my problem: although the reader can only handle plain text files, it also reads images and applications (.exe/.dll). That is expected, since it tries to read everything. The weird thing is that it reports a match for them. I searched those files in Notepad++ and found no matches. I also copied the content out at a breakpoint right after the file is read into the 'contents' variable and searched that, again without a match. So it looks as if String.Contains() does not search the content properly and wrongly believes the query is in the file.

I hope someone knows what the problem could be. The string I searched for was "test", and the program works correctly when searching text files.

Damodaran
Anthony Dekimpe
  • It's not going to be a bug in string.Contains. Take one of the files where the unexpected match was found, and run your program on just that file. In the debugger, see what the contents of the (probably binary) file look like when converted to a string. There will most likely be something in there that is a valid match for the string "test". – Baldrick Oct 31 '13 at 04:54
  • How do you know "test" is not in your exe file? exe files can and do contain strings as well – Szymon Oct 31 '13 at 04:57
  • I tested your code using a text file and 2 Excel files. It worked correctly. It did not find the string 'test' in the Excel files. Try creating an empty Excel file and testing with it. – NoChance Oct 31 '13 at 04:59
  • I did that, but still no match. From the streamreader the data goes into the 'contents'-string, where it's converted to lowercase and then checked. I pulled the data (string format) out after the lowercase had been applied. I then tried searching it in notepad++ to find that there still isn't a match. – Anthony Dekimpe Oct 31 '13 at 04:59
  • Can you post the files that return the false positive? – Noctis Oct 31 '13 at 05:01
  • Found the solution: the files do contain the word test, but capitalized, and my Notepad++ settings weren't quite right. That aside, I'd like to know how to ignore these files, so they won't be searched as plain text files. – Anthony Dekimpe Oct 31 '13 at 05:08
  • The best way to filter out unwanted files is to use filter lists for each type of file (.txt, .inf, .xml, .html), (.jpg, .bmp, .gif, .tiff) and only open those types – tinstaafl Oct 31 '13 at 05:22
  • There are so many file extensions. You could ignore them by only searching files with .txt and/or by letting your user specify which file types to search. – NoChance Oct 31 '13 at 05:22
  • Check this project for using Windows search in .NET applications - [IFilter on Codeplex](http://ifiltercore.codeplex.com/) – Harsh Baid Oct 31 '13 at 05:44
  • If you are working with ASCII characters, you could read the entire file, or portions of it, into byte arrays and look for non-ASCII characters. That way you are likely to recognize non-text files. This won't work with Unicode and other character representations. – David Rector Nov 06 '13 at 01:07
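
The byte-level check suggested in the last comment could look roughly like the sketch below. It is only an illustration under stated assumptions: the class name BinarySniffer, the method LooksBinary, and the 4096-byte sample size are all made up here, and treating a NUL byte as a sign of binary data is an extra heuristic that the comment does not mention. As the comment warns, it will misclassify Unicode text.

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of the original Searching class.
static class BinarySniffer
{
    // Returns true if any of the first `count` bytes is a NUL byte
    // or lies outside the 7-bit ASCII range.
    public static bool LooksBinary(byte[] sample, int count)
    {
        for (int i = 0; i < count; i++)
        {
            byte b = sample[i];
            if (b == 0 || b > 0x7F)
            {
                return true;
            }
        }
        return false;
    }

    // Reads up to `sampleSize` bytes from the start of the file and inspects them.
    // A single Read call may return fewer bytes than requested; that is fine for a sample.
    public static bool LooksBinary(string path, int sampleSize = 4096)
    {
        byte[] buffer = new byte[sampleSize];
        using (FileStream fs = File.OpenRead(path))
        {
            int read = fs.Read(buffer, 0, buffer.Length);
            return LooksBinary(buffer, read);
        }
    }
}
```

Query could call BinarySniffer.LooksBinary(file) before falling back to txt and return false when it reports true.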

2 Answers

Glad you found a solution.

I'd still like to see some of the offending "false positive" files to be able to have a look.

In the meantime, and as a bit of a tangent (but still relevant), I'd change your txt function to:

static private bool txt(string file, string q)
{
    string contents = "";
    // The using block disposes (and thereby closes) the reader, even if ReadToEnd throws.
    using (TextReader read = new StreamReader(file))
    {
        contents = read.ReadToEnd();
    }

    return contents.ToLower().Contains(q);
}

Cleaner that way.

Edit:
Well, the reason they return true is that those files do contain the string "Test", specifically [CCP_TEST RMCCPSearchValidateProductIDSetODBCFoldersAllocateRegistrySpaceNOT] in the MSI and [OnUpdateString] in the DLL (the "teSt" spans "Update" and "String"). So String.Contains() is working properly.

So, back to filtering what you're searching. Either keep a list of known text file extensions, or let the user choose which file types to search.
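
A minimal sketch of the extension-list idea, assuming a hard-coded allow-list. The class name ExtensionFilter and the method IsSearchable are hypothetical, and the four extensions are just the text types mentioned in the comments on the question; a real list would be longer or supplied by the user.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not part of the original Searching class.
static class ExtensionFilter
{
    // Allow-list of extensions to treat as plain text (example values only).
    // The comparer makes lookups case-insensitive, so ".TXT" matches ".txt".
    private static readonly HashSet<string> TextExtensions =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            ".txt", ".inf", ".xml", ".html"
        };

    // Returns true only for files whose extension is on the allow-list.
    // Files without an extension are skipped.
    public static bool IsSearchable(string file)
    {
        string ext = Path.GetExtension(file);
        return !string.IsNullOrEmpty(ext) && TextExtensions.Contains(ext);
    }
}
```

With something like this, Query would return false up front when ExtensionFilter.IsSearchable(file) is false, instead of falling through to txt for every unknown extension. Because the comparer ignores case, the file path no longer has to be lowercased first.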

Another thing you might want to consider is matching whole words only, so "test" won't match inside OnUpdateString :)
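
One way to do that is a regular expression with word boundaries. This is only a sketch: the class name WholeWord and the method ContainsWord are made up for illustration, and it assumes the query consists of word characters (letters, digits, underscore), because \b behaves differently next to punctuation.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original Searching class.
static class WholeWord
{
    // Returns true if q occurs in contents as a whole word, ignoring case.
    // Regex.Escape keeps any special characters in q from being read as regex syntax.
    public static bool ContainsWord(string contents, string q)
    {
        string pattern = @"\b" + Regex.Escape(q) + @"\b";
        return Regex.IsMatch(contents, pattern, RegexOptions.IgnoreCase);
    }
}
```

Note that the underscore counts as a word character, so "test" would not match inside CCP_TEST either. Whether that is what you want depends on how you define a word.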

Text extensions: on wiki, on fileinfo

Noctis
  • You're right, I'll change it. As for the false positives, I uploaded them temporarily here: http://temp-share.com/show/3Yg87qk2x http://temp-share.com/show/2gFb92JS8 – Anthony Dekimpe Oct 31 '13 at 05:30

I tried it with a .dll and an .exe file, and it worked fine for me. You are getting true because the value you are searching for is present in the file. Try opening the file in Notepad and searching for the value.

Also try searching for some other string, like "eafrd", instead of "test" (a dictionary word that can easily be present in .dll or .exe files). That returned false for me.

Also, pick any random word from the file you opened in Notepad and try searching for it.

Dalton