4

Edit: I have received a few very good suggestions. I will work through them and accept an answer at some point.

I have a large list of strings (800k) that I would like to filter as quickly as possible against a list of unwanted words (ultimately profanity, but it could be anything).

The result I would ultimately like to see is a list such as

Hello,World,My,Name,Is,Yakyb,Shell

would become

World,My,Name,Is,Yakyb

after being checked against

Hell,Heaven.

My code so far is:

var words = items
    .Distinct()
    .AsParallel()
    .Where(x => !WordContains(x, WordsUnwanted));

public static bool WordContains(string word, List<string> words)
{
    for (int i = 0; i < words.Count; i++)
    {
        if (word.Contains(words[i]))
        {
            return true;
        }
    }
    return false;
}

This currently takes about 2.3 seconds (9.5 without parallel) to process 800k words, which as a one-off is no big deal. However, as a learning exercise, is there a quicker way of processing this?

The unwanted words list is 100 words long.
None of the words contain punctuation or spaces.

  1. Step taken to remove duplicates in all lists.
  2. Step to see if working with an array is quicker (it isn't). Interestingly, changing the parameter words to a string[] makes it 25% slower.
  3. Step adding AsParallel() has reduced the time to ~2.3 seconds.
RoughPlace

5 Answers

1

Try the method called Except.

http://msdn.microsoft.com/en-AU/library/system.linq.enumerable.except.aspx

var words = new List<string>() {"Hello","Hey","Cat"};
var filter = new List<string>() {"Cat"};

var filtered = words.Except(filter);
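
To make the difference between exact matching and substring matching concrete, here is a minimal sketch using the sample data from the question. The class and method names (ExceptVersusSubstring, ExactFilter, SubstringFilter) are made up for illustration, and case-insensitive comparison is an assumption based on the question's expected output, where "Shell" is removed by "Hell".

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper names, used only for this illustration.
public static class ExceptVersusSubstring
{
    // Removes only words that equal an unwanted word as a whole string.
    // Except has set semantics, so duplicates in 'words' are dropped too.
    public static List<string> ExactFilter(List<string> words, List<string> unwanted)
    {
        return words.Except(unwanted, StringComparer.OrdinalIgnoreCase).ToList();
    }

    // Removes every word that contains an unwanted word anywhere inside it.
    public static List<string> SubstringFilter(List<string> words, List<string> unwanted)
    {
        return words
            .Where(w => !unwanted.Any(u => w.IndexOf(u, StringComparison.OrdinalIgnoreCase) >= 0))
            .ToList();
    }
}
```

With the question's input (Hello, World, My, Name, Is, Yakyb, Shell) and unwanted list (Hell, Heaven), ExactFilter keeps all seven words because none of them equals "Hell" or "Heaven", while SubstringFilter keeps World, My, Name, Is, Yakyb.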

Also how about:

var words = new List<string>() {"Hello","Hey","cat"};
var filter = new List<string>() {"Cat"};
// Perhaps an Except() here to match exact strings without substrings first?
// AsParallel() has to come before Where() for the filtering itself to run in parallel.
var filtered = words.AsParallel().Where(i => !ContainsAny(i, filter));
// You could experiment with AsParallel() and see
// if running the query in parallel yields faster results on a larger string[].
// AsParallel is probably not worth the cost unless the list is large.
public bool ContainsAny(string str, IEnumerable<string> values)
{
   // Both conditions must hold: a null str would throw on IndexOf below.
   if (!string.IsNullOrEmpty(str) && values.Any())
   {
       foreach (string value in values)
       {
             // Ignore case comparison from @TimSchmelter
             if (str.IndexOf(value, StringComparison.OrdinalIgnoreCase) != -1) return true;

             //if(str.ToLowerInvariant().Contains(value.ToLowerInvariant()))
             // return true;
       }
   }

   return false;
}
Jeremy
  • Do you have information about the Big-O performance of `Except`? I did not see any on MSDN. – Eric J. Feb 22 '13 at 22:38
  • 3
    `Except` is probably not what he wants to use, because it only matches on exact strings, not substrings. – Andrew Mao Feb 22 '13 at 22:39
  • 1
    Reflector or use shared source to see what it does under the hood. – Jeremy Feb 22 '13 at 22:39
  • @AndrewMao you are right, there is no contains on that and no culture setting ability for the string comparison. – Jeremy Feb 22 '13 at 22:41
  • @EricJ. http://stackoverflow.com/questions/2799427/what-guarantees-are-there-on-the-run-time-complexity-big-o-of-linq-methods – Jeremy Feb 22 '13 at 22:53
  • @Jeremy Child I wasn't the one who asked :) – Andrew Mao Feb 22 '13 at 22:53
  • 1
    Use `if (str.IndexOf(value, StringComparison.OrdinalIgnoreCase) != -1) return true;` instead. Better, parameterize the `StringComparison`. – Tim Schmelter Feb 22 '13 at 22:55
  • Btw, here's the method as a one-liner: `values.Any(s => str.IndexOf(s, StringComparison.OrdinalIgnoreCase) != -1);` – Tim Schmelter Feb 22 '13 at 23:01
  • I like the one-liner; however, this method filters out 25k fewer words than mine (words which should be filtered). Performance is roughly equal. – RoughPlace Feb 22 '13 at 23:08
  • @Yakyb I am using GuessWho to generate 600,000 names in one list and filtering out 50,000 other generated names from that list in the *same second*. The name list generation takes about 10 seconds, but that has nothing to do with the filtering. https://github.com/caleb-vear/GuessWho – Jeremy Feb 22 '13 at 23:14
1

A couple of things:

Alteration 1 (nice and simple): I was able to speed the run (fractionally) by using HashSet over the Distinct method.

var words = new HashSet<string>(items) //this uses HashCodes
        .AsParallel()...

Alteration 2 (bear with me ;) ): regarding @Tim's comment, Contains may not be precise enough when searching for blacklisted words. For example, Takeshita is a street name.

You have already identified that you would like the base form (the stem) of the word; for example, Apples would be treated as Apple. To do this we can use a stemming algorithm such as the Porter Stemmer.

If we stem a word then we may not need Contains(x); we can use Equals(x) or, even better, compare the hash codes (the fastest way).

var filter = new HashSet<string>(
    new[] {"hello", "of", "this", "and", "for", "is", 
        "bye", "the", "see", "in", "an", 
        "top", "v", "t", "e", "a" }); 

var list = new HashSet<string> (items)
            .AsParallel()
            .Where(x => !filter.Contains(new PorterStemmer().Stem(x)))
            .ToList();

This finds candidates by hash code (int == int) and only then confirms with a string equality check.

Using the stemmer did not slow things down, because we paired it with a HashSet for the filter list (O(1) lookups). It also returned a larger list of results.

I am using the Porter Stemmer located in the Lucene.Net code. It is not thread-safe, so we new one up each time.

Issue with Alteration 2 (Alteration 2a): as with most natural language processing, it's not simple. What happens when

  1. the word is a combination of banned words, e.g. "GrrArgh" (where Grr and Argh are banned)
  2. the word is intentionally misspelt, e.g. "Frack", but still has the same meaning as a banned word (sorry to the forum people)
  3. the word is spelt with spaces, e.g. "G r r"
  4. the banned word is not a word but a phrase (poor example: "son of a Barrel")

Forums use humans to fill these gaps.

Alternatively, a white list could be introduced. (Since you mentioned big O: I believe this costs about 2n^2, as we are checking two lists for every item. Do not forget to drop the leading constant, which, if I remember correctly, leaves n^2, but I'm a little rusty on my big O.)

dbones
1

Change your WordContains method to use a single Aho-Corasick search instead of ~100 Contains calls (and, of course, initialize the Aho-Corasick search tree just once).

You can find an open-source implementation here: http://www.codeproject.com/script/Articles/ViewDownloads.aspx?aid=12383.

After initialization of the StringSearch class, you call the method public bool ContainsAny(string text) for each of your 800k strings.

A single call takes O(length of the string) time, no matter how long your list of unwanted words is.
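
For readers who want to see the idea without downloading anything, below is a minimal, self-contained sketch of an Aho-Corasick matcher. It is not the CodeProject StringSearch class; the name BadWordMatcher and its layout are made up for illustration, and folding characters to upper case for case-insensitive matching is an assumption based on the question's sample output.

```csharp
using System;
using System.Collections.Generic;

// Illustrative Aho-Corasick matcher (hypothetical class, not the linked library).
public sealed class BadWordMatcher
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Next = new Dictionary<char, Node>();
        public Node Fail;     // longest proper suffix of this path that is also a trie path
        public bool IsMatch;  // an unwanted word ends here (directly or via a suffix)
    }

    private readonly Node root = new Node();

    // Assumption: matching ignores case, so every character is folded the same way.
    private static char Fold(char c) { return char.ToUpperInvariant(c); }

    public BadWordMatcher(IEnumerable<string> unwanted)
    {
        // 1. Build a trie of all unwanted words.
        foreach (string word in unwanted)
        {
            if (string.IsNullOrEmpty(word)) continue;
            Node node = root;
            foreach (char raw in word)
            {
                char c = Fold(raw);
                Node child;
                if (!node.Next.TryGetValue(c, out child))
                {
                    child = new Node();
                    node.Next[c] = child;
                }
                node = child;
            }
            node.IsMatch = true;
        }

        // 2. Breadth-first pass that sets the failure links.
        var queue = new Queue<Node>();
        foreach (Node child in root.Next.Values)
        {
            child.Fail = root;
            queue.Enqueue(child);
        }
        while (queue.Count > 0)
        {
            Node node = queue.Dequeue();
            foreach (KeyValuePair<char, Node> edge in node.Next)
            {
                Node fail = node.Fail;
                while (fail != root && !fail.Next.ContainsKey(edge.Key)) fail = fail.Fail;
                Node target;
                edge.Value.Fail = fail.Next.TryGetValue(edge.Key, out target) ? target : root;
                // A word ending at the failure target also ends at this position.
                edge.Value.IsMatch |= edge.Value.Fail.IsMatch;
                queue.Enqueue(edge.Value);
            }
        }
    }

    // Scans the text once and stops at the first unwanted word found.
    public bool ContainsAny(string text)
    {
        Node node = root;
        foreach (char raw in text)
        {
            char c = Fold(raw);
            while (node != root && !node.Next.ContainsKey(c)) node = node.Fail;
            Node next;
            if (node.Next.TryGetValue(c, out next)) node = next;
            if (node.IsMatch) return true;
        }
        return false;
    }
}
```

In the question's query this would be built once, e.g. var matcher = new BadWordMatcher(WordsUnwanted);, and then used as .Where(x => !matcher.ContainsAny(x)). The matcher is only read after construction, so sharing one instance across the AsParallel() workers should be fine, though whether it beats the simple loop for only 100 unwanted words is worth measuring.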

Tomas Grosup
0

Ah, filtering words based on matches from a "bad" list. This is a clbuttic problem that has tested the consbreastution of many programmers. My mate from Scunthorpe wrote a dissertation on it.

What you really want to avoid is a solution that tests a word in O(lm), where l is the length of the word to test and m is the number of bad words. In order to do this, you need a solution other than looping through the bad words. I had thought that a regular expression would solve this, but I forgot that typical implementations have an internal data structure that is increased at every alternation. As one of the other solutions says, Aho-Corasick is the algorithm that does this. The standard implementation finds all matches, yours would be more efficient since you could bail out at the first match. I think this provides a theoretically optimal solution.

bmm6o
0

I was interested to see if I could come up with a faster way of doing this, but I only managed one little optimization: checking the index of a string occurring within another. Firstly, it seems to be slightly faster than 'contains', and secondly, it lets you specify case insensitivity (if that is useful to you).

Included below is a test class I wrote. I have used more than 1 million words and am searching with a case-insensitive test in all cases. It tests your method, and also a regular expression that I build up on the fly. You can try it for yourself and see the timings; the regular expression doesn't work as fast as the method you provided, but then I could be building it incorrectly. I use (?i) before (word1|word2...) to specify case insensitivity in the regular expression (I would love to find out how that could be optimised; it's probably suffering from the classic backtracking problem!).

The searching methods (be it regular expressions or the original method provided) seem to get progressively slower as more 'unwanted' words are added.

Anyway - hope this simple test helps you out a bit:

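using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

class Program
{
    static void Main(string[] args)
    {
        //Load your string here - I got War and Peace from Project Gutenberg (http://www.gutenberg.org/ebooks/2600.txt.utf-8) and loaded it twice to give 1.2 million words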
        List<string> loaded = File.ReadAllText(@"D:\Temp\2600.txt").Split(new string[] { " " }, StringSplitOptions.RemoveEmptyEntries).ToList();

        List<string> items = new List<string>();
        items.AddRange(loaded);
        items.AddRange(loaded);

        Console.WriteLine("Loaded {0} words", items.Count);

        Stopwatch sw = new Stopwatch();

        List<string> WordsUnwanted = new List<string> { "Hell", "Heaven", "and", "or", "big", "the", "when", "ur", "cat" };
        StringBuilder regexBuilder = new StringBuilder("(?i)(");

        foreach (string s in WordsUnwanted)
        {
            regexBuilder.Append(s);
            regexBuilder.Append("|");
        }
        regexBuilder.Replace("|", ")", regexBuilder.Length - 1, 1);
        string regularExpression = regexBuilder.ToString();
        Console.WriteLine(regularExpression);

        List<string> words = null;

        bool loop = true;

        while (loop)
        {
            Console.WriteLine("Enter test type - 1, 2, 3, 4 or Q to quit");
            ConsoleKeyInfo testType = Console.ReadKey();

            switch (testType.Key)
            {
                case ConsoleKey.D1:
                    sw.Reset();
                    sw.Start();
                    words = items
                        .Distinct()
                        .AsParallel()
                        .Where(x => !WordContains(x, WordsUnwanted)).ToList();

                    sw.Stop();
                    Console.WriteLine("Parallel (original) process took {0}ms and found {1} matching words", sw.ElapsedMilliseconds, words.Count);
                    words = null;
                    break;

                case ConsoleKey.D2:
                    sw.Reset();
                    sw.Start();
                    words = items
                        .Distinct()
                        .Where(x => !WordContains(x, WordsUnwanted)).ToList();

                    sw.Stop();
                    Console.WriteLine("Non-Parallel (original) process took {0}ms and found {1} matching words", sw.ElapsedMilliseconds, words.Count);
                    words = null;
                    break;

                case ConsoleKey.D3:
                    sw.Reset();
                    sw.Start();
                    words = items
                        .Distinct()
                        .AsParallel()
                        .Where(x => !Regex.IsMatch(x, regularExpression)).ToList();

                    sw.Stop();
                    Console.WriteLine("Non-Compiled regex (parallel) Process took {0}ms and found {1} matching words", sw.ElapsedMilliseconds, words.Count);
                    words = null;
                    break;

                case ConsoleKey.D4:
                    sw.Reset();
                    sw.Start();
                    words = items
                        .Distinct()
                        .Where(x => !Regex.IsMatch(x, regularExpression)).ToList();

                    sw.Stop();
                    Console.WriteLine("Non-Compiled regex (non-parallel) Process took {0}ms and found {1} matching words", sw.ElapsedMilliseconds, words.Count);
                    words = null;
                    break;

                case ConsoleKey.Q:
                    loop = false;
                    break;

                default:
                    continue;
            }
        }
    }

    public static bool WordContains(string word, List<string> words)
    {
        for (int i = 0; i < words.Count(); i++)
        {
            //Found that this was a bit faster and also lets you check the casing...!
            //if (word.Contains(words[i]))
            if (word.IndexOf(words[i], StringComparison.InvariantCultureIgnoreCase) >= 0)
                return true;
        }
        return false;
    }
}
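
On the open question of how the regular expression could be built differently, one option to try is constructing a single Regex object up front with RegexOptions.Compiled, instead of passing the pattern string to the static Regex.IsMatch on every call. The sketch below is only a suggestion to measure, not a claim that it beats the IndexOf loop; the names RegexFilter, BuildUnwantedRegex and Filter are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the test class above.
public static class RegexFilter
{
    // Builds one case-insensitive alternation (word1|word2|...) and compiles it once.
    // Regex.Escape guards against unwanted words that contain regex metacharacters.
    // Note: an empty 'unwanted' list yields an empty pattern, which matches everything.
    public static Regex BuildUnwantedRegex(IEnumerable<string> unwanted)
    {
        string pattern = string.Join("|", unwanted.Select(w => Regex.Escape(w)));
        return new Regex(pattern,
            RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);
    }

    // Keeps only the distinct items that contain none of the unwanted words.
    public static List<string> Filter(IEnumerable<string> items, Regex unwanted)
    {
        return items.Distinct().Where(x => !unwanted.IsMatch(x)).ToList();
    }
}
```

A Regex instance is safe to share between threads, so the same object could also be used inside the AsParallel() variants of the test.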
Jay