Edit: I have received a few very good suggestions; I will try to work through them and accept an answer at some point.
I have a large list of strings (800k) that I would like to filter, in the quickest time possible, against a list of unwanted words (ultimately profanity, but it could be anything).
The result I would ultimately like to see is that a list such as
Hello,World,My,Name,Is,Yakyb,Shell
would become
World,My,Name,Is,Yakyb
after being checked against
Hell,Heaven
My code so far is:
var words = items
    .Distinct()                                   // remove duplicates before filtering
    .AsParallel()                                 // spread the filter across cores
    .Where(x => !WordContains(x, WordsUnwanted)); // keep only words with no banned substring

public static bool WordContains(string word, List<string> words)
{
    // linear scan of the banned list, bailing out on the first hit
    for (int i = 0; i < words.Count; i++) // Count property rather than the Count() extension method
    {
        if (word.Contains(words[i]))
        {
            return true;
        }
    }
    return false;
}
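For anyone wanting to reproduce this, here is a minimal self-contained driver built around the snippet, assuming the names items and WordsUnwanted from the code and the sample data from the example above. One caveat worth noting: Contains is an ordinal, case-sensitive match, so as written "Shell" actually survives; dropping it as the example implies would need a case-insensitive check.

using System;
using System.Collections.Generic;
using System.Linq;

class Demo
{
    public static bool WordContains(string word, List<string> words)
    {
        for (int i = 0; i < words.Count; i++)
        {
            if (word.Contains(words[i]))
            {
                return true;
            }
        }
        return false;
    }

    static void Main()
    {
        // sample data taken from the example above
        var items = new List<string> { "Hello", "World", "My", "Name", "Is", "Yakyb", "Shell" };
        var WordsUnwanted = new List<string> { "Hell", "Heaven" };

        var words = items
            .Distinct()
            .AsParallel()
            .Where(x => !WordContains(x, WordsUnwanted));

        // PLINQ does not guarantee output order. With ordinal Contains,
        // "Hello" is dropped but "Shell" is kept (it contains "hell", not "Hell");
        // matching the example exactly would need something like
        // word.IndexOf(bad, StringComparison.OrdinalIgnoreCase) >= 0.
        Console.WriteLine(string.Join(",", words));
    }
}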
This currently takes about 2.3 seconds (9.5 without parallel) to process the 800k words, which as a one-off is no big deal. However, as a learning exercise, is there a quicker way of processing?
The unwanted-words list is 100 words long.
None of the words contain punctuation or spaces.
- Step taken to remove duplicates in all lists.
- Step taken to see if working with an array is quicker (it isn't); interestingly, changing the parameter words to a string[] makes it 25% slower (a sketch of that variant follows this list).
- Step taken: adding AsParallel() has reduced the time to ~2.3 seconds.
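For clarity, the array experiment in the second step was presumably a variant along these lines (my reconstruction, not necessarily the exact code that was measured):

// banned list passed as string[] instead of List<string>;
// this version measured roughly 25% slower in the test above
public static bool WordContains(string word, string[] words)
{
    for (int i = 0; i < words.Length; i++)
    {
        if (word.Contains(words[i]))
        {
            return true;
        }
    }
    return false;
}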