remove stop words from text C#

Question

I want to remove an array of stop words from an input string, and I have the following procedure:

string[] arrToCheck = new string[] { "try ", "yourself", "before " };

string input = "Did you try this yourself before asking";
foreach (string word in arrToCheck)
{
    input = input.Replace(word, "");
}

Is this the best way to do the task, especially when I have 450 stop words and the input string is long? I prefer the Replace method because I want to remove the stop words when they appear in different morphological forms. For example, if the stop word is "do", then "do" should also be deleted from "doing", "does" and so on. Are there any suggestions for better and faster processing? Thanks in advance.

Also have a look at the following link http://stackoverflow.com/questions/4763611/replace-multiple-words-in-string — Dot_Refresh, May 04 '12 at 11:44

score 4 · Accepted Answer · answered May 04 '12 at 11:41


May I suggest a StringBuilder?

http://msdn.microsoft.com/en-us/library/system.text.stringbuilder.aspx

string[] arrToCheck = new string[] { "try ", "yourself", "before " };

StringBuilder input = new StringBuilder("Did you try this yourself before asking");
foreach (string word in arrToCheck )
{
    input.Replace(word, "");
}

Because it does all its processing inside its own data structure and doesn't allocate hundreds of new strings, I believe you will find it to be far more memory efficient.

akatakritos


Still, input is scanned 450 times. – Nicolas Repiquet May 04 '12 at 11:47

Nicolas, do you have a suggestion that doesn't scan for each check word? I can't think of any implementation that avoids that. – akatakritos May 04 '12 at 11:52

AlanT · Answer 2 · 2012-05-04T13:05:56.607

There are a few aspects to this

Premature optimization
The method given works and is easy to understand/maintain. Is it causing a performance problem? If not, then don't worry about it. If it ever causes a problem, then look at it.

Expected Results
In the example, what do you want the output to be?

"Did you this asking"

or

"Did you  this   asking"

You have added spaces to the end of "try" and "before" but not "yourself". Why? Typo?

string.Replace() is case-sensitive. If you care about casing, you need to modify the code.
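
One possible modification, sketched here under the assumption that the stop words are plain literal strings, is to build a single regular expression from them and match it with RegexOptions.IgnoreCase. The names StopWordRemover and RemoveIgnoreCase are made up for illustration and do not come from the question.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class StopWordRemover
{
    // Hypothetical helper: removes every occurrence of any stop word, ignoring case.
    // Like string.Replace(), it also removes matches inside longer words.
    public static string RemoveIgnoreCase(string input, string[] stopWords)
    {
        if (stopWords.Length == 0) return input;

        // Escape each word so characters such as '.' or '+' are matched literally.
        // Longer words go first so that they win over shorter ones at the same position.
        string pattern = string.Join("|",
            stopWords.OrderByDescending(w => w.Length)
                     .Select(w => Regex.Escape(w)));

        return Regex.Replace(input, pattern, "", RegexOptions.IgnoreCase);
    }
}
```

With the array from the question, RemoveIgnoreCase("Did you TRY this Yourself before asking", arrToCheck) would drop "TRY " and "Yourself" as well as "before ". The spacing question raised above still applies.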

Working with partials is messy.
Words change in different tenses. Removing 'do' from 'doing' works, but how about 'take' and 'taking'? The order of the stop words matters because you are changing the input. It is possible (I've no idea how likely, but possible) that a word which was not in the input before a change 'appears' in the input after the change. Do you want to go back and recheck each time?
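
A small sketch of the ordering effect, using made-up strings rather than real stop words. RemoveSequentially is a hypothetical name for the same loop as in the question, wrapped in a method.

```csharp
using System;

public static class ReplaceOrderDemo
{
    // Applies string.Replace() once per stop word, in array order,
    // the same way the loop in the question does.
    public static string RemoveSequentially(string input, string[] stopWords)
    {
        foreach (string word in stopWords)
        {
            input = input.Replace(word, "");
        }
        return input;
    }
}
```

For the input "abcd", the list { "bc", "ad" } leaves an empty string, because removing "bc" creates "ad", which the second pass then removes. The list { "ad", "bc" } leaves "ad", because "ad" did not exist yet when it was checked.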

Do you really need to remove the partials?

Optimizations
The current method is going to work its way through the input string n times, where n is the number of words to be redacted, creating a new string each time a replacement occurs. This is slow.

Using StringBuilder (akatakritos above) will speed that up somewhat, so I would try this first. Retest to see if this makes it fast enough.

LINQ can also be used.

EDIT
Just splitting by ' ' to demonstrate. You would need to allow for punctuation marks as well and decide what should happen with them.
END EDIT

[TestMethod]
public void RedactTextLinqNoPartials() {

    var arrToCheck = new string[] { "try", "yourself", "before" };
    var input = "Did you try this yourself before asking";

    var output = string.Join(" ",input.Split(' ').Where(wrd => !arrToCheck.Contains(wrd)));

    Assert.AreEqual("Did you this asking", output);

}

This will remove all the whole words (and the spaces, so it will not be possible to see where the words were removed from), but without some benchmarking I would not say that it is faster.
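
To allow for punctuation, one option (an assumption on my part, not something the test above does) is to match runs of word characters with a regular expression and leave everything else untouched. WholeWordRedactor and RemoveWholeWords are hypothetical names.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class WholeWordRedactor
{
    // Removes whole words only. Spaces and punctuation are left where they are,
    // so it stays visible where a word was removed.
    public static string RemoveWholeWords(string input, IEnumerable<string> stopWords)
    {
        var lookup = new HashSet<string>(stopWords);

        // \w+ matches each run of letters, digits and underscores.
        // The evaluator replaces a match with "" only if it is a stop word.
        return Regex.Replace(input, @"\w+",
            m => lookup.Contains(m.Value) ? "" : m.Value);
    }
}
```

Note that \w+ splits on apostrophes, so "don't" is seen as "don" and "t". Whether that is acceptable depends on the language and the stop word list.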

Handling partials with LINQ becomes messy, but it can work if we only want one pass (no checking for 'discovered' words):

[TestMethod]
public void RedactTextLinqPartials() {

    var arrToCheck = new string[] { "try", "yourself", "before", "ask" };
    var input = "Did you try this yourself before asking";

    var output = string.Join(" ", input.Split(' ').Select(wrd => {
        var found = arrToCheck.FirstOrDefault(chk => wrd.IndexOf(chk) != -1);
        return found != null
               ? wrd.Replace(found, "")
               : wrd;
    }).Where(wrd => wrd != ""));


    Assert.AreEqual("Did you this ing", output);

}

Just from looking at this I would say that it is slower than the string.Replace() but without some numbers there is no way to tell. It is definitely more complicated.

Bottom Line
The String.Replace() approach (modified to use StringBuilder and to be case-insensitive) looks like a good first-cut solution. Before trying anything more complicated I would benchmark it under realistic load.
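
Such a benchmark could be sketched roughly as follows, using System.Diagnostics.Stopwatch. The class and method names are hypothetical, and a real measurement should use your actual 450 words and representative input text.

```csharp
using System;
using System.Diagnostics;
using System.Text;

public static class ReplaceBenchmark
{
    // Variant 1: plain string.Replace(), allocating a new string per replacement.
    public static string WithString(string input, string[] stopWords)
    {
        foreach (string word in stopWords) input = input.Replace(word, "");
        return input;
    }

    // Variant 2: StringBuilder.Replace(), working on one mutable buffer.
    public static string WithStringBuilder(string input, string[] stopWords)
    {
        var sb = new StringBuilder(input);
        foreach (string word in stopWords) sb.Replace(word, "");
        return sb.ToString();
    }

    // Runs the action the given number of times and returns the elapsed time.
    public static TimeSpan Measure(Action action, int iterations)
    {
        action(); // warm-up run so that JIT compilation is not measured
        var watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++) action();
        watch.Stop();
        return watch.Elapsed;
    }
}
```

A call such as ReplaceBenchmark.Measure(() => ReplaceBenchmark.WithString(text, words), 1000) can then be compared against the same call with WithStringBuilder. Only the relative numbers on your own data are meaningful.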

hth,
Alan.

For the "take" and "taking" example, it will not be a problem, because I am working with another language that does not have the issue mentioned earlier. — Dedar, May 04 '12 at 13:10
@Dedar at the moment, until performance issues arise (if they do) I would go with the modified Replace(), using stringbuilder and allowing for different cases. — AlanT, May 04 '12 at 13:49

Branko Dimitrijevic · Answer 3 · 2012-05-04T21:00:25.643

Here you go:

var words_to_remove = new HashSet<string> { "try", "yourself", "before" };
string input = "Did you try this yourself before asking";

string output = string.Join(
    " ",
    input
        .Split(new[] { ' ', '\t', '\n', '\r' /* etc... */ })
        .Where(word => !words_to_remove.Contains(word))
);

Console.WriteLine(output);

This prints:

Did you this asking

The HashSet provides average constant-time lookups, so 450 elements in words_to_remove should be no problem at all. Also, we are traversing the input string only once (instead of once per word to remove as in your example).

However, if the input string is very long, there are ways to make this more memory efficient (if not quicker), by not holding the split result in memory all at once.
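
One way this might look, as a sketch only: an iterator that yields one word at a time, feeding a StringBuilder. LazyStopWordFilter and its methods are hypothetical names, and the input is assumed to be split on whitespace only.

```csharp
using System.Collections.Generic;
using System.Text;

public static class LazyStopWordFilter
{
    // Yields whitespace-separated words one at a time instead of
    // building the whole array that string.Split() would return.
    public static IEnumerable<string> Words(string input)
    {
        int start = -1; // index where the current word began, or -1 if between words
        for (int i = 0; i <= input.Length; i++)
        {
            bool atBoundary = i == input.Length || char.IsWhiteSpace(input[i]);
            if (!atBoundary && start < 0)
            {
                start = i;
            }
            else if (atBoundary && start >= 0)
            {
                yield return input.Substring(start, i - start);
                start = -1;
            }
        }
    }

    // Joins the surviving words with single spaces.
    public static string Remove(string input, ISet<string> stopWords)
    {
        var output = new StringBuilder(input.Length);
        foreach (string word in Words(input))
        {
            if (stopWords.Contains(word)) continue;
            if (output.Length > 0) output.Append(' ');
            output.Append(word);
        }
        return output.ToString();
    }
}
```

If the text does not fit in memory at all, the same idea can be applied to a TextReader, reading and filtering one line at a time.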

To remove not just "do" but "doing", "does" etc... you'll have to include all these variants in the words_to_remove. If you wanted to remove prefixes in a general way, this would be possible to do (relatively) efficiently using a trie of words to remove (or alternatively a suffix tree of input string), but what to do when "do" is not a prefix of something that should be removed, such as "did"? Or when it is prefix of something that shouldn't be removed, such as "dog"?
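
A minimal trie sketch for the prefix idea, assuming that any word starting with a stored stop word counts as a match. PrefixTrie and HasPrefixOf are hypothetical names, and this is only one of several ways to build such a structure.

```csharp
using System.Collections.Generic;

public sealed class PrefixTrie
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Children = new Dictionary<char, Node>();
        public bool IsWordEnd; // true if a stored stop word ends at this node
    }

    private readonly Node root = new Node();

    // Stores one stop word, creating nodes along its characters as needed.
    public void Add(string word)
    {
        Node node = root;
        foreach (char c in word)
        {
            Node next;
            if (!node.Children.TryGetValue(c, out next))
            {
                next = new Node();
                node.Children[c] = next;
            }
            node = next;
        }
        node.IsWordEnd = true;
    }

    // True when some stored word is a prefix of (or equal to) the candidate.
    // The cost depends on the candidate's length, not on the number of stop words.
    public bool HasPrefixOf(string candidate)
    {
        Node node = root;
        foreach (char c in candidate)
        {
            if (node.IsWordEnd) return true;
            Node next;
            if (!node.Children.TryGetValue(c, out next)) return false;
            node = next;
        }
        return node.IsWordEnd;
    }
}
```

After Add("do"), HasPrefixOf returns true for "doing" and "does", false for "did", and also true for "dog", which is exactly the kind of false positive described above.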

BTW, to remove words no matter their case, simply pass the appropriate case-insensitive comparer to HashSet constructor, for example StringComparer.CurrentCultureIgnoreCase.
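
For instance, the first snippet could be adapted along these lines. CaseInsensitiveFilter is a hypothetical wrapper name, and the RemoveEmptyEntries option is an addition of mine so that runs of whitespace do not produce empty words.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CaseInsensitiveFilter
{
    public static string Remove(string input, IEnumerable<string> stopWords)
    {
        // The comparer makes Contains("TRY") match the stored "try".
        var lookup = new HashSet<string>(stopWords, StringComparer.CurrentCultureIgnoreCase);

        return string.Join(" ",
            input.Split(new[] { ' ', '\t', '\n', '\r' }, StringSplitOptions.RemoveEmptyEntries)
                 .Where(word => !lookup.Contains(word)));
    }
}
```

StringComparer.OrdinalIgnoreCase is an alternative when culture-specific casing rules are not wanted.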

--- EDIT ---

Here is another alternative:

var words_to_remove = new[] { " ", "try", "yourself", "before" }; // Note the space!
string input = "Did you try this yourself before asking";

string output = string.Join(
    " ",
    input.Split(words_to_remove, StringSplitOptions.RemoveEmptyEntries)
);

I'm guessing it should be slower (unless string.Split uses a hashtable internally), but is nice and tidy ;)

George Johnston · Answer 4 · 2012-05-04T11:56:01.490

For a simple way to remove a list of strings from your sentence, and aggregate the results back together, you can do the following:

var input = "Did you try this yourself before asking";
var arrToCheck = new[] { "try ", "yourself", "before " };
// Split on every stop word. No count argument is passed, because a count
// would cap the number of pieces and leave later stop words in place.
var result = input.Split(arrToCheck, StringSplitOptions.None)
                  .Aggregate((first, second) => first + second);

This will break your original string apart by your word delimiters, and create one final string using the result set from the split array.

The result will be "Did you this  asking" (two spaces remain: the one after "this" and the one before "before").

Dunno if it's efficient, but it's smart! – Nicolas Repiquet May 04 '12 at 11:45

score 0 · Answer 5 · answered May 04 '12 at 11:43


Shorten your code with a lambda expression:

string[] arrToCheck = new string[] { "try ", "yourself", "before " };
var test = new StringBuilder("Did you try this yourself before asking");

// Arrays have no instance ForEach method, so use the static Array.ForEach.
Array.ForEach(arrToCheck, x => test.Replace(x, ""));

Console.WriteLine(test.ToString());


Pranay Rana


do you agree if I used a Hashtable to locate the stop words? – Dedar May 04 '12 at 11:55

score 0 · Answer 6 · edited May 04 '12 at 14:47


// stop is the array of stop words
String.Join(" ", input.Split(' ')
                      .Where(w => stop.Where(sW => sW == w).FirstOrDefault() == null)
                      .ToArray());

answered May 04 '12 at 11:47 by user1208484, edited May 04 '12 at 14:47 by Pranay Rana
