0

I'm trying to use a Dictionary<string, int> to map some words (the int value isn't really relevant). After inserting the words into the dictionary (I checked that they are there), I go over the whole document and look for a specific word.

When I do that, ContainsKey returns false even if the word exists in the dictionary.

What could the problem be, and how can I fix it?

public string RemoveStopWords(string originalDoc){
    string updatedDoc = "";
    string[] originalDocSeperated = originalDoc.Split(' ');
    foreach (string word in originalDocSeperated)
    {
        if (!stopWordsDic.ContainsKey(word))
        {
            updatedDoc += word;
            updatedDoc += " ";
        }
    }
    return updatedDoc.Substring(0, updatedDoc.Length - 1); //Remove Last Space
}

For example: the dictionary contains stop words such as "the". When I get the word "the" from originalDoc and check whether it is missing from the dictionary, execution still enters the if statement, even though both words are written identically (there is no difference in case).

Dictionary<string, int> stopWordsDic = new Dictionary<string, int>();

string stopWordsContent = System.IO.File.ReadAllText(stopWordsPath);
string[] stopWordsSeperated = stopWordsContent.Split('\n');
foreach (string stopWord in stopWordsSeperated)
{
    stopWordsDic.Add(stopWord, 1);
}

The stop-words file contains one word per line.


thank you

Grundy
michal_h

4 Answers

3

This is just a guess (just too long for a comment), but when you are inserting on your Dictionary, you are splitting by \n.

So if the actual splitter in the text file you are using is \r\n, you'd be left with \r's on your inserted keys, thus not finding them on ContainsKey.
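The effect described above can be reproduced with a small sketch. It assumes a hypothetical stop-word text saved with Windows line endings; the names `CarriageReturnDemo` and `LoadSplittingOnLf` are made up for illustration and are not from the question.

```csharp
using System;
using System.Collections.Generic;

static class CarriageReturnDemo
{
    // Builds the dictionary the way the question does: split on '\n' only.
    public static Dictionary<string, int> LoadSplittingOnLf(string content)
    {
        var dic = new Dictionary<string, int>();
        foreach (string stopWord in content.Split('\n'))
        {
            dic[stopWord] = 1; // indexer, so a repeated line does not throw
        }
        return dic;
    }

    static void Main()
    {
        // Assumed file content with "\r\n" line endings (not the asker's real file).
        string content = "the\r\nand\r\nof";
        var dic = LoadSplittingOnLf(content);

        Console.WriteLine(dic.ContainsKey("the"));   // False: the stored key is "the\r"
        Console.WriteLine(dic.ContainsKey("the\r")); // True
        Console.WriteLine(dic.ContainsKey("of"));    // True: nothing follows the last word
    }
}
```

The stray `\r` is invisible in most debugger and console views, which is probably why the keys look identical when inspected.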

So I'd start with string[] stopWordsSeperated = stopWordsContent.Split(new string[] { "\r\n", "\n" }, StringSplitOptions.None); and then trim each entry.


As a side note, if you are not using the dictionary's int values for anything, you'd be better off using a HashSet<string> and Contains instead of ContainsKey.
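Putting the two suggestions together might look roughly like the sketch below. It is only one way to do it: `StopWordFilter`, `ParseStopWords` and the two-argument `RemoveStopWords` are hypothetical names, and the sample text stands in for the real stop-word file. Note that the default `HashSet<string>` comparer is case-sensitive.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class StopWordFilter
{
    // Parses stop-word text that uses either "\r\n" or "\n" line endings.
    public static HashSet<string> ParseStopWords(string content)
    {
        string[] lines = content.Split(
            new[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries);

        // Trim stray whitespace and skip lines that end up empty.
        return new HashSet<string>(
            lines.Select(line => line.Trim()).Where(line => line.Length > 0));
    }

    // Drops every space-separated word that is in the stop-word set.
    public static string RemoveStopWords(string originalDoc, HashSet<string> stopWords)
    {
        return string.Join(" ",
            originalDoc.Split(' ').Where(word => !stopWords.Contains(word)));
    }

    static void Main()
    {
        // Mixed line endings on purpose; stands in for File.ReadAllText(stopWordsPath).
        var stopWords = ParseStopWords("the\r\nand\r\nof\n");
        Console.WriteLine(RemoveStopWords("the cat and the hat", stopWords)); // cat hat
    }
}
```

Using `string.Join` also avoids having to cut off a trailing space, which the original `Substring` call does and which would throw if every word were removed.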

Jcl
  • I think splitting by Environment.NewLine instead of just \n should also help (at least under windows) – Thomas Nov 13 '15 at 09:03
  • @Thomas right, but just in case, as the words file may not come from the running environment, doing it by both `\r\n` and `\n` would be safer I guess and would work for both unixy and windowsy files – Jcl Nov 13 '15 at 09:04
  • @Thomas, depends how this saved in files – Grundy Nov 13 '15 at 09:04
  • @Jcl Thank you very much! The \r\n was the problem! I appreciate it – michal_h Nov 13 '15 at 09:11
  • It was a wild guess, but one we all have done one time or another. Glad you got it working :-) – Jcl Nov 13 '15 at 09:12
1

You have a ! (not) operator in your if statement. You're checking to see if the dictionary does Not contain a key. Remove the exclamation mark from the start of your condition.

Darren Gourley
  • That's what I want to do: check whether it does not contain the word. The problem is that when it does contain it, it returns false too – michal_h Nov 13 '15 at 08:31
  • @michal_h your original post was a bit misleading there. With your updated question and comment, this answer fulfills the original question but not the corrected one. – Thomas Nov 13 '15 at 08:33
  • @michal_h can we see the declaration of `stopWordsDic`? A dictionary must contain a key and a value. Perhaps your actual string is the value? In which case you would need to use: `if (!stopWordsDic.ContainsValue(word)) {...}` – Darren Gourley Nov 13 '15 at 08:44
0

When you create the dictionary you would need to do the following:

var stopWordsDic = new Dictionary<string, int>(
    StringComparer.InvariantCultureIgnoreCase);

The most important part is the InvariantCultureIgnoreCase.
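A minimal sketch of what the comparer changes, using made-up names (`ComparerDemo`, `BuildIgnoringCase`) and sample words rather than the asker's data:

```csharp
using System;
using System.Collections.Generic;

static class ComparerDemo
{
    // Builds a dictionary whose key lookups ignore case.
    public static Dictionary<string, int> BuildIgnoringCase(IEnumerable<string> words)
    {
        var dic = new Dictionary<string, int>(StringComparer.InvariantCultureIgnoreCase);
        foreach (string word in words)
        {
            dic[word] = 1; // "The" and "the" are the same key here, so Add could throw
        }
        return dic;
    }

    static void Main()
    {
        var dic = BuildIgnoringCase(new[] { "the", "and" });
        Console.WriteLine(dic.ContainsKey("The")); // True
        Console.WriteLine(dic.ContainsKey("THE")); // True
        Console.WriteLine(dic.ContainsKey("cat")); // False
    }
}
```

This only addresses differences in case; it is an assumption of this answer that case is involved at all, since the question states the words are written identically.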

public string RemoveStopWords(string originalDoc){
    return String.Join(" ", 
           originalDoc.Split(' ')
              .Where(x => !stopWordsDic.ContainsKey(x))
    );
}

Furthermore you should change how you fill the dictionary (this eliminates all non word symbols from your dictionary when creating it):

// Requires: using System.Text.RegularExpressions;
// Regex that finds the first word inside a string, skipping any
// leading symbols and cutting away everything after the word
Regex validWords = new Regex(@"\b([0-9a-zA-Z]+?)\b");

string stopWordsContent = System.IO.File.ReadAllText(stopWordsPath);
string[] stopWordsSeperated = stopWordsContent.Split('\n');

foreach (string stopWord in stopWordsSeperated)
{
    // The indexer avoids an ArgumentException when two lines yield the same key
    stopWordsDic[validWords.Match(stopWord).Value] = 1;
}
Thomas
0

I see that you're setting 1 as the value for all entries. Maybe a List would better fit your needs:

List<string> stopWordsDic = new List<string>();

string stopWordsContent = System.IO.File.ReadAllText(stopWordsPath);
string[] stopWordsSeperated = stopWordsContent.Split(new[] { Environment.NewLine }, StringSplitOptions.None);
foreach (string stopWord in stopWordsSeperated)
{
    stopWordsDic.Add(stopWord);
}

Then check for element with Contains()

public string RemoveStopWords(string originalDoc){
    string updatedDoc = "";
    string[] originalDocSeperated = originalDoc.Split(' ');
    foreach (string word in originalDocSeperated)
    {
        if (!stopWordsDic.Contains(word))
        {
            // Append the kept word followed by a separating space
            updatedDoc += string.Format("{0} ", word);
        }
    }
    // Remove trailing spaces; unlike Substring, this is safe when nothing was kept
    return updatedDoc.TrimEnd(' ');
}
Phate01
  • 1
    I'd use a `HashSet` which would match the performance of `Dictionary`, rather than a `List` – Jcl Nov 13 '15 at 09:09
  • You can't insert two identical strings into a HashSet, so it would be OK if the OP has no need to insert identical strings – Phate01 Nov 13 '15 at 09:12
  • 2
    A `Dictionary` doesn't allow inserting two identical keys either, so if it works on a dictionary using the keys, it'll work for a `HashSet`. A `Dictionary` is just a `HashSet` for the keys with associated values – Jcl Nov 13 '15 at 09:13