I’m so close, but my program is still not working properly. I am trying to count how many times each word in a set appears in a text file, list those words with their individual counts, and then give the sum of all matched words.

If there are 3 instances of “lorem” and 2 instances of “ipsum”, then the total should be 5. My sample text file is simply a paragraph of “Lorem ipsum” repeated a few times.

My problem is that the code I have so far only counts the first occurrence of each word, even though each word is repeated several times throughout the text file.

I am using a paid parser called “GroupDocs.Parser” that I added through the NuGet package manager. I would prefer not to use a paid library if possible.

Is there an easier way to do this in C#?

Here’s a screen shot of my desired results.

Here is the full code that I have so far.

using GroupDocs.Parser;
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

namespace ConsoleApp5
{
    class Program
    {
        static void Main(string[] args)
        {

            using (Parser parser = new Parser(@"E:\testdata\loremIpsum.txt"))
            {
                // Extract the text into the reader
                using (TextReader reader = parser.GetText())
                {
                    // Define the search terms. 
                    string[] wordsToMatch = { "Lorem", "ipsum", "amet" };

                    Dictionary<string, int> stats = new Dictionary<string, int>();
                    string text = reader.ReadToEnd();
                    char[] chars = { ' ', '.', ',', ';', ':', '?', '\n', '\r' };
                    // split words
                    string[] words = text.Split(chars);
                    int minWordLength = 2;// to count words having more than 2 characters

                    // iterate over the word collection to count occurrences
                    foreach (string word in wordsToMatch)
                    {
                        string w = word.Trim().ToLower();
                        if (w.Length > minWordLength)
                        {
                            if (!stats.ContainsKey(w))
                            {
                                // add new word to collection
                                stats.Add(w, 1);
                            }
                            else
                            {
                                // update word occurrence count
                                stats[w] += 1;
                            }
                        }
                    }

                    // order the collection by word count
                    var orderedStats = stats.OrderByDescending(x => x.Value);


                    // print occurrence of each word
                    foreach (var pair in orderedStats)
                    {
                        Console.WriteLine("Total occurrences of {0}: {1}", pair.Key, pair.Value);

                    }
                    // print total word count
                    Console.WriteLine("Total word count: {0}", stats.Count);
                    Console.ReadKey();
                }
            }
        }
    }
}

Any suggestions on what I'm doing wrong?

Thanks in advance.

tnw
Aubrey Love
  • The code you posted doesn't use any third-party parser. As parsers go, there are *several* parser libraries, from ANTLR to parser combinators like FParsec, Sprache and Pidgin. In this case, though, you could improve your code a lot by using, e.g., [Regex.Split](https://docs.microsoft.com/en-us/dotnet/api/system.text.regularexpressions.regex.split?view=net-5.0) to split on non-word characters. You can create a dictionary with a case-insensitive StringComparer, which would remove the need for `.ToLower()`. You can even use LINQ on the split words with a case-insensitive comparer in `GroupBy` – Panagiotis Kanavos Dec 08 '20 at 16:37
  • Please provide the sample input that you expect to produce the desired output you're showing. And provide the actual output you're getting with the code you've posted. – StriplingWarrior Dec 08 '20 at 17:03
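The suggestion in the first comment (split on non-word characters and keep the counts in a case-insensitive dictionary) might be sketched roughly as follows. The class name `WordCounter`, the method `Count` and the sample sentence are made up for illustration; only `Regex.Split`, `Dictionary` and `StringComparer.OrdinalIgnoreCase` come from the framework.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original post.
public static class WordCounter
{
    // Counts how often each term in wordsToMatch occurs in text, ignoring case.
    public static Dictionary<string, int> Count(string text, IEnumerable<string> wordsToMatch)
    {
        // The case-insensitive comparer replaces the ToLower() calls.
        var stats = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
        foreach (string term in wordsToMatch)
            stats[term] = 0;

        // \W+ splits on runs of non-word characters (spaces, punctuation, line breaks).
        foreach (string word in Regex.Split(text, @"\W+"))
        {
            if (stats.ContainsKey(word))
                stats[word]++;
        }
        return stats;
    }

    public static void Main()
    {
        // Assumed sample input; the real file contents are not shown in the question.
        string text = "Lorem ipsum dolor sit amet. Lorem ipsum, lorem.";
        var stats = Count(text, new[] { "Lorem", "ipsum", "amet" });

        foreach (var pair in stats.OrderByDescending(p => p.Value))
            Console.WriteLine("Total occurrences of {0}: {1}", pair.Key, pair.Value);
        Console.WriteLine("Total word count: {0}", stats.Values.Sum());
    }
}
```

Because the search terms are seeded with a count of 0, terms that never occur still show up in the output.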

3 Answers

Splitting the entire content of the text file to get a string array of the words is not a good idea because doing so will create a new string object in memory for each word. You can imagine the cost when you deal with big files.

An alternative approach is:

using System;
using System.Collections.Concurrent;
using System.Linq;
using System.IO;
using System.Threading.Tasks;
using System.Text.RegularExpressions;

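static void Main(string[] args)
{
    var file = @"loremIpsum.txt";
    // Search terms mapped to their running counts.
    var wordsToMatch = new ConcurrentDictionary<string, int>();

    wordsToMatch.TryAdd("Lorem", 0);
    wordsToMatch.TryAdd("ipsum", 0);
    wordsToMatch.TryAdd("amet", 0);

    Console.WriteLine("Press a key to continue...");
    Console.ReadKey();

    // File.ReadLines streams the file lazily, one line at a time.
    Parallel.ForEach(File.ReadLines(file),
        (line) =>
        {
            foreach (var word in wordsToMatch.Keys)
            {
                // \b restricts matches to whole words; Regex.Escape guards
                // against search terms that contain regex metacharacters.
                var count = Regex.Matches(line, $@"\b{Regex.Escape(word)}\b",
                    RegexOptions.IgnoreCase).Count;
                // AddOrUpdate retries until the update applies cleanly,
                // so no explicit lock is needed.
                wordsToMatch.AddOrUpdate(word, count, (_, current) => current + count);
            }
        });

    foreach (var kv in wordsToMatch.OrderByDescending(x => x.Value))
        Console.WriteLine($"Total occurrences of {kv.Key}: {kv.Value}");

    Console.WriteLine($"Total word count: {wordsToMatch.Values.Sum()}");
    Console.ReadKey();
}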
dr.null
  • Dr.null; Thanks for the feedback and useful links. I wasn’t thinking about the amount of memory used to put each string object in a hold status. Thanks for pointing that out and thanks for the code. Most helpful. – Aubrey Love Dec 09 '20 at 16:36
`stats` is a dictionary, so `stats.Count` only tells you how many distinct words there are. To get the total, you need to add up all the values in it, for example with `stats.Values.Sum()`.
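To make the difference concrete, here is a minimal sketch that prints both numbers. It also loops over the words split from the text rather than over `wordsToMatch`; the posted `foreach (string word in wordsToMatch)` visits each search term exactly once, which would explain why every count stays at 1. The class name `CountVersusSum`, the method `CountMatches` and the sample sentence are invented for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of the original post.
public static class CountVersusSum
{
    public static Dictionary<string, int> CountMatches(string text, string[] wordsToMatch)
    {
        char[] chars = { ' ', '.', ',', ';', ':', '?', '\n', '\r' };
        // Lower-cased search terms for quick membership checks.
        var targets = new HashSet<string>(wordsToMatch.Select(t => t.ToLower()));
        var stats = new Dictionary<string, int>();

        // Loop over the words found in the text, not over the search terms.
        foreach (string word in text.Split(chars, StringSplitOptions.RemoveEmptyEntries))
        {
            string w = word.Trim().ToLower();
            if (!targets.Contains(w))
                continue;
            stats[w] = stats.TryGetValue(w, out int n) ? n + 1 : 1;
        }
        return stats;
    }

    public static void Main()
    {
        // Assumed sample input with 3 x "lorem" and 2 x "ipsum".
        var stats = CountMatches("Lorem ipsum dolor. Lorem ipsum, lorem.",
                                 new[] { "Lorem", "ipsum", "amet" });

        // Count is the number of keys; Values.Sum() adds up the per-word counts.
        Console.WriteLine("Distinct matched words: {0}", stats.Count);
        Console.WriteLine("Total word count: {0}", stats.Values.Sum());
    }
}
```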

StriplingWarrior
  • Thank you for your quick reply. My sum is working as it should; it’s the “matched” word count that is not ticking up for each instance. My results should show “Total occurrences of lorem: 3”, “Total occurrences of ipsum: 2”, “Total word count: 5” – Aubrey Love Dec 08 '20 at 16:43
You can replace this code with a LINQ query that uses case-insensitive grouping. For example:

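char[] chars = { ' ', '.', ',', ';', ':', '?', '\n', '\r' };
var text = File.ReadAllText(somePath);
// RemoveEmptyEntries drops the empty strings left between adjacent separators
var query = text.Split(chars, StringSplitOptions.RemoveEmptyEntries)
                .GroupBy(w => w, StringComparer.OrdinalIgnoreCase)      // group words, ignoring case
                .Select(g => new { word = g.Key, count = g.Count() })
                .Where(stat => stat.count > 2)                           // keep words that occur more than twice
                .OrderByDescending(stat => stat.count);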

At that point you can iterate over the query or copy the results to an array or dictionary with ToArray(), ToList() or ToDictionary().
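As a rough illustration of the `ToDictionary()` option, the sketch below materializes the grouped counts into a case-insensitive dictionary. It leaves out the `Where` and `OrderByDescending` steps, since a dictionary has no guaranteed order; the class name `QueryToDictionary`, the method `Build` and the sample sentence are assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of the original answer.
public static class QueryToDictionary
{
    public static Dictionary<string, int> Build(string text)
    {
        char[] chars = { ' ', '.', ',', ';', ':', '?', '\n', '\r' };
        return text.Split(chars, StringSplitOptions.RemoveEmptyEntries)
                   .GroupBy(w => w, StringComparer.OrdinalIgnoreCase)
                   .Select(g => new { word = g.Key, count = g.Count() })
                   // Pass the same comparer so lookups ignore case as well.
                   .ToDictionary(stat => stat.word, stat => stat.count,
                                 StringComparer.OrdinalIgnoreCase);
    }

    public static void Main()
    {
        // Assumed sample input.
        var counts = Build("Lorem ipsum dolor. Lorem ipsum, lorem.");
        foreach (var pair in counts)
            Console.WriteLine("{0}: {1}", pair.Key, pair.Value);

        // Lookups work regardless of casing.
        Console.WriteLine(counts["LOREM"]);
    }
}
```

Once the counts are in a dictionary, restricting the output to the search terms is a matter of looking up each term and treating a missing key as zero.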

This isn't the most efficient code - for one thing, the entire file is loaded in memory. One could use File.ReadLines to load and iterate over the lines one by one. LINQ could be used to iterate over the lines as well:

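var lines = File.ReadLines(somePath);   // streams the file one line at a time
var query = lines.SelectMany(line => line.Split(chars, StringSplitOptions.RemoveEmptyEntries))
                 .GroupBy(w => w, StringComparer.OrdinalIgnoreCase)
                 .Select(g => new { word = g.Key, count = g.Count() })
                 .Where(stat => stat.count > 2)                          // keep words that occur more than twice
                 .OrderByDescending(stat => stat.count);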
Panagiotis Kanavos