Why is my wc implementation giving wrong word count?

Question

Here is a small code snippet.

 while((c = fgetc(fp)) != -1)
    {
        cCount++; // character count
        if(c == '\n') lCount++; // line count
        else 
        {
            if(c == ' ' && prevC != ' ') wCount++; // word count
        }
        prevC = c; // previous character equals current character. Think of it as memory.
    }

When I run wc on a file containing the snippet above (as is), I get 48 words, but when I run my program on the same input, I get 59 words.

How can I calculate the word count exactly as wc does?

It might be helpful to post the input these results are based on as well. You seem to be assuming that a word always ends with a space. What if it's the end of input, some other whitespace character like \t, or a newline? – GoodDeeds, Nov 02 '17 at 13:31
For one thing, your code counts a newline followed by a space as a word. – Kevin, Nov 02 '17 at 13:35
You don't count words at the end of a line, i.e. the line `Hello World\n` will be counted as 1 line and 1 word. – Support Ukraine, Nov 02 '17 at 13:37
For those who are asking for input, I already mentioned it in the question. My input is the code snippet that I have posted here. I am running the program on the code snippet itself. – theprogrammer, Nov 02 '17 at 13:38
It seems like you should be able to figure out what you count as words which wc doesn't by just running both on smaller parts of the input until you narrow it down, or by debugging. Comparing two bits of code (one of which isn't even in the question, but should be easy for you to find to do the comparison yourself) doesn't seem particularly useful or on topic. – Bernhard Barker, Nov 02 '17 at 13:41
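
The miscounts described in the comments above can be reproduced in isolation. The sketch below applies the question's counting rule to a string instead of a file. The function name `count_words_question_rule` and the starting value of `prev` are assumptions made for illustration, since the question does not show how `prevC` is initialised.

```c
#include <stddef.h>

/* Counts "words" with the rule from the question's loop:
   a word is counted whenever a space follows a character that is
   not a space. Newlines and tabs are not treated as separators.
   Assumption: prev starts as ' ' (the question does not show this). */
int count_words_question_rule(const char *s)
{
    int words = 0;
    int prev = ' ';

    for (size_t i = 0; s[i] != '\0'; i++) {
        int c = (unsigned char)s[i];
        if (c == ' ' && prev != ' ')
            words++;            /* fires after a word, but also after '\n' */
        prev = c;
    }
    return words;
}
```

With this rule, `"Hello World\n"` yields 1 (the word before the newline is missed), `"\n \n"` yields 1 (a newline followed by a space is counted although there is no word), and `"a\tb\n"` yields 0 (tabs and newlines never end a word).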

Kevin · Accepted Answer · 2017-11-02T13:46:49.400

You are treating anything that isn't a space as a valid word. This means that a newline followed by a space is a word, and since your input (which is your code snippet) is indented you get a bunch of extra words.

You should use isspace to check for whitespace instead of comparing the character to ' ':

while((c = fgetc(fp)) != EOF)
{
    cCount++;
    if (c == '\n')
        lCount++;
    if (isspace(c) && !isspace(prevC))
        wCount++;
    prevC = c;
}
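
The loop above counts a word when whitespace follows a non-whitespace character, so it relies on `prevC` having a sensible initial value and it misses a final word that is not followed by whitespace. A minimal sketch of an alternative that counts at the start of each word is shown below. The names `struct counts`, `count_stream` and `in_word` are hypothetical and not part of the answer.

```c
#include <ctype.h>
#include <stdio.h>

/* Hypothetical result type: the three totals that wc prints. */
struct counts {
    unsigned long lines;
    unsigned long words;
    unsigned long chars;
};

/* Reads fp to the end and counts bytes, newlines and words.
   A word starts at the first non-whitespace byte after whitespace
   (or at the start of input), so a last word without a trailing
   newline is still counted. */
struct counts count_stream(FILE *fp)
{
    struct counts n = {0, 0, 0};
    int in_word = 0;    /* 1 while the previous byte was part of a word */
    int c;              /* int, so that EOF differs from every byte value */

    while ((c = fgetc(fp)) != EOF) {
        n.chars++;
        if (c == '\n')
            n.lines++;
        if (isspace(c)) {
            in_word = 0;
        } else if (!in_word) {
            in_word = 1;
            n.words++;
        }
    }
    return n;
}
```

A caller would open the file with `fopen`, pass the stream to `count_stream`, and print the three fields. This sketch counts bytes rather than multibyte characters.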

Chatz · Answer 2 · 2017-11-02T13:49:50.587

There is an example of the function you want in the book "The C Programming Language" (the ANSI C edition) by Brian W. Kernighan and Dennis M. Ritchie. As the authors say, it is a bare-bones version of the UNIX program wc. Altered to count only words, it looks like this:

#include <stdio.h>

#define IN 1 /* inside a word */
#define OUT 0 /* outside a word */

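/* count words in input; nw holds the running total */
int main(void)
{
  int c, nw, state;
  state = OUT;
  nw = 0;
  while ((c = getchar()) != EOF) {
    if (c == ' ' || c == '\n' || c == '\t')
      state = OUT;          /* separator: we have left any word */
    else if (state == OUT) {
      state = IN;           /* first character of a new word */
      ++nw;
    }
  }
  printf("%d\n", nw);
  return 0;
}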

edited Nov 02 '17 at 13:49

answered Nov 02 '17 at 13:41


In the statement `if (c == ' ' || c == '\n' || c = '\t')` you are assigning `'\t'` to `c` instead of checking for equality. – H.S. Nov 02 '17 at 13:49
Thanks for your answer, but for some reason this algorithm doesn't work for some files. For example, I ran it against a compiled C executable and I am getting different values from this program compared to the real wc. Any ideas? – theprogrammer Nov 02 '17 at 17:40
Well, as I see it, this program was designed to work with "readable" ASCII characters (including tabs etc.) that you would normally see in a text file. If you run it on a .c file containing source code, it will probably give the same answer as wc. But running it on a compiled file that contains characters such as SOH, EOT, NUL etc. could cause trouble. It is probably similar to opening a compiled file with Notepad and with Notepad++: you would get different results (special characters printed differently) for the same file. I wouldn't recommend using it on a compiled file. – Chatz Nov 06 '17 at 07:29
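
One possible contributor to the mismatch on binary files, not confirmed anywhere in this thread, is the set of separator characters. The program above treats only space, newline and tab as separators, while `isspace` in the "C" locale also accepts `'\v'`, `'\f'` and `'\r'`, and a compiled file can contain those bytes. The sketch below compares the two rules on the same buffer. The names `is_sep_three`, `is_sep_ctype` and `count_words` are hypothetical, and the real wc may apply further locale-dependent rules.

```c
#include <ctype.h>
#include <stddef.h>

/* Separator test used by the K&R-style program above. */
int is_sep_three(int c)
{
    return c == ' ' || c == '\n' || c == '\t';
}

/* Separator test based on <ctype.h>; in the "C" locale this is also
   true for '\v', '\f' and '\r'. */
int is_sep_ctype(int c)
{
    return isspace(c) != 0;
}

/* Counts words in buf[0..len-1] using the given separator test.
   Takes a length instead of relying on a terminator, because binary
   data may contain NUL bytes. */
size_t count_words(const unsigned char *buf, size_t len, int (*is_sep)(int))
{
    size_t words = 0;
    int in_word = 0;

    for (size_t i = 0; i < len; i++) {
        if (is_sep(buf[i])) {
            in_word = 0;
        } else if (!in_word) {
            in_word = 1;
            words++;
        }
    }
    return words;
}
```

For the 14 bytes `a b \r c d \v e f \f g h NUL i j` (written here with spaces only for readability), the three-character rule sees a single word, while the `isspace` rule sees four, since NUL is not whitespace under either rule.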

score 0 · Answer 3 · answered Nov 02 '17 at 13:40

Instead of checking only for spaces, you should check for every whitespace character, such as \t, \n, space and so on.

You can use isspace() from <ctype.h> for this.

Change the line

if(c == ' ' && prevC != ' ') wCount++;

to

if(isspace(c) && !isspace(prevC)) wCount++;

This should give the correct results. Don't forget to include <ctype.h>.

answered Nov 02 '17 at 13:40

Sreedev Shibu


`int isspace(int c);` checks whether c is a whitespace character. – Sreedev Shibu Nov 02 '17 at 14:11

H.S. · Answer 4 · 2017-11-02T14:37:49.267

You can do:

#include <stdio.h>

int count(void)
{
    unsigned int cCount = 0, wCount = 0, lCount = 0;
    int incr_word_count = 0; // 1 while inside a word, 0 otherwise
    int c; // int rather than char, so that EOF differs from every byte value
    FILE *fp = fopen ("text", "r");

    if (fp == NULL)
    {
        printf ("Failed to open file\n");
        return -1;
    }

    while((c = fgetc(fp)) != EOF)
    {
        cCount++; // character count
        if(c == '\n') lCount++; // line count
        if (c == ' ' || c == '\n' || c == '\t')
            incr_word_count = 0;
        else if (incr_word_count == 0) {
            incr_word_count = 1;
            wCount++; // word count
        }
    }
    fclose (fp);
    printf ("line : %u\n", lCount);
    printf ("word : %u\n", wCount);
    printf ("char : %u\n", cCount);
    return 0;
}
