0

edit: I found strtok doenst get the string corrects for some reason how can i fix that?

I have this exercise which states that I must find the occurrence of each word and display it with the word next to it using calloc (for example for the word "World" that is found 2 times in the text will display like this: World 2).My algorithm doesn't work the output is nonsense for example: "1,78c1,78"

This is my algorithm where I don't know what wrong with it

#include <stdio.h>
#include <string.h>
#include<stdlib.h>
int main(int argc, char *argv[]){

   FILE *file=fopen(argv[1],"r");
   if(!file) {
  perror("File opening failed");
  return 1;
}
   int size = 1024;
   int *wordcount = calloc(size, sizeof(int));
   
   char **words = calloc(size, sizeof(char*));
   char line[1024];
   int p=0;
   int i=0;
     while (fgets(line, sizeof(line), file) != NULL) {
        char *word = strtok(line, " ");
        while (word != NULL) {
            for (int j=0;j<i;j++){
                if (strcmp(words[j],word)==0){
                    p=1;
                    wordcount[j]++;
                    break;
                } 
            }
            if (p==0){
              words[i]=strdup(word);
              wordcount[i]++;
              i++;
            }
            p=0;
            word = strtok(NULL, " ");
        
        }
    }
    for (int i = 0; i < size; i++) {
        if (words[i] != NULL) {
            printf("%s %d", words[i], wordcount[i]);
            printf("\n");
        }
    }

     fclose(file);
    for (int i = 0; i < size; i++) {
        if (words[i] != NULL) {
            free(words[i]);
        }
    }
    free(words);
    free(wordcount);

    return 0;
}

Here is one of the inputs I have been given that I have to run through tests.

Computer science is the study of computation, automation, and information.[1] Computer science spans theoretical disciplines (such as algorithms, theory of computation, information theory, and automation) to practical disciplines (including the design and implementation of hardware and software).[2][3][4] Computer science is generally considered an area of academic research and distinct from computer programming.[5]

1 Answers1

2
  1. Check if fopen succeeds.
FILE *file=fopen(argv[1],"r");
if(!file) {
  perror("File opening failed");
  return 1;
}
  1. strtok does not allocate new memory.

strtok returns a pointer into the string it is tokenizing, line. You're storing pointers to line in words. line is a buffer which gets reused every time fgets is called. Every time you read a line, the contents of line change, and so does the words pointed to by words.

For example, if you're given...

The quick brown fox
jumped over the lazy grey dog.

Your first iteration would be have line and words like so...

line = "The0quick0brown0fox\n"
words  0^  1^    2^    3^

strtok inserts nulls into the string to separate the tokens, here represented with 0. words[0] is "The", words[1] is "quick", and so on.

line = "jumped0over0the0lazy0grey0dog.\n"
words  0^  1^    2^    3^
       4^     5^   6^  7^   8^   9^

words[4] is now "jumped", but so is words[0]. words[1] is "ed". words[2] is "r", and so on.

Instead, copy the word before putting a pointer to it in words.

if (p=0){
  words[i]=strdup(word);
  wordcount[i]++;
}

This technique of reading lines into a buffer and then copying the parts you need is a common way of reading and parsing files.

  1. Check what you're comparing.
            for (int j=0;j<i;j++){
                if (strcmp(words[i],word)==0){
                    p=1;
                    wordcount[j]++;
                    break;
                } 
            }

The intent is to check if you've already seen the word before. You're looping over j, but you're comparing against words[i]. words[i], at this point, is an empty string. It will be like you never saw the word.

Instead, compare with words[j].

  1. Don't walk off the end of the array.

You always increment i even if you don't add to wordcount. This means i is a count of the number of words seen, not a count of the length of words.

Once you see a word twice, for (int j=0; j < i; j++) { will walk off the end of words.

Instead, only increment i when you see a new word. i is a bad variable name and should only be used for counters. It is the next index to use in words. Use something descriptive like next_words_idx.

  1. Strip newlines.

If you have...

one two three
one two

...there are two twos; "two" and "two\n". You need to remove the trailing newlines to have them match.

Schwern
  • 153,029
  • 25
  • 195
  • 336
  • @GiorgosZigouris In the for loop you have `if (strcmp(words[i],word)==0){` I think you should be checking `words[j]`. – Schwern Dec 19 '22 at 02:43
  • @GiorgosZigouris Since at that point `words[i]` will always be blank, `if (strcmp(words[i],word)==0)` will never be true. Every word will be considered a new word. Seems like that's what's happening. – Schwern Dec 19 '22 at 02:46
  • @GiorgosZigouris You always increment `i` even if you don't add to `words`. That means you can walk off the end of `words`. Try `printf("i: %d, j: %d, %s, %s\n", next_words, j, words[j], word);` to see. `i` is a bad variable name, it should only be used as an iterator. Use something like `next_words_idx`. – Schwern Dec 19 '22 at 03:25