9

So I've tried searching for a solution to this extensively but can only really find posts where the new line or null byte is missing from one of the strings. I'm fairly sure that's not the case here.

I am using the following function to compare a word to a file containing a list of words with one word on each line (dictionary in the function). Here is the code:

int isWord(char * word,char * dictionary){
  FILE *fp;
  fp = fopen(dictionary,"r");
  if(fp == NULL){
    printf("error: dictionary cannot be opened\n");
    return 0;
  }
  if(strlen(word)>17){
    printf("error: word cannot be >16 characters\n");
    return 0;
  }
  char longWord[18];
  strcpy(longWord,word);
  strcat(longWord,"\n");
  char readValue[50] = "a\n";
  while (fgets(readValue,50,fp) != NULL && strcmp(readValue,longWord) != 0){
    printf("r:%sw:%s%d\n",readValue,longWord,strcmp(longWord,readValue));//this line is in for debugging
  }
  if(strcmp(readValue,longWord) == 0){
    return 1;
  }
  else{
    return 0;
  }
}

The code compiles with no errors and the function reads the dictionary file fine and will print the list of words as they appear in there. The issue I am having is that even when the two strings are identical, strcmp is not returning 0 and so the function will return false for any input.

eg I get:

r:zymoscope
w:zymoscope
-3

Any ideas? I feel like I must be missing something obvious but have been unable to find anything in my searches.

Xephz
  • 103
  • 1
  • 4
  • 9
    I'd guess you're on a Windows machine, that you've read the file with CRLF line endings, and that you've not stripped the endings accurately. There's a difference of 3 between the value of `'\r'` and `'\n'` in many (most) codesets. It's curious you have one `printf()` printing all that data, but no `\n` in the format string. You're relying on the newlines in the data, which seems a little suspect. (Write a function to print the bytes in a string in hex; call it on each string; spot the difference.) – Jonathan Leffler Jul 27 '15 at 18:32
  • Your length check also assumes a single `\n`, you may be off by one if `\r\n` is used – Basic Jul 27 '15 at 18:34
  • 3
    Also, your file is never closed. Do all checks possible *before* opening the file. Close it when done (or on errors). –  Jul 27 '15 at 18:35
  • @Jonathan I'm running Linux (openSUSE) however it is more than possible that the list of words I downloaded was created in windows. – Xephz Jul 27 '15 at 18:36
  • Print out the length of both strings. If it's not 9, then you have extra characters that might not match. You clearly have a line ending after the `r:` line. – David Schwartz Jul 27 '15 at 18:38
  • 4
    Suggest `readValue[strcspn(readValue, "\r\n")] = 0;` right after using `fgets(readValue,50,fp)` to eliminate line ending characters. – chux - Reinstate Monica Jul 27 '15 at 18:39
  • 1
    OK; in some ways, that scenario (Linux reading a file possibly — probably — created on Windows) makes more sense. Provide yourself with the string dump function I suggested and use it. You might use: `static void dump_string(const char *tag, const char *string) { size_t len = strlen(string); printf("%s (%zu):", tag, len); size_t i; for (i = 0; i < len; i++) { printf(" %.2X", (unsigned char)string[i]); if (i % 16 == 15) putchar('\n'); } if (i % 16 != 0) putchar('\n'); }` and call it: `dump_string("r", readValue); dump_string("w", longWord);` or similar. – Jonathan Leffler Jul 27 '15 at 18:41
  • 2
    For future 1) When printing troublesome strings, use something like `printf("'%s'\n", bad_string);` to help identify leading and trailing white-space, new-lines, etc. 2) Rather than thinking `strcmp()` is wrong, ask why the strings are not equal. – chux - Reinstate Monica Jul 27 '15 at 18:47
  • What did your debugger tell you was in the two strings being compared? – Martin James Jul 27 '15 at 18:52
  • @JonathanLeffler Yes you're correct the file uses \r\n, the functions is now working correctly. – Xephz Jul 27 '15 at 18:55
  • @FelixPalmen Thanks for point that out, the close line has been lost at some point during debugging. Am I understanding you correctly that I should be looking to do as little as possible while the file is open in general then? If so, why? – Xephz Jul 27 '15 at 18:58
  • You need to close the file before your function returns (if it was succesful opening it), as otherwise you leak file streams — the one opened in the first call cannot be closed, and there is an upper bound on the number of file streams you can have open, even if they're all pointing at the same file. – Jonathan Leffler Jul 27 '15 at 19:00
  • 1
    @Xephz out of laziness (or, better, avoiding duplicate code). If the file is already opened, you need an additional `fclose()` in your check for the length of the word for cleanup. –  Jul 27 '15 at 19:01
  • @Xephz adding to that, **iff** your function gets lengthy and there are error checks you can only perform after the file was opened, this is one of the **rare** cases a `goto` is useful and *good practice*, e.g. `if (some error cond) { ret = -1; goto cleanup; }` –  Jul 27 '15 at 19:05
  • 1) always close the open file before returning 2) this line: 'strcat(longWord,"\n");' when strlen(word) == 17, then this line: 'strcat(longWord,"\n");' is writing past the end of the longWord buffer. This results in undefined behaviour and can lead to a seg fault event – user3629249 Jul 27 '15 at 20:33
  • @FelixPalmen, goto .. very bad idea. rather write a 'cleanup()' function and pass it, as parameters, what ever it needs to cleanup, like open files, malloc'd memory pointers, etc. – user3629249 Jul 27 '15 at 20:35
  • @user3629249 -- a comment like this shows you have **no** C knowledge whatsoever. –  Jul 27 '15 at 20:39
  • the 'while' loop walks through the 'dictionary' file, and will output a message for every word, until the desired word is (if ever) found. In a dictionary with thousands of words, that is thousands of lines of output that do not help the program perform its' function. – user3629249 Jul 27 '15 at 20:40

2 Answers2

5

I see you are appending a newline to your test strings to try to deal with the problem of fgets() retaining the line endings. Much better to fix this at source. You can strip all trailing stuff like this, immediately after reading from file.

readValue [ strcspn(readValue, "\r\n") ] = '\0';   // remove trailing newline etc
Jonathan Leffler
  • 730,956
  • 141
  • 904
  • 1,278
Weather Vane
  • 33,872
  • 7
  • 36
  • 56
  • 2
    I would add that this is completely safe for a `'\0'` terminated string. If there are no kind of line-endings, `strcspn()` returns the string length, thus harmlessly overwriting the `'\0'` which was already present. – Weather Vane Jul 27 '15 at 19:43
4

The string you are reading contains trailing character(s), and hence is not the same as the string you are comparing it against.

Remove the trailing newline (and CR if that is there); then you do not need to add any newline or carriage return to the string being compared:

int isWord(char *word, char *dictionary){
  FILE *fp;
  fp = fopen(dictionary, "r");
  if (fp == NULL){
    fprintf(stderr, "error: dictionary cannot be opened\n");
    return 0;
  }
  if (strlen(word) > 16){
    fprintf(stderr, "error: word cannot be >16 characters\n");
    return 0;
  }
  char readValue[50];
  while (fgets(readValue, 50, fp) != NULL){
    char *ep = &readValue[strlen(readValue)-1];

    while (*ep == '\n' || *ep == '\r'){
      *ep-- = '\0';
    }
    if (strcmp(readValue, word) == 0){
      return 1;
    }
  }
  return 0;
}
Jonathan Leffler
  • 730,956
  • 141
  • 904
  • 1,278
  • 1
    Note: `readValue[strlen(readValue)-1]` can result in undefined behavior/wrong in the rare situation where the first character read is the null character. Checking that the string length is greater than 0 is one way to fix that. – chux - Reinstate Monica Jul 27 '15 at 19:07
  • 1
    `while (*ep == '\n' || *ep == '\r'){ *ep-- = '\0'; } *ep-- = '\0';` is a problem when `readValue` is `"\n"` or `"\r\n"`. – chux - Reinstate Monica Jul 27 '15 at 19:08
  • And, though he was too polite to say it, @chux's answer does avoid both those problems (though it might be argued to stop prematurely on an input line with a CR in the middle, separate from a CRLF or LF line ending). And the need to do a separate `strlen()` operation is a reason to prefer the POSIX [`getline()`](http://pubs.opengroup.org/onlinepubs/9699919799/functions/getline.html) function over `fgets()`; it returns the length of the data it read, which (amongst other things) allows you to read past leading null bytes if you consider that desirable. – Jonathan Leffler Jul 27 '15 at 19:23
  • Yes I can see how both of those would be problematic, I originally chose this answer as it seemed more general in that it dealt with different possible line endings. @chux, now it seems like it would be best to simply use your solution with a check for each possible line ending. – Xephz Jul 27 '15 at 19:36