Direct answer to question in comment
My question still isn't answered; I just want to know what is causing it to not return 0.
Because:
- you are running on Windows,
- the file is opened as a binary file, and
- the character that terminates words at the end of a line is CR and not LF.
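Because the word ends with CR, the test for '\n' after the second loop fails and the function returns 1. When you next call the function, the first loop reads the LF and discards it because it is not alphabetic.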
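If you need to keep reading the file in binary mode, one possible fix is to treat a CR that is immediately followed by LF as the end of the line. The following is a minimal sketch rather than your original code: the name getWordCRLF is made up for illustration, and the loop also reserves room for the terminating null, which is an assumption about what you want when a word is longer than the buffer.

```c
#include <ctype.h>
#include <stdio.h>

enum { MAX_WORD = 50 };

/* Hypothetical variant of getWord(): returns -1 at EOF, 0 if the word
   is immediately followed by a line ending (LF or CRLF), 1 otherwise. */
static int getWordCRLF(FILE *in, char str[])
{
    int ch;
    int i = 0;

    /* Skip everything that is not a letter. */
    while ((ch = getc(in)) != EOF && !isalpha(ch))
        ;
    if (ch == EOF)
        return -1;

    str[i++] = (char)tolower(ch);
    while ((ch = getc(in)) != EOF && isalpha(ch))
    {
        if (i < MAX_WORD - 1)   /* leave room for the terminating null */
            str[i++] = (char)tolower(ch);
    }
    str[i] = '\0';

    if (ch == '\r')
    {
        /* Peek at the next character: CR LF counts as a newline. */
        int next = getc(in);
        if (next == '\n')
            return 0;
        if (next != EOF)
            ungetc(next, in);   /* lone CR: put the character back */
        return 1;
    }
    return (ch == '\n') ? 0 : 1;
}
```

It can be called exactly like getWord(). A lone CR that is not followed by LF is still treated as an ordinary separator.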
Main answer
Succinctly, your code does recognize newlines — at least on Linux.
#include <stdio.h>
#include <ctype.h>

enum { MAX_WORD = 50 };

static int getWord(FILE *in, char str[])
{
    int ch;
    int i = 0;
    while (!isalpha(ch = getc(in)) && ch != EOF)
        ;
    if (ch == EOF)
        return -1;
    str[i++] = tolower(ch);
    while (isalpha(ch = fgetc(in)) && ch != EOF)
    {
        if (i < MAX_WORD)
            str[i++] = tolower(ch);
    }
    if (ch == '\n')
        return 0;
    str[i] = '\0'; // Bug; should be before the if
    return 1;
}

int main(void)
{
    char buffer[MAX_WORD];
    int rc;
    while ((rc = getWord(stdin, buffer)) >= 0)
        printf("Got: %d (%s)\n", rc, buffer);
    return 0;
}
Given the input file:
blossom flower
bewilder confound confuse perplex
dwell live reside
The program produces the output:
Got: 1 (blossom)
Got: 0 (flowerm)
Got: 1 (bewilder)
Got: 1 (confound)
Got: 1 (confuse)
Got: 0 (perplex)
Got: 1 (dwell)
Got: 1 (live)
Got: 0 (residex)
Note that you get stray leftover characters in the word when you read a newline (when 0 is returned) and the current word is shorter than some earlier word: the function returns before null-terminating the string, so the tail of the earlier word is still in the buffer. You could get worse behaviour if the last word on the line is longer than any prior word and the stack is messy enough, because the buffer then has no terminator at all. You can fix that bug by moving the null termination before the if condition. The output is then:
Got: 1 (blossom)
Got: 0 (flower)
Got: 1 (bewilder)
Got: 1 (confound)
Got: 1 (confuse)
Got: 0 (perplex)
Got: 1 (dwell)
Got: 1 (live)
Got: 0 (reside)
Note that on Windows, if the program gets to read a '\r' (the CR part of the CRLF line endings), then the zero return would be skipped because the character terminating the word was '\r', and in the next call to the function, the first loop skips the '\n'.
Please note that indicating the platform (Unix vs Windows) would help clarify the question and get answers more quickly.
Note that when I create a DOS (Windows) format file, data.dos, and read that with the same (bug fixed) binary (running on an Ubuntu 14.04 derivative), the output is:
Got: 1 (blossom)
Got: 1 (flower)
Got: 1 (bewilder)
Got: 1 (confound)
Got: 1 (confuse)
Got: 1 (perplex)
Got: 1 (dwell)
Got: 1 (live)
Got: 1 (reside)
This exactly corresponds to the 'CR terminates the word and the first loop skips the newline' scenario. You could also debug by adding print statements in strategic places:
#include <stdio.h>
#include <ctype.h>

enum { MAX_WORD = 50 };

static int getWord(FILE *in, char str[])
{
    int ch;
    int i = 0;
    /* Skip non-alphabetic characters, reporting each one skipped */
    while (!isalpha(ch = getc(in)) && ch != EOF)
    {
        if (ch == '\n') printf("Got-1 '\\n'\n");
        else if (ch == '\r') printf("Got-1 '\\r'\n");
        else printf("Got-1 '%c'\n", ch);
    }
    if (ch == EOF)
        return -1;
    str[i++] = tolower(ch);
    while (isalpha(ch = fgetc(in)) && ch != EOF)
    {
        if (i < MAX_WORD - 1)   /* leave room for the terminating null */
            str[i++] = tolower(ch);
    }
    /* Report the character that terminated the word */
    if (ch == '\n') printf("Got-2 '\\n'\n");
    else if (ch == '\r') printf("Got-2 '\\r'\n");
    else printf("Got-2 '%c'\n", ch);
    str[i] = '\0';
    if (ch == '\n')
        return 0;
    return 1;
}

int main(void)
{
    char buffer[MAX_WORD];
    int rc;
    while ((rc = getWord(stdin, buffer)) >= 0)
        printf("Got: %d (%s)\n", rc, buffer);
    return 0;
}
And on the Unix file, the output is now:
Got-2 ' '
Got: 1 (blossom)
Got-2 '\n'
Got: 0 (flower)
Got-2 ' '
Got: 1 (bewilder)
Got-2 ' '
Got: 1 (confound)
Got-2 ' '
Got: 1 (confuse)
Got-2 '\n'
Got: 0 (perplex)
Got-2 ' '
Got: 1 (dwell)
Got-2 ' '
Got: 1 (live)
Got-2 '\n'
Got: 0 (reside)
And with the Windows file:
Got-2 ' '
Got: 1 (blossom)
Got-2 '\r'
Got: 1 (flower)
Got-1 '\n'
Got-2 ' '
Got: 1 (bewilder)
Got-2 ' '
Got: 1 (confound)
Got-2 ' '
Got: 1 (confuse)
Got-2 '\r'
Got: 1 (perplex)
Got-1 '\n'
Got-2 ' '
Got: 1 (dwell)
Got-2 ' '
Got: 1 (live)
Got-2 '\r'
Got: 1 (reside)
Got-1 '\n'
Note that Unix/Linux does not treat the CRLF combination specially; they are just two adjacent characters in the input stream.
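To check which kind of line endings a file actually contains, you can count them byte by byte. This is only a rough sketch: struct eol_counts and count_line_endings are hypothetical names, and on Windows the stream would have to be opened in binary mode ("rb") for the CRs to be visible at all.

```c
#include <stdio.h>

/* Hypothetical helper type: tallies how lines end in a stream. */
struct eol_counts
{
    long crlf;      /* CR immediately followed by LF */
    long lf;        /* LF not preceded by CR */
    long cr;        /* CR not followed by LF */
};

static struct eol_counts count_line_endings(FILE *in)
{
    struct eol_counts n = { 0, 0, 0 };
    int ch;
    int prev = EOF;

    while ((ch = getc(in)) != EOF)
    {
        if (ch == '\n')
        {
            if (prev == '\r')
                n.crlf++;
            else
                n.lf++;
        }
        else if (prev == '\r')
            n.cr++;     /* the previous CR was not followed by LF */
        prev = ch;
    }
    if (prev == '\r')
        n.cr++;         /* CR was the last byte in the stream */
    return n;
}
```

A non-zero crlf count for a file read on Unix/Linux, or in binary mode on Windows, means the words at the ends of lines are terminated by '\r' rather than '\n'.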