4

I have an input file I need to extract words from. The words can only contain letters and numbers, so anything else will be treated as a delimiter. I tried fscanf, fgets + sscanf, and strtok, but nothing seems to work.

while(!feof(file))
{
    fscanf(file,"%s",string);
    printf("%s\n",string);
}

The one above clearly doesn't work because it doesn't treat anything but whitespace as a delimiter, so I replaced the fscanf line with this:

 fscanf(file,"%[A-z]",string);

It reads the first word fine but the file pointer keeps rewinding so it reads the first word over and over.

So I used fgets to read the first line and then used sscanf:

sscanf(line,"%[A-z]%n",word,len);
line+=len;

This one doesn't work either because whatever I try I can't move the pointer to the right place. I tried strtok but I can't find how to set the delimiters:

while(p != NULL) {
    printf("%s\n", p);
    p = strtok(NULL, " ");
}

This one obviously takes the blank character as a delimiter, but I have literally hundreds of delimiters.

Am I missing something here? Extracting words from a file seemed like a simple concept at first, but nothing I try really works.

Ihateparsing

4 Answers

3

Consider building a minimal lexer. While in the word state, it stays there as long as it sees letters and digits, and switches to the delimiter state when it encounters anything else. In the delimiter state it does the exact opposite.

Here's an example of a simple state machine that might be helpful. For the sake of brevity it works only with digits. echo "2341,452(42 555" | ./main will print each number on a separate line. It's not a full lexer, but the idea of switching between states is the same.

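#include <stdio.h>
#include <string.h>

int main(void) {
  enum { WORD = 1, DELIM = 2, BUFLEN = 1024 };
  int c;                      /* int, not char, so EOF is representable */
  int state = DELIM, ptr = 0; /* start in DELIM so leading delimiters print nothing */
  char buffer[BUFLEN];
  const char *digits = "1234567890";
  while ((c = getchar()) != EOF) {
    if (c != '\0' && strchr(digits, c)) {
      if (DELIM == state) {
        ptr = 0;              /* first character of a new word */
      }
      if (ptr < BUFLEN - 1) { /* drop characters that would overflow the buffer */
        buffer[ptr++] = (char)c;
      }
      state = WORD;
    } else {
      if (WORD == state) {
        buffer[ptr] = '\0';
        printf("%s\n", buffer);
      }
      state = DELIM;
    }
  }
  if (WORD == state) {        /* flush a word that ends at EOF */
    buffer[ptr] = '\0';
    printf("%s\n", buffer);
  }
  return 0;
}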

If the number of states increases, consider replacing the if statements that check the current state with a switch block. Performance can be improved by replacing getchar with reading a whole block of input into a temporary buffer and iterating through it.
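Both suggestions can be combined in a rough sketch like the one below. It assumes words are runs of letters and digits (tested with isalnum) rather than digits only, and print_words, the state names and the 4096-byte block size are all hypothetical choices made for illustration.

```c
#include <ctype.h>
#include <stdio.h>

enum state { IN_WORD, IN_DELIM };

/* Hypothetical helper: writes each alphanumeric word from `in` on its own
 * line of `out` and returns how many words it saw. */
size_t print_words(FILE *in, FILE *out)
{
    char block[4096];
    size_t n, words = 0;
    enum state state = IN_DELIM;

    /* Read a block at a time; `state` carries over between blocks, so a
     * word that straddles a block boundary is still treated as one word. */
    while ((n = fread(block, 1, sizeof block, in)) > 0) {
        for (size_t i = 0; i < n; i++) {
            unsigned char c = (unsigned char)block[i];

            switch (state) {
            case IN_DELIM:
                if (isalnum(c)) {       /* a new word starts here */
                    fputc(c, out);
                    words++;
                    state = IN_WORD;
                }
                break;
            case IN_WORD:
                if (isalnum(c)) {
                    fputc(c, out);
                } else {                /* the word just ended */
                    fputc('\n', out);
                    state = IN_DELIM;
                }
                break;
            }
        }
    }
    if (state == IN_WORD)               /* terminate a word that ends at end of input */
        fputc('\n', out);

    return words;
}
```

Called as print_words(stdin, stdout), it should behave like the getchar version for the sample input, without needing a word buffer because characters are written as they are seen.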

For a more complex input file format you can use a lexical analyser generator such as flex. It can do the job of defining state transitions and other parts of lexer generation for you.

Jan
2

Several points:

First of all, do not use feof(file) as your loop condition; feof won't return true until after you attempt to read past the end of the file, so your loop will execute once too often.
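As a minimal sketch of the alternative, the loop can test the result of the read itself. count_tokens is a hypothetical helper, and the 256-byte buffer is an arbitrary size chosen for the example.

```c
#include <stdio.h>

/* Hypothetical helper: counts whitespace-separated tokens in `file`.
 * It exists only to show the shape of the loop. */
size_t count_tokens(FILE *file)
{
    char token[256];
    size_t count = 0;

    /* fscanf returns the number of successful conversions, so the loop
     * stops as soon as a read fails instead of running one extra time. */
    while (fscanf(file, "%255s", token) == 1)
        count++;

    return count;
}
```

If it matters why the loop ended, feof(file) and ferror(file) can be checked after the loop rather than in its condition.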

Second, you mentioned this:

fscanf(file,"%[A-z]",string);

It reads the first word fine but the file pointer keeps rewinding so it reads the first word over and over.

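That's not quite what's happening; if the next character in the stream doesn't match the scanset, fscanf returns 0 without consuming anything, and string is left unmodified. The offending character stays in the stream, so every later call fails the same way and you keep printing the old contents of string.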

Here's a simple, if inelegant, method: it reads one character at a time from the input file, checks to see if it's either an alpha or a digit, and if it is, adds it to a string.

#include <stdio.h>
#include <ctype.h>

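int get_next_word(FILE *file, char *word, size_t wordSize)
{
  size_t i = 0;
  int c;

  if (wordSize < 2)
    return 0; // no room for a character plus the terminator

  /**
   * Skip over any non-alphanumeric characters
   */
  while ((c = fgetc(file)) != EOF && !isalnum(c))
    ; // empty loop

  if (c != EOF)
    word[i++] = c;

  /**
   * Read up to the next non-alphanumeric character and
   * store it to word. The size check comes first so that no
   * character is consumed once the buffer is full.
   */
  while (i < (wordSize - 1) && (c = fgetc(file)) != EOF && isalnum(c))
  {
      word[i++] = c;
  }
  word[i] = 0;

  /**
   * Report success if a word was stored, so that a final
   * word ending at EOF is not lost.
   */
  return i > 0;
}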

int main(void)
{
   char word[SIZE]; // where SIZE is large enough to handle expected inputs
   FILE *file;
   ...
   while (get_next_word(file, word, sizeof word))
     // do something with word
   ...
}
John Bode
1

I would use:

FILE *file; /* assumed to be already opened for reading */
char string[200];

while(fscanf(file, "%*[^A-Za-z]"), fscanf(file, "%199[a-zA-Z]", string) > 0) {
    /* do something with string... */
}

This skips over non-letters and then reads a string of up to 199 letters. The only oddness is that any 'words' longer than 199 letters will be split into multiple words, but you need the limit to avoid a buffer overflow. To accept digits as well, as the question asks, add 0-9 to both scansets.
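The same skip-then-read pattern should also work on a line that has already been read with fgets, which is what the question's sscanf attempt was reaching for. The sketch below uses %n to find out how far each call got and advances the pointer by that amount. split_words and WORD_MAX are hypothetical names, and the scansets include 0-9 on the assumption that digits count as word characters.

```c
#include <stdio.h>

#define WORD_MAX 64  /* hypothetical per-word limit, chosen for the example */

/* Hypothetical helper: copies up to max_words alphanumeric words from
 * `line` into `words` and returns how many were stored. */
size_t split_words(const char *line, char words[][WORD_MAX], size_t max_words)
{
    size_t count = 0;

    while (count < max_words) {
        int used = 0;

        /* Skip a run of delimiters. If none match, the call fails before
         * reaching %n and `used` stays 0. */
        sscanf(line, "%*[^A-Za-z0-9]%n", &used);
        line += used;

        /* Read one word; %n records how many characters it consumed. */
        used = 0;
        if (sscanf(line, "%63[A-Za-z0-9]%n", words[count], &used) != 1)
            break;
        line += used;
        count++;
    }
    return count;
}
```

Note that %n takes a pointer to int, so the call needs &used rather than the variable itself.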

Chris Dodd
0

What are your delimiters? The second argument to strtok should be a string containing all of your delimiter characters, and the first should be a pointer to your string on the first call and NULL on every call after that:

char *p = strtok(line, ","); // assuming a , delimiter

while (p)
{
    printf("%s\n", p); // print before fetching the next token, so NULL is never printed
    p = strtok(NULL, ",");
}
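If the delimiters are "everything that is not a letter or a digit", the delimiter string does not have to be typed out by hand. One possible sketch, assuming 8-bit chars and the default "C" locale, builds it with isalnum. build_delims and split_alnum are hypothetical helpers.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical helper: fills `delims` (at least 256 bytes) with every
 * non-zero byte value that is not a letter or a digit. */
void build_delims(char *delims)
{
    size_t n = 0;

    for (int c = 1; c < 256; c++) {
        if (!isalnum(c))
            delims[n++] = (char)c;
    }
    delims[n] = '\0';
}

/* Hypothetical helper: splits `line` in place and stores pointers to up
 * to max_words words in `words`. Returns the number of words stored. */
size_t split_alnum(char *line, char *words[], size_t max_words)
{
    char delims[256];
    size_t count = 0;

    build_delims(delims);
    for (char *p = strtok(line, delims); p != NULL && count < max_words;
         p = strtok(NULL, delims))
        words[count++] = p;

    return count;
}
```

Keep in mind that strtok modifies the buffer it is given, so line must be a writable array and not a string literal.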
Matt Lacey