Copying entire input line in (f)lex (for better error messages)?

Question

As part of a typical parser using yacc (or bison) and lex (or flex), I'd like to copy entire input lines in the lexer so that, if there's an error later, the program can print out the offending line in its entirety and put a caret ^ under the offending token.

To copy the line, I'm currently doing:

%{
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

char *line;        // holds copy of entire line
bool copied_line;
%}

%%

^.+  {
       if ( !copied_line ) {
          free( line );
          line = strdup( yytext );
          copied_line = true;
       }
       REJECT;
     }

  /* ... other tokens ... */

\n   { copied_line = false; return END; }

This works but, judging from stepping through it in a debugger, it's really inefficient. The REJECT seems to make the lexer back off one character at a time rather than just jumping to the next possible match.

Is there a better, more efficient way to get what I want?

Unsure whether it is an option here, but you could try to redefine the `YY_INPUT` macro which is used to feed the parser and use it to store the current line and position. — Serge Ballesta, Apr 06 '17 at 08:07
Aha! I had forgotten about `YY_INPUT`. Let me give that a try. — Paul J. Lucas, Apr 06 '17 at 15:57

Paul J. Lucas · Answer 1 · 2017-04-07T20:09:46.340

1

Based on the hint from @Serge Ballesta of using YY_INPUT:

#include <stdlib.h>   // realloc()
#include <string.h>   // strncpy()

#define YY_INPUT( BUF, RESULT, MAX_SIZE ) \
  (RESULT) = lexer_get_input( (BUF), (MAX_SIZE) )

static size_t column;          // current 0-based column
static char  *input_line;      // copy of the current line (newline included)
static size_t input_line_len;  // length of input_line, excluding the null

static size_t lexer_get_input( char *buf, size_t buf_size ) {
  size_t bytes_read = 0;
  int    got_line_end = 0;     // set once a newline or EOF ends the line

  while ( bytes_read < buf_size ) {
    int const c = getc( yyin );
    if ( c == EOF ) {
      if ( ferror( yyin ) )
        /* complain and exit */;
      got_line_end = 1;
      break;
    }
    buf[ bytes_read++ ] = (char)c;  // count the newline too so flex sees it
    if ( c == '\n' ) {
      got_line_end = 1;
      break;
    }
  } // while

  if ( column == 0 && got_line_end ) {
    static size_t input_line_capacity;
    if ( input_line_capacity < bytes_read + 1/*null*/ ) {
      input_line_capacity = bytes_read + 1/*null*/;
      input_line = (char*)realloc( input_line, input_line_capacity );
    }
    strncpy( input_line, buf, bytes_read );
    input_line_len = bytes_read;
    input_line[ input_line_len ] = '\0';
  }

  return bytes_read;
}

The first time this is called, column will be 0, so it will copy the entire line into input_line. On subsequent calls, nothing special needs to be done. Eventually, column will be reset to 0 upon encountering a newline; then the next time the function is called, it will again copy the line.
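
The code above relies on column being kept up to date somewhere else in the scanner. A minimal sketch of one way to do that follows; update_column() is a hypothetical helper, not something flex provides, and the sketch assumes the newline rule cannot match anything past the newline, so its action runs before the next call to YY_INPUT.

```c
#include <stddef.h>

size_t column;   /* current 0-based column, as in the answer above */

/* Hypothetical helper, meant to be hooked in with something like:
 *   #define YY_USER_ACTION update_column( yytext, yyleng );
 * Advances column over the token text and resets it after each newline. */
void update_column( const char *text, size_t len ) {
  for ( size_t i = 0; i < len; ++i ) {
    if ( text[i] == '\n' )
      column = 0;      /* the next character starts a new line */
    else
      ++column;
  }
}
```

Because YY_USER_ACTION runs before the rule's own action, an action that wants the token's starting column would have to subtract yyleng, or the helper could save the old value first.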

This seems to work and is a lot more efficient. Anybody see any problems with it?

edited Apr 07 '17 at 20:09

answered Apr 06 '17 at 16:49

Paul J. Lucas


@rici Funny how I don't see a better answer from you. The `getline()` solution wouldn't work in the off chance the line was longer than flex's buffer. – Paul J. Lucas Apr 08 '17 at 06:16
The `for` loop in my solution is pretty much what the default flex code does for `YY_INPUT` and it's likely what `getline()` does internally except it's _less_ complicated because it doesn't have to care about extending the buffer. The added `if` for doing the copy is the only special-case handling. Additionally, it copies the line only when we've got an entire line which is the point. – Paul J. Lucas Apr 09 '17 at 05:05
Maybe I misread your intent; I was going by the code in your question which ensures that the entire input line is copied as soon as the scan reaches the first character of that line (and not, for example, just the amount of the line which fits in flex's internal buffer.) Anyway, I make no claims about the comparative speed of my code, but I do note that the flex default code you refer to is the code for an interactive lexer; except in rare applications, line-at-a-time reading is reasonable for interactive use but getline didn't exist when flex was written. – rici Apr 09 '17 at 05:23
@rici You have a point (about ensuring the entire input line has been read). Perhaps I'll change the accepted answer to yours. – Paul J. Lucas Apr 10 '17 at 11:33

rici · Accepted Answer · 2017-04-09T05:29:50.050

Here's a possible definition of YY_INPUT using getline(). It should work as long as no token includes both a newline character and the following character. (A token could include a newline character at the end.) Specifically, current_line will contain the last line of the current token.

On successful completion of the lexical scan, current_line will be freed and the remaining global variables reset so that another input can be lexically analysed. If the lexical scan is discontinued before end of input is reached (for example, because the parse was unsuccessful), an explicit call should be made to reset_current_line() in order to perform these tasks.

#include <stdio.h>      /* getline() (POSIX.1-2008), perror() */
#include <stdlib.h>     /* free() */
#include <string.h>     /* memcpy() */
#include <sys/types.h>  /* ssize_t */

char* current_line = NULL;
size_t current_line_alloc = 0;
ssize_t current_line_sent = 0;
ssize_t current_line_len = 0;

void reset_current_line(void) {
  free(current_line);
  current_line = NULL;
  current_line_alloc = current_line_sent = current_line_len = 0;
}

ssize_t refill_flex_buffer(char* buf, size_t max_size) {
  ssize_t avail = current_line_len - current_line_sent;
  if (!avail) {
    current_line_sent = 0;
    avail = getline(&current_line, &current_line_alloc, stdin);
    if (avail < 0) {
      /* perror() adds ": " and the error text itself */
      if (ferror(stdin)) { perror("Could not read input"); }
      avail = 0;
    }
    current_line_len = avail;
  }
  /* avail is non-negative here, so the cast is safe */
  if ((size_t)avail > max_size) avail = (ssize_t)max_size;
  memcpy(buf, current_line + current_line_sent, avail);
  current_line_sent += avail;
  if (!avail) reset_current_line();
  return avail;
}

#define YY_INPUT(buf, result, max_size) \
  result = refill_flex_buffer(buf, max_size)

Although the code above does not depend on the current column position, you need to track the column if you want to locate the current token within the current line. The following will help, provided you don't use yyless or yymore:

size_t current_col = 0, current_col_end = 0;
/* Call this in any token whose last character is \n,
 * but only after making use of column information.
 */
void reset_current_col() {
  current_col = current_col_end = 0;
}
#define YY_USER_ACTION \
  { current_col = current_col_end; current_col_end += yyleng; }
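
With the saved line and the two column values, printing the line with a caret under the token could look roughly like this. It is only a sketch: build_caret_marker() and print_error_context() are hypothetical names, and the arguments are assumed to be current_line, current_col and current_col_end from the code above.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: fill `out` with padding followed by '^' characters
 * covering columns [col, col_end). Tabs in the line are repeated in the
 * padding so the carets stay aligned on a terminal. Returns the length. */
size_t build_caret_marker(const char *line, size_t col, size_t col_end,
                          char *out, size_t out_size) {
  size_t n = 0;
  if (out_size == 0) return 0;
  for (size_t i = 0; i < col && line[i] != '\0' && n + 1 < out_size; ++i)
    out[n++] = (line[i] == '\t') ? '\t' : ' ';
  /* Always emit at least one caret, even for an empty token. */
  size_t width = col_end > col ? col_end - col : 1;
  for (size_t i = 0; i < width && n + 1 < out_size; ++i)
    out[n++] = '^';
  out[n] = '\0';
  return n;
}

/* Hypothetical helper: print the offending line, then the marker line. */
void print_error_context(FILE *fp, const char *line,
                         size_t col, size_t col_end) {
  char marker[256];   /* assumption: longer markers are simply truncated */
  size_t len = strlen(line);
  build_caret_marker(line, col, col_end, marker, sizeof marker);
  /* The saved line may or may not end in '\n'; print exactly one. */
  if (len > 0 && line[len - 1] == '\n') --len;
  fprintf(fp, "%.*s\n%s\n", (int)len, line, marker);
}
```

An error routine would call something like print_error_context(stderr, current_line, current_col, current_col_end) before the columns are reset.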

If you are using this scanner with a parser with lookahead, it may not be sufficient to keep only one line of the input stream, since the lookahead token may be on a subsequent line to the error token. Keeping several retained lines in a circular buffer would be a simple enhancement, but it is not at all obvious how many lines are necessary.
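
A rough sketch of that circular buffer, for illustration only: RETAINED_LINES, retain_line() and retained_line() are made-up names, and the count of 4 is arbitrary, since the right number depends on the grammar. The idea is that the input function would call retain_line() each time it reads a new line, and the error reporter would look lines up by number.

```c
#include <stdlib.h>
#include <string.h>

#define RETAINED_LINES 4   /* assumption: an arbitrary choice */

char  *retained[RETAINED_LINES];  /* ring of the most recent lines */
size_t lines_seen;                /* total number of lines saved so far */

/* Save a copy of `line` (of length `len`), evicting the oldest entry.
 * Returns 0 on success, -1 if memory could not be allocated. */
int retain_line(const char *line, size_t len) {
  char *copy = malloc(len + 1);
  if (copy == NULL) return -1;
  memcpy(copy, line, len);
  copy[len] = '\0';
  size_t slot = lines_seen % RETAINED_LINES;
  free(retained[slot]);   /* free(NULL) is a no-op for unused slots */
  retained[slot] = copy;
  ++lines_seen;
  return 0;
}

/* Return the text of 1-based line `lineno`, or NULL if that line has not
 * been read yet or has already been evicted from the ring. */
const char *retained_line(size_t lineno) {
  if (lineno == 0 || lineno > lines_seen) return NULL;
  if (lines_seen - lineno >= RETAINED_LINES) return NULL;
  return retained[(lineno - 1) % RETAINED_LINES];
}
```

Each token would then need to carry its line number (for example in its location data) so the error message can ask for the right line.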

IMHO, having to keep track of your own buffering is more complicated. There's really nothing gained by really wanting to use `getline()`. — Paul J. Lucas, Apr 09 '17 at 05:01
@paul: since you are unlikely to have inputs with lines longer than flex's buffer size, it is probably acceptable to cut off the saved line at the flex buffer limit, so I agree that the value added by this code is at best theoretical. Error messages are not improved by dumping 80,000-byte long source lines into the error log, although one could argue that it is better to show the error context than the beginning of such a line. Keeping the entire line does avoid having to check later on if it is there, though. I did check this code with 80kb lines, fwiw. — rici, Apr 09 '17 at 05:45
If you know the column of the offending token and the line is very long, you can print an excerpt of the line in window-width chunks --- but that's an issue for printing the error message, not getting the line in the first place. — Paul J. Lucas, Apr 10 '17 at 11:36
BTW: I don't think you should reset either `current_line` or `current_line_alloc`: just leave them to reduce unnecessary frees and reallocs. — Paul J. Lucas, Apr 10 '17 at 11:38
@PaulJ.Lucas: That's certainly an option. In real applications I would use a reentrant scanner and attempt to ensure that all memory was reclaimed when the scanner was destroyed, in which case the call to `reset_current_line` would be part of the internal API which frees a scanner context struct (including the extra data). In the non-reentrant case, it's probably just pedantry to free the buffer since flex doesn't arrange for its own buffer to be freed, so valgrind will report leaked memory in any case. — rici, Apr 10 '17 at 16:12

score 0 · Answer 3 · answered Apr 06 '17 at 07:26


Under the assumption that your input stems from a seekable stream:

Count the number N of newlines encountered
In case of error, seek and output line N + 1

Even if input is from a non-seekable stream, you could save all characters in a temporary store.

Variations on this theme are possible, such as storing the offset of the last newline seen, so you can directly seek to it.
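
As a sketch of the offset variation, assuming a seekable stream: read_line_at() is a hypothetical helper, and the offset has to be tracked by the scanner itself (for example by summing yyleng), because flex reads ahead and ftell() on yyin does not reflect the scan position.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: return a malloc'd copy of the line that starts at
 * byte `offset` in `fp`, without its trailing newline, or NULL on failure.
 * The stream position is restored afterwards. */
char *read_line_at(FILE *fp, long offset) {
  long saved = ftell(fp);
  if (saved < 0 || fseek(fp, offset, SEEK_SET) != 0)
    return NULL;                 /* not seekable, e.g. a pipe or a terminal */

  size_t cap = 128, len = 0;
  char *line = malloc(cap);
  int c;
  while (line != NULL && (c = getc(fp)) != EOF && c != '\n') {
    if (len + 1 == cap) {        /* keep room for the terminating null */
      char *bigger = realloc(line, cap *= 2);
      if (bigger == NULL) { free(line); line = NULL; break; }
      line = bigger;
    }
    line[len++] = (char)c;
  }
  if (line != NULL) line[len] = '\0';

  fseek(fp, saved, SEEK_SET);    /* put the stream back where it was */
  return line;
}
```

The caller frees the returned string after printing it.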


Jens


Unfortunately, the input can come from stdin (that can come from anywhere, including the terminal), so it's not seekable. – Paul J. Lucas Apr 06 '17 at 14:43

score 0 · Answer 4 · answered Apr 08 '17 at 01:02


In flex, you can use YY_USER_ACTION, which, if defined as a macro, will run for every token, just before running the token action. So something like:

#define YY_USER_ACTION  append_to_buffer(yytext);

will append yytext to a buffer where you can later use it.
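
The answer leaves append_to_buffer() undefined. One possible shape for it, as a sketch only: line_buf, line_len and line_cap are made-up names, and the buffer is assumed to start over once the previous token ended in a newline. Note that it holds only the part of the line scanned so far, not the rest of the line.

```c
#include <stdlib.h>
#include <string.h>

char  *line_buf;   /* text of the current line scanned so far */
size_t line_len;   /* bytes used, excluding the terminating null */
size_t line_cap;   /* bytes allocated */

/* Append one token's text; start a fresh line if the previous token
 * ended with a newline. On allocation failure the token is dropped. */
void append_to_buffer(const char *text) {
  size_t n = strlen(text);
  if (line_len > 0 && line_buf[line_len - 1] == '\n')
    line_len = 0;                       /* previous line is complete */
  if (line_len + n + 1 > line_cap) {
    size_t new_cap = (line_len + n + 1) * 2;
    char *bigger = realloc(line_buf, new_cap);
    if (bigger == NULL) return;         /* a real scanner would report this */
    line_buf = bigger;
    line_cap = new_cap;
  }
  memcpy(line_buf + line_len, text, n + 1);   /* copies the null too */
  line_len += n;
}
```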


Chris Dodd


This won't work. The idea is that you want the line, in its entirety, to print in the event there's an error. If you append token by token and there's an error at the current token, the line you've been appending to won't have the tokens from where you are now through the end of the line. – Paul J. Lucas Apr 08 '17 at 14:50
