tl;dr How to get a Bison/Flex parser to periodically run code that checks for an interruption request from the user?


I am looking to make a Bison/Flex based parser stop cleanly in response to interactive input. In other words, the parser should periodically check for user interruption, and if an interruption request is detected, it should exit cleanly. I know that I can stop a Bison parser using YYABORT, but I am not sure where to insert the interruption checks. Which Bison rule is run is determined by the contents of the input file. Is there a way to specify that a certain piece of code should be run periodically regardless of the contents of the file that is being parsed? Should the interruption checks be handled on the Bison side or the Flex side?

Szabolcs

2 Answers

Take a look at Flex's YY_USER_ACTION: the code in that macro runs every time a rule matches, just before the rule's own action. I'm not sure whether Bison has anything similar.
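
A minimal sketch of what such a hook might look like, written as plain C so that it compiles without Flex. Only `YY_USER_ACTION` is a real Flex name; `scan_state`, `interrupt_after` and `toy_lex` are hypothetical, and `toy_lex` merely stands in for the generated `yylex`. The sketch assumes that the macro body is expanded inside `yylex`, so that a `return` in it leaves the scanner, and that a pointer `st` to the shared state is in scope there.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical shared state; in a real scanner it would have to be passed
   into yylex somehow, it is not something Flex provides. */
struct scan_state {
    bool (*interrupt_requested)(void *user); /* hypothetical callback */
    void *user;
    bool interrupted;  /* set when scanning was cut short */
    int tokens_seen;
};

/* The hook itself: in a .l file this definition would go in the
   definitions section. It assumes `st` is visible where it expands. */
#define YY_USER_ACTION                                         \
    if (st->interrupt_requested(st->user)) {                   \
        st->interrupted = true;                                \
        return 0; /* 0 tells the parser there is no more input */ \
    }

/* Toy stand-in for the generated yylex(): one token per non-blank byte. */
int toy_lex(const char **cursor, struct scan_state *st)
{
    assert(cursor != NULL && st != NULL);
    while (**cursor == ' ')
        ++*cursor;
    if (**cursor == '\0')
        return 0;          /* genuine end of input */
    YY_USER_ACTION         /* runs before the "rule action" below */
    st->tokens_seen++;
    return (unsigned char)*(*cursor)++;
}

/* Hypothetical callback: reports an interruption once the poll budget
   stored in *user has been used up. */
bool interrupt_after(void *user)
{
    int *polls_left = user;
    return (*polls_left)-- <= 0;
}
```

Returning 0 here is only one of the possibilities; the `interrupted` flag is what lets the caller tell an interruption apart from a real end of input.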

SoronelHaetir
  • This looks good so far. The missing piece is: how can I cleanly return from Flex, making sure that all data structures, including Bison's, will be freed? Should I define a special token just for this purpose, return it from Flex, and handle that token specially in Bison? – Szabolcs Sep 19 '22 at 18:22

In the standard parser/lexer model, the parser knows absolutely nothing about the input mechanism. It simply transforms a stream of tokens into a parse tree. "Files" and "interactive input" are not part of a parser's data model, and you'll find it much more convenient to maintain that separation.

A Bison parser can use YYABORT to clean up and terminate (by returning the error code 1). That's the same error return as is produced by a syntax error. It's important to use YYABORT in order to free used resources, particularly if the parser stack includes allocated objects. So, as you say, the question resolves to how the lexer communicates the desire to terminate.

Here, the lexer's options are limited. It can return a special-purpose token, not used in any parser rule, which will trigger a syntax error. Or it could just return 0, indicating that there is no more input, which might or might not trigger a syntax error. (Of those options, I'd go for returning 0, but there's not much difference.)

If the parser is doing anything more complicated than building up an AST -- for example, if it will actually attempt to produce some product, like executable code, then you will want to include a mechanism which suppresses further processing. That could be through a global (yuk!), or shared state communicated between the parser and the lexer using Bison's additional parameter declarations. The shared state could be as simple as a boolean flag, which might need to be checked:

  • in yyerror, in order to suppress the syntax error;
  • in any parser error action, which should YYABORT on premature end of input;
  • in the parser's final reduction action (that is, the reduction to the start symbol), which should suppress further processing and probably call YYABORT;
  • in whoever called the parser, in order to correctly interpret yyparse's error return.

So an easy solution would be to add a %param declaration in your Bison file for a bool* parameter, remembering to adjust the prototypes of yylex, yyerror, and any other functions which need the extra parameter.
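
As a rough illustration of the first and last checks in that list, here is a plain-C sketch that compiles without Bison. The names `parse_ctx`, `parse_status` and `classify_result` are hypothetical, and a small struct is used instead of a bare bool* so that the error message has somewhere to go. The `yyerror` signature shown is an assumption: it is what one would expect with a declaration such as `%param { struct parse_ctx *ctx }` and no location tracking. The return codes follow the documented yyparse convention (0 for success, 1 for invalid input or YYABORT, 2 for memory exhaustion).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical result codes for whoever calls the parser. */
enum parse_status {
    PARSE_OK,
    PARSE_SYNTAX_ERROR,
    PARSE_INTERRUPTED,
    PARSE_OUT_OF_MEMORY
};

/* Hypothetical shared state, handed to both the lexer and the parser. */
struct parse_ctx {
    bool interrupted;   /* set by the lexer when the user cancels */
    char errmsg[128];   /* last syntax error message, if any */
};

/* Error callback: an interruption is not a syntax error, so nothing is
   recorded when the flag is set. */
void yyerror(struct parse_ctx *ctx, const char *msg)
{
    if (ctx->interrupted)
        return;
    snprintf(ctx->errmsg, sizeof ctx->errmsg, "%s", msg);
}

/* Interpret the parser's return code in the light of the flag. The flag
   wins even if the parser happened to accept the truncated input. */
enum parse_status classify_result(int yyparse_rc, const struct parse_ctx *ctx)
{
    if (ctx->interrupted)
        return PARSE_INTERRUPTED;
    switch (yyparse_rc) {
    case 0:  return PARSE_OK;
    case 2:  return PARSE_OUT_OF_MEMORY;
    default: return PARSE_SYNTAX_ERROR;
    }
}
```

The caller would then invoke something like `classify_result(yyparse(&ctx), &ctx)` and map the result onto its own error codes.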

How you actually detect the interrupt in your lexical scanner is a separate problem. Parsing a buffer's worth of input does not usually take a noticeable amount of time, so the easiest solution might be to let the interruption produce an EOF indication for the lexer, and then attempt to figure out whether the EOF was a real end of input or a user interrupt either in your <<EOF>> action or in an implementation of yywrap.
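
One way this could look, sketched in plain C: a read function that reports zero bytes when an interruption is pending and records why it did so. Everything here (`input_src`, `interruptible_read`, `input_ended_normally`, `flag_is_set`) is hypothetical. In a real scanner such a function might be called from a custom YY_INPUT definition, and the `interrupted` field would be examined in the <<EOF>> action or in yywrap; that wiring is an assumption and is not shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory input source with an interruption callback. */
struct input_src {
    const char *data;
    size_t len;
    size_t pos;
    bool (*interrupt_requested)(void *user);
    void *user;
    bool interrupted;  /* true if the "EOF" was really a cancellation */
};

/* Copy up to max_size bytes into buf. Returns 0 both at the real end of
   input and on interruption; the flag tells the two apart. */
size_t interruptible_read(struct input_src *src, char *buf, size_t max_size)
{
    if (src->interrupt_requested(src->user)) {
        src->interrupted = true;
        return 0;  /* looks like end of input to the scanner */
    }
    size_t n = src->len - src->pos;
    if (n > max_size)
        n = max_size;
    memcpy(buf, src->data + src->pos, n);
    src->pos += n;
    return n;
}

/* What an <<EOF>> action or yywrap would want to know. */
bool input_ended_normally(const struct input_src *src)
{
    return !src->interrupted && src->pos == src->len;
}

/* Hypothetical callback: the user data points at a bool flag. */
bool flag_is_set(void *user)
{
    return *(const bool *)user;
}
```

With this arrangement the interruption check runs once per buffer refill rather than once per token.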

rici
  • If I understand you correctly, there is no direct way to signal the interruption from Flex to Bison by sending a special interruption token. Instead, I need to somehow trigger an error, and then use a different channel (e.g. passed through a `yyparse` parameter) to communicate that this was not an error but an interruption. Is this correct? – Szabolcs Sep 20 '22 at 09:08
  • @Szabolcs: That's correct, although I'd say that interruption *is* an error, but it's not a syntax error (so it needs to be reported differently, or not at all). In effect, the parser/lexer interface is as though the parser were the simple loop: `while ((token = yylex())) { handle(token); }`, which doesn't leave much room for communicating anything else. – rici Sep 20 '22 at 17:09
  • @Szabolcs: However, it's possible to produce a more flexible interface by using a push parser. Perhaps I should add that to the answer. In that architecture, rather than attempting to force the parser to perform `YYABORT`, the parser's context object is simply deleted using the provided interface. – rici Sep 20 '22 at 17:09
  • @Szabolcs: Also, when I said that "interruption is an error", I was assuming that resuming input is impossible. There are contexts in which you might want to pause a parse in order to acquire more input, but that's a very different problem and I don't think that your question refers to that scenario. (For that case, I'd suggest using a thread, though.) – rici Sep 20 '22 at 17:43
  • Resuming is not possible. This is in fact for the graph format readers of the igraph library, many of which are based on Bison. I am working with this old code (and learning Bison on the way). igraph is typically used from high-level programming languages like Python, R or Mathematica, and is often used interactively (e.g. for interactive data exploration). For this reason, most functions are interruptible. If an operation is taking too long, the user can just cancel it. I want to extend this feature to the graph format readers, because reading a very large file may take a long time and because ... – Szabolcs Sep 20 '22 at 17:50
  • ... using OSS-fuzz for a while showed that with some formats it is possible to create corrupted files which take very long to parse (think a maliciously crafted file designed to lock up a system). Interruptions are indeed errors in igraph. On the igraph side it's just another error code. But I have to get from Flex to Bison and back to igraph. If I just define an arbitrary token and return that from Flex, Bison will report a parsing error ("unexpected token"). I need a way to distinguish it from an interruption, so the correct error code can be used at the top level. – Szabolcs Sep 20 '22 at 17:52
  • Thanks for the hints so far. I think it's clear enough how to proceed from here. I'll accept once I've implemented it and it works. – Szabolcs Sep 20 '22 at 17:53
  • If you figure out more details about the parser/lexer consuming excess time with maliciously-crafted inputs, it's probably possible to fix that. Not much to be done about huge files, though :-). Note that if you want to check to see if you received an invalid token, perhaps in `yyerror`, you have to ensure that the invalid token has its own internal code, which means using it in at least one production. `error` productions are good for this, because it's possible to craft one which can't actually happen, but there are other possibilities, too. – rici Sep 20 '22 at 18:34
  • It seems to me that the cleanest solution is to use the push parser interface. Is there any disadvantage in doing so, other than breaking compatibility with old Bison versions? (macOS includes Bison 2.3, but the push interface was added only in 2.3b.) Are there any performance implications to switching from the pull interface to the push one? – Szabolcs Sep 22 '22 at 10:51
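
For reference, the control flow of the push-parser approach discussed in these comments might look roughly like the sketch below. It is a plain-C simulation, not Bison output: `toy_pstate_new`, `toy_push_parse` and `toy_pstate_delete` are hypothetical stand-ins for Bison's `yypstate_new`, `yypush_parse` and `yypstate_delete`, the toy "grammar" only accepts balanced parentheses, and `run_parse` and `stop_after` are likewise made up for illustration. The point is only that the caller owns the loop, so the interruption check sits between tokens and needs no special token or error path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Return codes modelled on the push interface: done, error, want more. */
enum { TOY_DONE = 0, TOY_SYNTAX_ERROR = 1, TOY_PUSH_MORE = 4 };
enum run_status { RUN_OK, RUN_SYNTAX_ERROR, RUN_INTERRUPTED, RUN_NO_MEMORY };

/* Toy parser state: just the current nesting depth. */
struct toy_pstate { int depth; };

struct toy_pstate *toy_pstate_new(void)
{
    return calloc(1, sizeof(struct toy_pstate));
}

void toy_pstate_delete(struct toy_pstate *ps)
{
    free(ps);
}

/* Consume one token; token 0 means end of input. */
int toy_push_parse(struct toy_pstate *ps, int token)
{
    switch (token) {
    case '(':
        ps->depth++;
        return TOY_PUSH_MORE;
    case ')':
        if (ps->depth == 0)
            return TOY_SYNTAX_ERROR;
        ps->depth--;
        return TOY_PUSH_MORE;
    case 0:
        return ps->depth == 0 ? TOY_DONE : TOY_SYNTAX_ERROR;
    default:
        return TOY_SYNTAX_ERROR;
    }
}

/* Driver loop owned by the caller: poll for interruption, then push. */
enum run_status run_parse(const char *input,
                          bool (*interrupt_requested)(void *user), void *user)
{
    struct toy_pstate *ps = toy_pstate_new();
    if (ps == NULL)
        return RUN_NO_MEMORY;
    int rc;
    do {
        if (interrupt_requested(user)) {
            toy_pstate_delete(ps);  /* drop the parser state and stop */
            return RUN_INTERRUPTED;
        }
        int token = (unsigned char)*input;  /* one token per byte */
        if (token != 0)
            input++;
        rc = toy_push_parse(ps, token);
    } while (rc == TOY_PUSH_MORE);
    toy_pstate_delete(ps);
    return rc == TOY_DONE ? RUN_OK : RUN_SYNTAX_ERROR;
}

/* Hypothetical callback: interrupt once the poll budget runs out. */
bool stop_after(void *user)
{
    int *left = user;
    return (*left)-- <= 0;
}
```

Whether deleting the real parser state also releases semantic values still on the stack is not something this sketch addresses.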