Questions tagged [lexical-analysis]

Process of converting a sequence of characters into a sequence of tokens.

In computer science, lexical analysis is the process of converting a sequence of characters into a sequence of tokens. A program or function that performs lexical analysis is called a lexical analyzer, lexer, tokenizer, or scanner.

The lexical syntax is usually a regular language, whose atoms are individual characters, while the phrase syntax is usually a context-free language, whose atoms are words (the tokens produced by the lexer). While this separation is common, a lexer can alternatively be combined with the parser, as in scannerless parsing.

843 questions
8
votes
1 answer

How to efficiently implement longest match in a lexer generator?

I'm interested in learning how to write a lexer generator like flex. I've been reading "Compilers: Principles, Techniques, and Tools" (the "dragon book"), and I have a basic idea of how flex works. My initial approach is this: the user will supply a…
gsgx
  • 12,020
  • 25
  • 98
  • 149
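A hedged sketch of the longest-match ("maximal munch") rule in Python: try every rule at the current position, keep the longest match, and break ties by rule order, as flex does. (flex itself is more efficient: it compiles all rules into a single DFA and remembers the last accepting state seen. The rule names below are made up for illustration.)

```python
import re

# Hypothetical token rules, in priority order (earlier wins ties),
# mirroring how flex breaks ties between equally long matches.
RULES = [
    ("IF",     re.compile(r"if")),
    ("IDENT",  re.compile(r"[a-zA-Z_]\w*")),
    ("NUMBER", re.compile(r"\d+")),
    ("SKIP",   re.compile(r"\s+")),
]

def tokenize(text):
    pos = 0
    while pos < len(text):
        best = None  # (lexeme, rule name) of the longest match so far
        for name, rx in RULES:
            m = rx.match(text, pos)
            if m and (best is None or len(m.group()) > len(best[0])):
                best = (m.group(), name)
        if best is None:
            raise SyntaxError(f"no rule matches at position {pos}")
        lexeme, name = best
        if name != "SKIP":
            yield (name, lexeme)
        pos += len(lexeme)
```

Note how `iffy` lexes as one IDENT rather than IF followed by `fy`, which is exactly the longest-match behavior the question asks about.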
8
votes
2 answers

How do I implement a two-pass scanner using Flex?

As a pet-project, I'd like to attempt to implement a basic language of my own design that can be used as a web-scripting language. It's trivial to run a C++ program as an Apache CGI, so the real work lies in how to parse an input file containing…
dmercer
  • 397
  • 5
  • 17
7
votes
1 answer

How to parse a tab-separated line of text in Ruby?

I find Ruby's each function a bit confusing. If I have a line of text, an each loop will give me every space-delimited word rather than each individual character. So what's the best way of retrieving sections of the string which are delimited by a…
alamodey
  • 14,320
  • 24
  • 86
  • 112
7
votes
3 answers

FLEX: Is there a way to return multiple tokens at once

In flex, I want to return multiple tokens for one match of a regular expression. Is there a way to do this?
Eburetto
  • 213
  • 2
  • 9
7
votes
4 answers

How to turn a token stream into a parse tree

I have a lexer built that streams out tokens from an input, but I'm not sure how to build the next step in the process: the parse tree. Does anybody have any good resources or examples on how to accomplish this?
Evan Fosmark
  • 98,895
  • 36
  • 105
  • 117
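One standard next step is a recursive-descent parser: one function per grammar rule, each consuming tokens from the front of the stream and returning a tree node. A minimal Python sketch for a hypothetical grammar `expr -> NUM (('+'|'-') NUM)*`:

```python
# Recursive descent: each grammar rule becomes a function that takes the
# remaining token list and returns (subtree, leftover tokens).
def parse_expr(tokens):
    tree, rest = parse_num(tokens)
    while rest and rest[0] in ("+", "-"):
        op = rest[0]
        right, rest = parse_num(rest[1:])
        tree = (op, tree, right)          # builds a left-associative tree
    return tree, rest

def parse_num(tokens):
    if not tokens or not tokens[0].isdigit():
        raise SyntaxError("expected a number")
    return int(tokens[0]), tokens[1:]
```

The dragon book's syntax-analysis chapters cover this construction (and its table-driven alternatives) in depth.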
7
votes
2 answers

Syntactic predicates in ANTLR lexer rules

Introduction Looking at the documentation, ANTLR 2 used to have something called predicated lexing, with examples like this one (inspired by Pascal): RANGE_OR_INT : ( INT ".." ) => INT { $setType(INT); } | ( INT '.' ) => REAL {…
MvG
  • 57,380
  • 22
  • 148
  • 276
7
votes
1 answer

Which special characters must be escaped when using Python regex module re?

I'm using the Python module re to write regular expressions for lexical analysis. I've searched, to no avail, for a comprehensive list of which special characters must be escaped in order to be matched literally by the regex. Can someone please point…
Victor Brunell
  • 5,668
  • 10
  • 30
  • 46
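For the record, the standard library answers this one directly: `re.escape` escapes whatever needs escaping, so no hand-maintained list is required (since Python 3.7 it escapes only characters that actually have special meaning):

```python
import re

pattern = re.escape("1+1")                # r"1\+1": metacharacters escaped
assert re.fullmatch(pattern, "1+1")       # matches the literal text
assert not re.fullmatch(pattern, "111")   # '+' no longer means "repeat"
```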
7
votes
1 answer

Return multiple tokens in ocamllex

Is there any way to return multiple tokens in OCamlLex? I'm trying to write a lexer and parser for an indentation based language, and I would like my lexer to return multiple DEDENT tokens when it notices that the indentation level is less than it…
Joe Bloggs
  • 571
  • 3
  • 6
  • 14
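The usual workaround, whatever the lexer generator, is a queue of pending tokens: the scanner pushes several DEDENTs at once and the token function drains that queue before reading more input (CPython's own tokenizer works this way; in ocamllex you would keep a mutable queue alongside the lexbuf). A hedged Python sketch of the indentation-stack part, assuming space-only, consistent indentation:

```python
# Compare each line's indentation against a stack of open levels and
# emit one DEDENT per level closed -- possibly several from one line.
def indent_tokens(lines):
    stack = [0]                      # currently open indentation widths
    for line in lines:
        width = len(line) - len(line.lstrip(" "))
        if width > stack[-1]:
            stack.append(width)
            yield "INDENT"
        while width < stack[-1]:     # one DEDENT per closed level
            stack.pop()
            yield "DEDENT"
        yield ("LINE", line.strip())
    while len(stack) > 1:            # close everything still open at EOF
        stack.pop()
        yield "DEDENT"
```

Dropping from two levels of nesting straight back to column zero produces two consecutive DEDENT tokens, which is exactly the multi-token case the question describes.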
7
votes
2 answers

Prolog DCG: Writing programming language lexer

I'm trying for the moment to keep my lexer and parser separate, based on the vague advice of the book Prolog and Natural Language Analysis, which really doesn't go into any detail about lexing/tokenizing. So I am giving it a shot and seeing several…
Daniel Lyons
  • 22,421
  • 2
  • 50
  • 77
7
votes
1 answer

Character position in scanner using Lex/Flex

In Lex/Flex is there a way to get the position in the character stream (from the start of the file) that a token appears at? Kind of like yylineno except that it returns the character position as an integer? If not, what's the best way to get at…
ChrisDiRulli
  • 1,482
  • 8
  • 19
  • 28
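In flex the usual trick (hedged, from memory) is to keep the running offset yourself, e.g. by adding `yyleng` to a global counter in `YY_USER_ACTION` before each rule's action runs, since there is no character-offset analogue of `yylineno` built in. The same bookkeeping in a hand-rolled Python scanner, where `re` match objects carry the offset for free:

```python
import re

TOKEN = re.compile(r"\S+")   # toy token: any run of non-space characters

def tokens_with_offsets(text):
    # m.start() is the absolute character position from the start of the
    # input -- the analogue of a running, yyleng-accumulated counter.
    for m in TOKEN.finditer(text):
        yield (m.group(), m.start())
```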
7
votes
3 answers

How to recognize words in text with non-word tokens?

I am currently parsing a bunch of mails and want to get words and other interesting tokens out of mails (even with spelling errors or combination of characters and letters, like "zebra21" or "customer242"). But how can I know that…
zebra
  • 1,330
  • 1
  • 13
  • 26
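One simple approach to the mixed tokens mentioned ("zebra21", "customer242"): tokenize on alphanumeric runs so letter-digit combinations survive as single tokens, then filter. A hedged sketch; the keep-anything-with-a-letter rule is an assumption, not the asker's spec:

```python
import re

TOKEN = re.compile(r"[A-Za-z0-9]+")

def interesting_tokens(text):
    # Keep any run containing at least one letter, so "zebra21" and
    # "customer242" survive while bare numbers are dropped.
    for tok in TOKEN.findall(text):
        if any(c.isalpha() for c in tok):
            yield tok
```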
7
votes
3 answers

Lexical Analysis of Python Programming Language

Does anyone know where a FLEX or LEX specification file for Python exists? For example, this is a lex specification for the ANSI C programming language: http://www.quut.com/c/ANSI-C-grammar-l-1998.html FYI, I am trying to write code highlighting…
pokstad
  • 3,411
  • 3
  • 30
  • 39
7
votes
3 answers

Parsing Python function calls to get argument positions

I want code that can analyze a function call like this: whatever(foo, baz(), 'puppet', 24+2, meow=3, *meowargs, **meowargs) And return the positions of each and every argument, in this case foo, baz(), 'puppet', 24+2, meow=3, *meowargs,…
Ram Rachum
  • 84,019
  • 84
  • 236
  • 374
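For this particular task the standard library already does the lexing and parsing: `ast.parse` yields every argument as a node carrying its source position. A sketch (Python 3.8+ for `ast.get_source_segment`):

```python
import ast

src = "whatever(foo, baz(), 'puppet', 24+2, meow=3, *meowargs, **meowargs)"
call = ast.parse(src, mode="eval").body   # the Call node

# Positional and *starred arguments, each with its 0-based column offset.
for arg in call.args:
    print(ast.get_source_segment(src, arg), "at column", arg.col_offset)

# Keyword and **kwargs arguments live in call.keywords
# (kw.arg is None for a **kwargs entry).
for kw in call.keywords:
    print(kw.arg, "=", ast.get_source_segment(src, kw.value))
```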
7
votes
1 answer

Are there any off-the-shelf solutions for lexical analysis in Haskell that allow for a run-time dynamic lexicon?

I'm working on a small Haskell project that needs to be able to lex a very small subset of strictly formed English in to tokens for semantic parsing. It's a very naïve natural language interface to a system with many different end effectors than…
Doug Stephen
  • 7,181
  • 1
  • 38
  • 46
7
votes
2 answers

Why won't Parsec consider the right-hand side of my <|> alternative?

I’m trying to parse C++ code. Therefore, I need a context-sensitive lexer. In C++, >> is either one or two tokens (>> or > >), depending on the context. To make it even more complex, there is also a token >>= which is always the same regardless of…
user142019