Questions tagged [lexical-analysis]

Process of converting a sequence of characters into a sequence of tokens.

In computer science, lexical analysis is the process of converting a sequence of characters into a sequence of tokens. A program or function that performs lexical analysis is called a lexical analyzer, lexer, tokenizer, or scanner.

The lexical syntax is usually a regular language, whose atoms are individual characters, while the phrase syntax is usually a context-free language, whose atoms are words (tokens produced by the lexer). This separation is common but not required: in scannerless parsing, the lexer is combined with the parser.
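As a rough illustration of the description above, here is a minimal sketch of a regex-driven lexer in Python. The token classes (`NUMBER`, `IDENT`, `OP`, `SKIP`) and the `tokenize` function are hypothetical names chosen for this example, not part of any particular tool.

```python
import re

# Hypothetical token classes for a toy expression language (assumed for illustration).
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENT", r"[A-Za-z_]\w*"),
    ("OP", r"[+\-*/=()]"),
    ("SKIP", r"\s+"),
]

# One alternation of named groups; the name of the group that matched is the token type.
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(text):
    """Convert a string of characters into a list of (type, lexeme) tokens."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at offset {pos}")
        if match.lastgroup != "SKIP":  # whitespace is consumed but not emitted
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


print(tokenize("x = 42 + y1"))
```

A parser would then consume these `(type, lexeme)` pairs as its atoms instead of raw characters.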

843 questions
4
votes
3 answers

How to recognize a set of keywords in a text

I have a huge set of keywords. Given a text, I want to recognize only those words that occur in the keyword list and ignore all the other words. What is the best way to approach this?
kc3
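One possible approach to the question above, sketched in Python under the assumption that keywords are single words and that matching should ignore case: split the text into words and keep those found in a hash set. The function name `find_keywords` is hypothetical.

```python
import re


def find_keywords(text, keywords):
    """Return the words of `text` that appear in `keywords`, in order of occurrence."""
    # A set gives average constant-time membership tests, even for a large keyword list.
    keyword_set = {keyword.lower() for keyword in keywords}
    return [word for word in re.findall(r"\w+", text.lower()) if word in keyword_set]


print(find_keywords("The lexer feeds the Parser tokens", ["lexer", "parser"]))
```

If the keywords can span several words, a multi-pattern matcher such as the Aho-Corasick algorithm may be a better fit than per-word lookup.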
4
votes
2 answers

Is the dot of dot notation an operator or something else? How do you know?

I am trying to classify the "dot" token used in the dot notation (object.property). Being a self-taught amateur developer, mainly using JavaScript, I have a simplified (and certainly imperfect) understanding of programming and JavaScript. When…
4
votes
2 answers

Where can I find the full syntax of C that is necessary to implement a compiler?

My aim is not to write a C compiler; however, I do require the full syntax of the C programming language. This will allow me to write program(s) to format, manage, and analyze C programs and libraries more easily. To achieve that, I have no option…
machine_1
4
votes
2 answers

How can lexing efficiency be improved?

In parsing a large 3 gigabyte file with DCG, efficiency is of importance. The current version of my lexer is using mostly the or predicate ;/2 but I read that indexing can help. Indexing is a technique used to quickly select candidate clauses of a …
Guy Coder
4
votes
1 answer

How to properly scan for identifiers using Ragel

I'm trying to write a scanner for my C/C++/C#/Java/D-like programming language that I'm designing for personal reasons. For this task I'm using Ragel to generate my scanner. I'm having trouble understanding exactly when a lot of the operators…
Sion Sheevok
4
votes
1 answer

Get Prolog DCG arguments generated from sentence being parsed

I'm putting together a lexer/parser for a simple programming language using a Prolog DCG that builds up the list of tokens/syntax tree using DCG arguments, e.g. symbol(semicolon) --> ";". symbol(if) --> "if". and then the syntax tree is built using…
4
votes
1 answer

Function of the various Lexer commands in ANTLR4. Is my interpretation correct? What does each of them do?

I have started learning to write a lexer in ANTLR 4.5. From this page, which serves as documentation, I see that the following Lexer commands exist: more, pushMode(x), popMode, type(x), channel(x), mode(x), skip. I have not been able to clearly…
GoodDeeds
4
votes
1 answer

Are there tools to check whether a Fortran procedure modifies its arguments?

Are there tools that can be used to check which arguments of a Fortran procedure are being defined or not inside the procedure? I mean something like a lexical analyzer that simply checks if a variable is being used on the left hand side of an…
innoSPG
4
votes
1 answer

Does PLY's lexer support "maximal munch"?

The syntax of many programming languages requires that they be tokenized according to the "maximal munch" principle. That is, that tokens be built from the maximum possible number of characters from the input stream. PLY's lexer does not seem to…
user200783
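To make the "maximal munch" principle concrete, here is a minimal Python sketch that always takes the longest operator matching at the current position. It is a generic illustration, not a description of how PLY behaves; the operator set and the names `longest_operator` and `split_operators` are assumptions made for this example.

```python
# Hypothetical operator set, deliberately containing operators that are prefixes of others.
OPERATORS = ["+", "++", "+=", "-", "--", "->", "=", "=="]


def longest_operator(text, pos):
    """Return the longest operator that starts at `pos`, or "" if none matches."""
    best = ""
    for op in OPERATORS:
        if text.startswith(op, pos) and len(op) > len(best):
            best = op
    return best


def split_operators(text):
    """Tokenize a string of operators and whitespace using longest-match."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        op = longest_operator(text, pos)
        if not op:
            raise ValueError(f"no operator matches at offset {pos}")
        tokens.append(op)
        pos += len(op)
    return tokens


print(split_operators("+++ == ->"))
```

Under this rule `+++` is read as `++` followed by `+`, never as `+` followed by `++`.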
4
votes
5 answers

Find the Range of the Nth word in a String

What I want is something like "word1 word2 word3".rangeOfWord(2) => 6 to 10 The result could come as a Range or a tuple or whatever. I'd rather not do the brute force of iterating over the characters and using a state machine. Why reinvent the…
Andrew Duncan
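The question above does not name a language here, so the following is only a Python sketch of the idea: let a regular expression find the words and report the offsets of the n-th one. The function name `range_of_word` is hypothetical, and the returned end offset is exclusive, so the inclusive "6 to 10" in the question corresponds to `(6, 11)`.

```python
import re


def range_of_word(text, n):
    """Return (start, end) of the n-th whitespace-delimited word (1-based), or None.

    The end offset is exclusive, following Python's slicing convention.
    """
    for index, match in enumerate(re.finditer(r"\S+", text), start=1):
        if index == n:
            return match.span()
    return None


print(range_of_word("word1 word2 word3", 2))  # (6, 11)
```

Slicing with the result, as in `text[6:11]`, gives back `"word2"`.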
4
votes
2 answers

Recognize Identifiers in Chinese characters by using Lex/Yacc

How can I use Lex/Yacc to recognize identifiers in Chinese characters?
WuFa
4
votes
3 answers

Regular expressions versus lexical analyzers in Haskell

I'm getting started with Haskell and I'm trying to use the Alex tool to create regular expressions, and I'm a little bit lost; my first difficulty was the compile part. How do I compile a file with Alex? Then, I think that I have to…
Anny
4
votes
2 answers

Regular expression for HTML tags

I am working on a lexical analyzer. I have an HTML file. I want to convert every letter in the file, except whatever is written within an HTML tag, into a capital letter. Example: StackOverFlow This will be…
Surajeet Bharati
4
votes
3 answers

Including an external header file in Flex

I am writing a program using flex that takes input from a text file and splits it into tokens such as identifiers, keywords, and operators. My file name is test.l. I have made another hash table program which includes a file named SymbolTable.h.…
SKB
4
votes
3 answers

ANTLR4: lexer rule for: Any string as long as it doesn't contain these two side-by-side characters?

Is there any way to express this in ANTLR4: Any string as long as it doesn't contain the asterisk immediately followed by a forward slash? This doesn't work: (~'*/')* as ANTLR throws this error: multi-character literals are not allowed in lexer…
Roger Costello