(python - cpp) - How to split C++ code while writing a lexical analyzer in Python?

Question

I wrote a lexical analyzer for C++ code in Python, but the problem is that when I use input.split(" "), it won't recognize code like x=2 or function() as three different tokens unless I add a space between them manually, like x = 2. It also fails to recognize the tokens at the beginning of each line. (If I add spaces between every two tokens and also at the beginning of each line, my code works correctly.)

I tried splitting the code first by lines and then by spaces, but it got complicated and I still wasn't able to solve the first problem. I also thought about splitting it by operators, but I couldn't actually implement it. Plus, I need the operators to be recognized as tokens as well, so this might not be a good idea. I would appreciate any solution or suggestion. Thank you.

f=open("code.txt")
input=f.read()
input=input.split(" ")

f=open("code.txt")
input=f.read()
input1=input.split("\n")
for var in input1:
 var=var.split(" ")

If parsing code is what you require, you might want to take a look at what an [AST](https://en.wikipedia.org/wiki/Abstract_syntax_tree) is. Implement a simple one yourself, or pick something like [ANTLR](https://www.antlr.org). - Primemaster, Nov 15 '22 at 11:28

score 0 · Answer 1 · answered Nov 15 '22 at 15:52

Splitting on spaces alone can't succeed on both x=2 and x = 2, because the first form contains no spaces to split on.

What you are looking for is a solution that works with both, right?

A basic solution is to use an and operator with the conditions that you need to parse. Note that this solution isn't scalable and isn't good practice, but it can help you work your way toward better but harder solutions.

if input.split(' ') and input.split('='):

An intermediate solution would be to use regex. Regex isn't an easy topic, but you can check out the online documentation, and there are wonderful online tools, such as regex101, for testing your regular expressions.
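As a rough illustration of the regex idea, here is a minimal sketch using re.findall. The pattern and the name split_tokens are assumptions made up for this example, and the pattern covers only identifiers, integer literals and a handful of operators; it is nowhere near a complete C++ lexer (no strings, comments, or floating-point literals).

```python
import re

# Hypothetical, deliberately incomplete pattern: an identifier, an integer
# literal, a few two-character operators, or any other single
# non-whitespace character. Whitespace is never matched, so it is skipped.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|==|!=|<=|>=|\+\+|--|&&|\|\||\S")


def split_tokens(source):
    """Return the tokens found in `source`, ignoring whitespace."""
    # With no capture groups in the pattern, findall returns the full
    # text of each match, in order.
    return TOKEN_RE.findall(source)


print(split_tokens("x=2"))
print(split_tokens("  function()"))
```

With this pattern, x=2 and x = 2 both come out as the three tokens x, =, 2, and leading whitespace on a line no longer matters.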

The last one would be to convert your input data into an AST, which stands for abstract syntax tree. This is the technique employed by C++ compilers such as Clang. It is a really hard topic, so for figuring out a basic lexer it will probably be very time-consuming, but maybe it could fit your needs.

rici · Answer 2 · 2022-11-16T04:47:51.350

The usual approach is to scan the incoming text from left to right. At each character position, the lexical analyser selects the longest string which fits some pattern for a "lexeme", which is either a token or ignored input (whitespace and comments, for example). Then the scan continues at the first character after that lexeme.

Lexical patterns are often described using regular expressions, but the standard regular expression module re is not as much help as it could be for this procedure, because it does not have the facility of checking multiple regular expressions in parallel. (And neither does the possible future replacement, the regex module.) Or, more precisely, the library can check multiple expressions in parallel (using alternation syntax, (...|...|...)), but it lacks an interface which can report which of the alternatives was matched [Note 1]. So it would be necessary to try every possible pattern one at a time and select whichever one turns out to have the longest match.

Note that the matches are always anchored at the current input point; the lexical analyser does not search for a matching pattern. Every input character becomes part of some lexeme, even if that lexeme is ignored, and lexemes do not overlap.
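The procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the PATTERNS table, the token kind names and the function tokenize are hypothetical, and the patterns cover a tiny fraction of C++. It tries every pattern at the current position with Pattern.match(text, pos), which is anchored at pos, and keeps the longest match.

```python
import re

# Hypothetical lexeme patterns, far from complete for C++.
# A kind of None marks ignored input (whitespace, line comments).
PATTERNS = [
    (None, re.compile(r"\s+")),
    (None, re.compile(r"//[^\n]*")),
    ("NUMBER", re.compile(r"\d+(?:\.\d+)?")),
    ("IDENT", re.compile(r"[A-Za-z_][A-Za-z0-9_]*")),
    ("OP", re.compile(r"==|!=|<=|>=|\+\+|--|[-+*/=<>!(){};,]")),
]


def tokenize(text):
    """Scan left to right, taking the longest anchored match each time."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_kind, best_end = None, pos
        # Try every pattern at the current position, one at a time.
        for kind, pattern in PATTERNS:
            m = pattern.match(text, pos)  # anchored at pos, no searching
            if m and m.end() > best_end:  # on a tie, the earlier pattern wins
                best_kind, best_end = kind, m.end()
        if best_end == pos:
            raise SyntaxError(f"unexpected character {text[pos]!r} at offset {pos}")
        if best_kind is not None:  # drop ignored lexemes
            tokens.append((best_kind, text[pos:best_end]))
        pos = best_end  # continue right after the lexeme
    return tokens


print(tokenize("x=2"))
print(tokenize("  function() // call\ny==10"))
```

Because the longest match wins, // starts a comment rather than being read as the operator /, and == is a single token rather than two = tokens. Every character ends up in exactly one lexeme, or the scan stops with an error.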

You can write such an analyser by hand for a simple language, but C++ is hardly a simple language. Hand-built lexical analysers most certainly exist, but all the ones I've seen are thousands of lines of not very readable code. So it's usually easier to build an analyser automatically using software designed for that purpose. These have been around for a long time -- Lex was written almost 50 years ago, for example -- and if you are planning on writing more than one lexical analyser, you would be well advised to investigate some of the available tools.

Notes

1. The PCRE2 and Oniguruma regex libraries provide a "callout" feature which I believe could be used for this purpose. I haven't actually seen it used in lexical analysis, but it's a fairly recent addition, particularly for Oniguruma, and as far as I can see, the Python bindings for those two libraries do not wrap the callout feature. (Although, as usual with Python bindings to C libraries, documentation is almost non-existent, so I can't say for certain.)
