
I am learning about compilers and got a bit confused about all the terms and areas related to languages and compilers.

So here I share my understanding of the relationships between them and hope someone can confirm or correct my thoughts.


It is quite hard for developers to make applications by writing machine code directly, so we need a high-level language. The program we normally write thus becomes a set of texts.

A language uses regular expressions to define the syntax, i.e., whether all the texts in the program are good or not.

The task of the compiler is to translate those texts to machine code following the rules of the language definition.

The first two steps of a compiler are lexical analysis and parsing.

The lexical analysis converts the regular expressions to an NFA/DFA, works through the program texts, validates them, and converts them to tokens.

The parsing deals with those tokens and checks their semantics.


Am I right about all of the above?

Another question: so the definition of a language is a regular expression, and we use the parsing part to validate the program's grammar?

Jackson Tale

2 Answers


The program we normally write thus becomes a set of texts.

The word "text" isn't really a common term in compiler construction (or at least not one I heard before). Often a program first is translated into a sequence of tokens (which are basically the "words"¹ of the language) and then that sequence is translated into a syntax tree. That tree may then be further transformed and will finally be translated into a sequence of machine instructions, which make up the compiled program.

A language uses regular expressions to define the syntax, i.e., whether all the texts in the program are good or not.

The syntax of a language describes which programs are structurally valid or not (not taking into account type errors and runtime errors, which are handled separately). You cannot do that using regular expressions, as the vast majority of languages are not regular; that is, they're more complicated than what a regular expression can describe. For example, you can't say "for every opening parenthesis there must be a closing parenthesis" using a regular expression.

Regular expressions are often used to describe the tokens of a language. That is, you can say "identifiers in the language match the regex `[a-zA-Z_][a-zA-Z0-9_]*` and numbers match the regex `[0-9]+`".
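For instance, here is a small OCaml sketch using the Str library, just to illustrate what those two regexes classify (a real compiler would normally use a lexer generator or a hand-written lexer instead):

```ocaml
(* Build with OCaml's "str" library, e.g.
   ocamlfind ocamlopt -package str -linkpkg tokens.ml *)
let ident_re  = Str.regexp "[a-zA-Z_][a-zA-Z0-9_]*"
let number_re = Str.regexp "[0-9]+"

(* True if the regex matches the whole string, not just a prefix. *)
let matches re s =
  Str.string_match re s 0 && Str.match_end () = String.length s

let () =
  assert (matches ident_re "foo_bar1");    (* a valid identifier *)
  assert (matches number_re "42");         (* a valid number     *)
  assert (not (matches ident_re "42abc"))  (* starts with a digit, so not an identifier *)
```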

How those tokens fit together to form a complete program is then described in a grammar.

The first two steps of a compiler are lexical analysis and parsing.

Usually, yes.

The lexical analysis converts the regular expressions to an NFA/DFA, works through the program texts, validates them, and converts them to tokens.

If you use a lexer generator, the generator will take the regular expressions you gave it, convert them to automata, and then produce code based on those automata. That generated code is the lexer, which will take the program source and produce a sequence of tokens.

Note that the conversion between regular expressions and automata happens when the generator runs, not as part of your compiler. And if you write the lexer by hand, no conversion between regular expressions and automata will happen at all (except possibly in your head).
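For example, a hand-written lexer for the little token type sketched above might look roughly like this (simplified: no position tracking or error recovery). Note that it just inspects characters directly; no regular expression or automaton object exists at run time:

```ocaml
type token = INT of int | IDENT of string | PLUS | LPAREN | RPAREN

let is_digit c = c >= '0' && c <= '9'
let is_alpha c = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c = '_'

(* Characters in, tokens out. The "automaton" is implicit in the code. *)
let tokenize (src : string) : token list =
  let n = String.length src in
  let rec go i acc =
    if i >= n then List.rev acc
    else match src.[i] with
      | ' ' | '\t' | '\n' -> go (i + 1) acc          (* skip whitespace *)
      | '+' -> go (i + 1) (PLUS :: acc)
      | '(' -> go (i + 1) (LPAREN :: acc)
      | ')' -> go (i + 1) (RPAREN :: acc)
      | c when is_digit c ->
          let j = ref i in
          while !j < n && is_digit src.[!j] do incr j done;
          go !j (INT (int_of_string (String.sub src i (!j - i))) :: acc)
      | c when is_alpha c ->
          let j = ref i in
          while !j < n && (is_alpha src.[!j] || is_digit src.[!j]) do incr j done;
          go !j (IDENT (String.sub src i (!j - i)) :: acc)
      | c -> failwith (Printf.sprintf "unexpected character %c" c)
  in
  go 0 []

(* tokenize "(x + 42)" = [LPAREN; IDENT "x"; PLUS; INT 42; RPAREN] *)
```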

The parsing deals with those tokens and checks their semantics.

No. The parsing phase takes the tokens and makes sure that they conform to the syntax of the language. If they do, it will perform actions based on the syntactic structure of the program. Often that means building a syntax tree. For simple languages it is also possible to do semantic analysis (like type checking) and code generation directly in the parser.
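As a sketch of what "tokens in, syntax tree out" can look like, here is a tiny hand-written recursive descent parser in OCaml for a made-up grammar (expr ::= term | term '+' expr, term ::= INT | IDENT | '(' expr ')'); it either builds a tree or rejects the input:

```ocaml
type token = INT of int | IDENT of string | PLUS | LPAREN | RPAREN
type expr = Num of int | Var of string | Add of expr * expr

exception Syntax_error of string

(* Each function returns the parsed subtree plus the remaining tokens. *)
let rec parse_expr tokens =
  let lhs, rest = parse_term tokens in
  match rest with
  | PLUS :: rest' ->
      let rhs, rest'' = parse_expr rest' in
      (Add (lhs, rhs), rest'')
  | _ -> (lhs, rest)

and parse_term = function
  | INT n :: rest -> (Num n, rest)
  | IDENT x :: rest -> (Var x, rest)
  | LPAREN :: rest ->
      let e, rest' = parse_expr rest in
      (match rest' with
       | RPAREN :: rest'' -> (e, rest'')
       | _ -> raise (Syntax_error "expected ')'"))
  | _ -> raise (Syntax_error "expected an expression")

let parse tokens =
  match parse_expr tokens with
  | tree, [] -> tree
  | _ -> raise (Syntax_error "trailing tokens after expression")

(* parse [LPAREN; IDENT "x"; PLUS; INT 42; RPAREN] = Add (Var "x", Num 42) *)
```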

If you do build a syntax tree, subsequent phases will then go over that tree, and that's where the language's semantics come into play.
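A minimal sketch of such a later phase, again in OCaml with made-up names: an evaluator that walks the tree and gives it a meaning, including a simple semantic check that every variable is bound:

```ocaml
type expr = Num of int | Var of string | Add of expr * expr

exception Unbound_variable of string

(* A later phase: the tree is already syntactically valid, and this
   walk assigns it a meaning (here, an integer value). *)
let rec eval env = function
  | Num n -> n
  | Var x ->
      (try List.assoc x env
       with Not_found -> raise (Unbound_variable x))
  | Add (a, b) -> eval env a + eval env b

(* eval [("x", 1)] (Add (Var "x", Num 42)) = 43
   eval []         (Add (Var "x", Num 42)) raises Unbound_variable "x" *)
```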

Another question: so the definition of a language is a regular expression, and we use the parsing part to validate the program's grammar?

The definition of a language's syntax is generally given as a grammar, not a regular expression. As I said, regular expressions aren't expressive enough for that. We do use parsing to validate that a given program conforms to the language's grammar (as well as to determine the syntactic structure of the program).

The definition of a language consists of the definition of the language's syntax and the definition of its semantics. The latter is often given in text form.

¹ Here I'm using the colloquial meaning of the word "word", not its language-theoretic meaning.

sepp2k
  • really helpful answer. Thanks. By `Regular expressions are often used to describe the tokens of a language`, do you mean each token can be a regular expression? Also, is that why in `lex` tools like ocamllex we define the token type and give each constructor a regex? – Jackson Tale May 24 '14 at 12:04
  • @JacksonTale I mean there's a regular expression for each token, yes. I wouldn't say that the token *is* the regex though. The token is what gets produced by the lexer. Like if you have `[0-9]+ { IntegerLiteral (int_of_string (lexeme lexbuf)) }` and run that on the input `42`, then the token would be `IntegerLiteral 42`. – sepp2k May 24 '14 at 12:10
  • understand it now, thanks. – Jackson Tale May 24 '14 at 12:15
  • one more question: during lexing, every bit of the program must be included in some token, right? Otherwise the program text is somehow wrong, correct? – Jackson Tale May 24 '14 at 14:00
  • Yes, if there are characters in the program that do not belong to some token, the program is invalid. – sepp2k May 24 '14 at 14:09
  • could you please have a look at http://stackoverflow.com/questions/23854582/an-elegant-way-to-parse-sexp pls? – Jackson Tale May 25 '14 at 12:01

It is quite hard for developers to make applications by writing machine code directly, so we need a high-level language. The program we normally write thus becomes a set of texts.

OK.

A language uses regular expressions to define the syntax, i.e., whether all the texts in the program are good or not.

No. A language uses a context-free grammar to define the syntax, and possibly regular expressions to define the lexicon. Regular expressions cannot represent recursion, so they can't be used to define programming languages that have recursive syntax, which is practically all of them.
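To illustrate (a small OCaml sketch, not part of any particular compiler): the balanced-parentheses language has the recursive grammar s ::= ε | '(' s ')' s, and a checker can mirror that recursion directly, which no single regular expression can do:

```ocaml
(* Recognizes the (non-regular) language of balanced parentheses by
   following the recursive grammar  s ::= epsilon | '(' s ')' s  *)
let balanced (str : string) : bool =
  let n = String.length str in
  (* Parse one "s" starting at position i; return the position just after it. *)
  let rec parse_s i =
    if i < n && str.[i] = '(' then
      match parse_s (i + 1) with
      | Some j when j < n && str.[j] = ')' -> parse_s (j + 1)
      | _ -> None
    else Some i
  in
  parse_s 0 = Some n

let () =
  assert (balanced "(()())");
  assert (not (balanced "(()"))
```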

The task of the compiler is to translate those texts to machine code following the rules of the language definition.

OK.

The first two steps of a compiler are lexical analysis and parsing.

OK.

The lexical analysis converts the regular expressions to an NFA/DFA

No. The program that generates the lexical analyzer does that, if there is one. The generated analyzer just uses the NFA or DFA directly.

works through the program texts, validates them, and converts them to tokens.

No. It only does the last of those: converting the text into tokens. The parser does most of the validation, along with a phase that has been called the 'static semantics' phase.

The parsing deals with those tokens

Yes.

and checks their semantics.

No. Parsing has nothing to do with semantics. That's the province of the rest of the compiler.

Another question: so the definition of a language is a regular expression

No, see above.

and we use the parsing part to validate the program's grammar?

No, to validate the program against the grammar.

user207421
  • "Regular expressions cannot represent recursion, so they can't be used to define programming languages that have it" A recursive language in the language-theoretic sense is not a language that has recursion. Whether a language supports recursive function definitions is not related to whether it can be parsed by a regex. – sepp2k May 24 '14 at 12:14
  • @sepp2k I don't know what you may mean by 'recursive function definitions', but I didn't say a word about them, or about recursive function invocation either. I was talking about recursive syntax, just as you were. – user207421 May 24 '14 at 12:21
  • Oh, I see. I misunderstood that. By recursive function definitions I meant definitions of recursive functions. Like in some old languages you could not define a function that's recursive (because those languages didn't have the concept of a call stack), so that's what I'd usually call a programming language that does not have recursion. – sepp2k May 24 '14 at 12:29