
I am trying to write an XML scanner in C++. I would ideally like to use the regex library as it would be much easier.

However, I'm a little stumped as to how to do it. So, first I need to create a regular expression for each token in the language. I could use a map to store each regex paired with the name of its token.

Next, I would open an input file and use an iterator to walk through the strings in the file, matching each one against a regex. However, in XML you don't have whitespace to separate the strings.
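Roughly, this is the kind of structure I have in mind (just a rough sketch with placeholder token names and patterns, not a working scanner):

```cpp
#include <fstream>
#include <iostream>
#include <map>
#include <regex>
#include <sstream>
#include <string>

int main() {
    // Token name -> regex, as described above (placeholder patterns).
    std::map<std::string, std::regex> tokens = {
        {"NAME",   std::regex(R"([A-Za-z_][A-Za-z0-9._-]*)")},
        {"EQUALS", std::regex(R"(=)")},
        {"STRING", std::regex(R"("[^"]*")")},
    };

    std::ifstream file("input.xml");
    std::stringstream buffer;
    buffer << file.rdbuf();          // read the whole file into one string
    const std::string text = buffer.str();

    // This is where I get stuck: without whitespace between tokens I can't
    // just split the input into strings and regex_match each piece.
    for (const auto& [name, re] : tokens) {
        if (std::regex_match(text, re)) {
            std::cout << "the whole input is a single " << name << " token\n";
        }
    }
}
```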

So my question is: will this method even work? Also, how exactly does the regex library fit my needs? Is regex_match enough, and foolproof enough, that my scanner can't be tricked?

I'm just trying to create a skeleton of the process in my head so that I can start working on this. I wanted some input from others to see if I'm thinking about the problem correctly.

I'd appreciate any thoughts on this. Thanks so much!

Jane Doe
    Why reinvent the wheel? lex/flex has been around for decades, and has all the kinks ironed out. – Sam Varshavchik Oct 12 '16 at 02:17
  • I'm learning how to do lexical analysis. Just having code generated for me wouldn't be all that helpful. – Jane Doe Oct 12 '16 at 02:32
  • I agree that such tools are useful, but I would like to learn how to do it myself. – Jane Doe Oct 12 '16 at 02:33
  • Well, maybe you should write a regular expression evaluator all by yourself, then? Even the regex library does that work for you. – Sam Varshavchik Oct 12 '16 at 02:33
  • True. But the code those tools generate is a lot less readable, whereas I already have an understanding of how to build regular expressions. I could do by-hand scanning, but I've also read that another option is to do it using regular expressions. – Jane Doe Oct 12 '16 at 02:36
  • I was just asking a question about regular expressions and if I was on the right path in my thinking. I do understand reinventing the wheel is pointless, however. – Jane Doe Oct 12 '16 at 02:37

2 Answers


Lexical analysis usually proceeds by sequentially matching tokens, where each token corresponds to the longest possible match from a set of candidate regular expressions. Since each match is anchored where the previous token ended, no searching is performed.

Here, I use the word "token" slightly loosely; whitespace and comments are also matched as tokens, but in most programming languages they are simply ignored after being recognised. A conformant XML tokenizer would need to recognize them as tokens, though, so the usage would be precise for your problem domain.
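For example, a bare-bones longest-match loop over std::regex might look like the sketch below. The token names and patterns are illustrative and cover only the inside of a simple start tag, nothing like a full XML grammar; the point is the anchored, longest-match loop, where std::regex_constants::match_continuous pins each attempt at the current position so no searching happens.

```cpp
#include <iostream>
#include <regex>
#include <string>
#include <utility>
#include <vector>

int main() {
    // Illustrative token rules for the inside of a start tag only -- not a full XML grammar.
    const std::vector<std::pair<std::string, std::regex>> rules = {
        {"TAG_OPEN",      std::regex(R"(<)")},
        {"EMPTY_TAG_END", std::regex(R"(/>)")},
        {"TAG_END",       std::regex(R"(>)")},
        {"NAME",          std::regex(R"([A-Za-z_][A-Za-z0-9._-]*)")},
        {"EQUALS",        std::regex(R"(=)")},
        {"STRING",        std::regex(R"("[^"]*")")},
        {"WHITESPACE",    std::regex(R"(\s+)")},
    };

    const std::string input = R"(<greeting lang="en"/>)";
    auto pos = input.cbegin();

    while (pos != input.cend()) {
        std::string best_name;
        std::string best_lexeme;
        for (const auto& [name, re] : rules) {
            std::smatch m;
            // match_continuous anchors the attempt at 'pos': no searching is performed.
            if (std::regex_search(pos, input.cend(), m, re,
                                  std::regex_constants::match_continuous)
                && m.str(0).size() > best_lexeme.size()) {
                best_name = name;
                best_lexeme = m.str(0);
            }
        }
        if (best_lexeme.empty()) {   // nothing matched here: lexical error
            std::cerr << "lexical error at offset " << (pos - input.cbegin()) << '\n';
            return 1;
        }
        std::cout << best_name << ": " << best_lexeme << '\n';
        pos += best_lexeme.size();   // the next match is anchored where this one ended
    }
}
```

Run on the sample input, this prints one line per token, including the WHITESPACE token between the element name and the attribute.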

Rather than immersing yourself in a sea of annoying details, you might want to learn about (f)lex, which efficiently implements this algorithm given a collection of regular expressions. It also takes care of buffer handling and some other details which let you concentrate on understanding the nature of the lexical analysis process.

rici

There is a tool for this, called RE/flex, that generates scanners:

https://sourceforge.net/projects/re-flex

The generated scanners use regex engines such as Boost.Regex. Boost.Regex is used through an API that handles different types of input, so there is some additional C++ code involved; these are not the bare-bones Boost.Regex API calls that you may be looking for.

The examples included with RE/flex include an XML scanner in C++ that may help you get started. RE/flex also supports UTF-8 encoding, which you will need to scan XML properly.

Dr. Alex RE