Searching for a token in a file and copying each character after it into a string until another token is found

Question

I am writing a program which searches a file and, each time it comes across the '<' character, copies it and each following character into a string until a '>' is reached. So far this is what I've done:

while(!file.eof()){
    char c;
    string tag;

    file.get(c);

    if(c == '<'){

        tag_num++;
        tag += c;
    }       
}

How can I now continue the file.get(c), adding each character to tag until '>' is reached?

My idea, which I can't seem to get to work, was to add a while(file.get(c) != '>') loop inside the if block. That loop would call file.get(c) again and copy each of these characters into tag.

Making custom XML readers is a lot of work. Why are you not using one of the built in ones? — Brian from state farm, Nov 05 '15 at 20:16
I'm new to C++ and was unaware that there were built-in ones. Where can I get information on this? — KOB, Nov 05 '15 at 20:19
`file.get(c)` doesn't return the value of c, it returns whether it was successful (if I am not mistaken). Have a `while(c!='>')` and then call `file.get(c)` inside of the loop. — R Nar, Nov 05 '15 at 20:20

Colin Pitrat · Answer 1 · 2016-02-28T12:44:14.367

Parsing a file manually quickly becomes tricky. You may want to have a look at recursive descent parsers.

It's a pattern that consists of implementing the grammar of a file format with one function per grammar element, each function decoding its element and calling the others recursively.

Let's take a simple example with a simplified XML grammar (in BNF form):

element ::= '<'<tag>'/>'|'<'<tag>'>'<content>'</'<tag>'>'
content ::= <element>|<freetext>|<freetext><element>
freetext ::= [^<]|[^<]<freetext>
tag ::= <alpha>|<tag><alphanum>
alpha ::= [a-zA-Z]
alphanum ::= [a-zA-Z0-9]

(I think the [...] regexp syntax is not part of BNF, but it's simpler for me than writing down all the letters :-) The [^<] denotes any char that is not a <, which would conflict with the beginning of a tag in XML.)

This grammar describes an element. An element is either a self-closing tag (e.g. <br/>) or a start tag followed by some content and then an end tag. The content can be an element (hence the recursive definition using the previous element rule), some free text, or some free text followed by an element, and so on.

The parsing can then be implemented mechanically:

// c is passed by reference so that every parse_* call advances the caller's cursor
Element parse_element(char *&c)
{
    Element myElement; // Element contains the result of the parsing
                       // It's a type you have to define !
    assert( *c == '<' ); // Handle the error in a more clever way :-)
    c++;
    Tag myTag = parse_tag(c);
    if( *c == '/')
    {
         // Self-closing tag - add myTag to myElement
         c++;
         assert( *c == '>'); // Here again, better error handling
         c++;
    }
    else
    {
         // Or a start tag
         assert( *c == '>'); // Here again, better error handling
         c++;
         Content myContent = parse_content(c);
         // Add myTag with myContent to myElement
         // Then the end tag: '<', '/', the tag name, '>'
         assert( *c == '<'); // Here again, better error handling
         c++;
         assert( *c == '/'); // Here again, better error handling
         c++;
         Tag endTag = parse_tag(c); // Should be checked against myTag
         assert( *c == '>'); // Here again, better error handling
         c++;
    }
    return myElement;
}

I hope this function is enough to give an idea of the concept. The main point is that you first need a clearly defined grammar for the format you want to read. Then you can implement the parser mechanically.

Note that this example is far too simple: to parse real XML you would need at least to handle entities, attributes, etc.
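
To make the idea concrete, here is a self-contained sketch of the whole pattern, with one function per grammar rule. It rests on a few assumptions: end tags are written `</tag>` as in real XML, tag names can be of any length, and content may mix free text and nested elements in any order (a bit more permissive than the content rule of the grammar). The `Element` and `Parser` types and the `parse_document` function are hypothetical names chosen for illustration, and errors are reported with exceptions instead of assert.

```cpp
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical result type: a tag name, the free text found directly
// inside the element, and the nested elements.
struct Element {
    std::string tag;
    std::string text;
    std::vector<Element> children;
};

// Cursor over the input; every parse_* function advances pos.
struct Parser {
    const std::string& src;
    std::size_t pos;

    explicit Parser(const std::string& s) : src(s), pos(0) {}

    // Character at pos + ahead, or '\0' past the end of the input.
    char peek(std::size_t ahead = 0) const {
        return pos + ahead < src.size() ? src[pos + ahead] : '\0';
    }

    void expect(char wanted) {
        if (peek() != wanted) {
            throw std::runtime_error(std::string("expected '") + wanted +
                                     "' at offset " + std::to_string(pos));
        }
        ++pos;
    }

    // tag: one alpha followed by any number of alphanum
    std::string parse_tag() {
        if (!std::isalpha(static_cast<unsigned char>(peek()))) {
            throw std::runtime_error("expected a tag name at offset " +
                                     std::to_string(pos));
        }
        std::string name;
        while (std::isalnum(static_cast<unsigned char>(peek()))) {
            name += src[pos++];
        }
        return name;
    }

    // content: any mix of free text and nested elements, up to "</"
    void parse_content(Element& parent) {
        while (pos < src.size() && !(peek() == '<' && peek(1) == '/')) {
            if (peek() == '<') {
                parent.children.push_back(parse_element());
            } else {
                parent.text += src[pos++];
            }
        }
    }

    // element: '<' tag '/>'  or  '<' tag '>' content '</' tag '>'
    Element parse_element() {
        Element element;
        expect('<');
        element.tag = parse_tag();
        if (peek() == '/') {   // self-closing tag
            ++pos;
            expect('>');
            return element;
        }
        expect('>');           // end of the start tag
        parse_content(element);
        expect('<');           // end tag
        expect('/');
        if (parse_tag() != element.tag) {
            throw std::runtime_error("mismatched end tag for <" + element.tag + ">");
        }
        expect('>');
        return element;
    }
};

// Parses a single root element from the given string.
Element parse_document(const std::string& xml) {
    Parser parser(xml);
    return parser.parse_element();
}
```

Calling `parse_document("<a>hi<b/></a>")` yields an Element with tag "a", text "hi" and one child with tag "b". The two-character lookahead in parse_content is what lets the parser tell a nested element from the end tag of the current one.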

Some tools like GNU Bison ease the writing of the code once you have your grammar.

Finally, as already stated in the comments, XML parsers like libxml exist if you want to parse XML files. Using one will be much easier and much more complete than implementing your own parser. XML is a very complex format.

You're right I changed it. I thought the 5 XML entities were forbidden but in fact only `<` and `&` are strictly illegal, `>`, `'` and `"` can appear. In the example, I only talked about `<` and `>` for simplicity, `&` becomes necessary if we add entities. — Colin Pitrat, Feb 28 '16 at 12:41
