Do not match if a char is between quotation marks (AKA has a programming string pattern)

Question

I have been assigned to write a compiler for the BASIC programming language. In BASIC, statements are separated by new lines or by a `:` mark, e.g. the following two code samples are both valid.
Model# 1

 10 PRINT "Hello World 1" : PRINT "Hello World 2"

Model# 2

 10 PRINT "Hello World 1"
 20 PRINT "Hello World 2"

You can test those here.
The first thing I need to do, before parsing the code in my compiler, is to split it into statements.
I have already split the code into lines, but I am stuck finding a regex to split the following code sample.
This code sample should be split into 2 PRINT statements:

 10 PRINT "Hello World 1" : PRINT "Hello World 2"

But DO NOT match this:
The following code sample is a single standalone command.

 10 PRINT "Hello World 1" ": PRINT Hello World 2"

Question

Is there a regex pattern that DOES match the first code sample above, where the `:` is outside a pair of `"`, and DOES NOT match the second one?

Can anybody help me out here?
Anything would help. :)

You shouldn't be parsing this kind of construct with regexps. Regexps can only match regular languages, and your problem is not one. You should instead be using a construct such as [this](https://bitbucket.org/stormqueen1990/decimalcalc4j/src/b1d3203e7c3f9d70d41eb4c8636251cadf2271c0/src/parser/Parser.java?at=master). – Mauren, Mar 29 '14 at 21:28
@Mauren Yes indeed. I will eventually do that, but first I need to tokenize the source code and clean it up (i.e. remove comments, etc.). So I believe I need to tokenize the `:` first. – dariush, Mar 29 '14 at 21:35
I would advise you to tokenize by constructing a loop where you look at each character and decide which token it belongs to, instead of doing it with regexps. Please take a look at line 86 of the previously linked source code. – Mauren, Mar 29 '14 at 21:36
@Mauren: I agree that a "full" regex solution is not the best way to do this kind of task; however, don't believe that a library like boost (or other modern regex tools) is unable to match non-regular languages. We are far from theoretical considerations and the capabilities of POSIX regex engines. – Casimir et Hippolyte, Mar 29 '14 at 21:52

score 1 · Answer 1 · answered Mar 29 '14 at 21:45

I believe the best option for you is to tokenize your source code with a device such as a loop, instead of trying to tokenize it with regexps.

In pseudocode

string lexeme;
token t;

for char in string
    if char fits current token
        lexeme = lexeme + char;
    else
        // the current token is complete: record it, then start the next one
        t.lexeme = lexeme;
        t.type = type;
        emit t;
        lexeme = char;
    end if
    // other treatments here
end for

You can see a real-world implementation of this device in this source code, more specifically at line 86.
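
As a rough illustration of the loop idea for the specific case in the question, here is a minimal C++ sketch. The names `trim` and `split_statements` are hypothetical helpers, not taken from the linked source. It assumes that a `"` always toggles "inside a string" and that the dialect has no backslash escapes.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: strip leading and trailing whitespace.
static std::string trim(const std::string& s)
{
    const char* ws = " \t\r\n";
    const std::size_t first = s.find_first_not_of(ws);
    if (first == std::string::npos) return "";
    const std::size_t last = s.find_last_not_of(ws);
    return s.substr(first, last - first + 1);
}

// Hypothetical helper: split one BASIC line into statements at every ':'
// that is outside a double-quoted string.
std::vector<std::string> split_statements(const std::string& line)
{
    std::vector<std::string> statements;
    std::string current;
    bool in_string = false;  // assumption: every '"' toggles this flag

    for (char c : line) {
        if (c == '"') {
            in_string = !in_string;  // entering or leaving a string literal
        }
        if (c == ':' && !in_string) {
            // separator found: close the current statement
            const std::string statement = trim(current);
            if (!statement.empty()) statements.push_back(statement);
            current.clear();
        } else {
            current += c;  // ordinary character, or a ':' inside a string
        }
    }
    // whatever follows the last separator is the final statement
    const std::string statement = trim(current);
    if (!statement.empty()) statements.push_back(statement);
    return statements;
}
```

Calling `split_statements` on the first sample line from the question should give two statements, and on the second sample a single one, because its only `:` sits inside a string.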

The below answer is the implementation of your proposal. Thanks. — dariush, Mar 30 '14 at 01:00

score 0 · Answer 2 · answered Mar 29 '14 at 21:32

The idea to avoid this kind of problem is to match the content inside quotes before trying to match colons, for example:

"(?>[^\\"]++|\\{2}|\\.)*"|:

You can add capturing groups to know which part of the alternation has been matched.
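
A hedged sketch of the capturing-group idea in C++ with `std::regex` follows. The pattern above uses an atomic group and possessive quantifiers, which the default ECMAScript grammar of `std::regex` does not support, so this sketch swaps in a simplified pattern, `"[^"]*"|(:)`. That simplification assumes strings contain no escaped quotes. The function name `colons_outside_quotes` is hypothetical, and here the colon branch is the one captured.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: return the offset of every ':' that lies outside a
// double-quoted string. A quoted string is consumed by the first branch of
// the alternation, so a ':' inside it can never reach the second branch.
std::vector<std::size_t> colons_outside_quotes(const std::string& line)
{
    // simplified pattern (assumption: no escaped quotes inside strings)
    static const std::regex pattern("\"[^\"]*\"|(:)");
    std::vector<std::size_t> offsets;

    for (std::sregex_iterator it(line.begin(), line.end(), pattern), end;
         it != end; ++it) {
        const std::smatch& match = *it;
        // group 1 participates only when the colon branch matched
        if (match[1].matched) {
            offsets.push_back(static_cast<std::size_t>(match.position(1)));
        }
    }
    return offsets;
}
```

The returned offsets can then be used as split points. An empty result means the line is a single statement.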

However, the right tool for this kind of task is probably lex/yacc.

Casimir et Hippolyte

@Dariush: What are you trying to do? – Casimir et Hippolyte Mar 30 '14 at 01:05
@Dariush: the approach is different. If you put the first part of the alternation in a capturing group, you only need to check whether that group is empty or set to know whether the pattern matched a colon outside any quotes. – Casimir et Hippolyte Mar 30 '14 at 01:12

dariush · Accepted Answer · 2014-03-30T01:34:02.150

Thanks to @Mauren I managed to do what I wanted to do.
Here is my code (maybe it will help someone later):
Note that the source file's content is held in `char* buffer`, and the split lines are collected in `vector<string> source_codes`.

    /* lines' tokens container */
    std::string token;
    /* Tokenize the file's content into separate lines */
    /* fetch the data that was read line by line and store each line in the container vector */
    for(size_t top = 0, bottom = 0, length = strlen(buffer); top < length; top++)
    {
        /* inline tokenizing with line breakings */
        if(buffer[top] != '\n' || top == bottom)
        { /* collect current line's tokens */ token += char(buffer[top]); /* continue seeking */continue; }
        /* if we reach here we have collected the current line's tokens */
        /* normalize current tokens */
        boost::algorithm::trim(token);
        /* concurrent statements check point */
        if(token.find(':') != std::string::npos)
        {
            /* a quotation mark encounter flag */
            bool quotation_meet = false;
            /* process entire line from beginning */
            for(int index = 0; true ; index++)
            {
                /* loop's exit cond. */
                if(!(index < token.length())) { break; }
                /* fetch currently processing char */
                char _char = token[index];
                /* if encountered  a quotation mark */
                /* we are moving into a string */
                /* note that in basic for printing quotation mark, should use `CHR$(34)` 
                 * so there is no `\"` to worry about! :) */
                if(_char == '"')
                {
                    /* change quotation meeting flag */
                    quotation_meet = !quotation_meet;
                    /* proceed with other chars. */
                    continue;
                }
                /* if we have meet the `:` char and also we are not in a pair quotation*/
                if(_char == ':' && !quotation_meet)
                {
                    /* this is the first sub-token of current token: everything before the `:` */
                    std::string subtoken(token.substr(0, index));
                    /* normalize the sub-token */
                    boost::algorithm::trim(subtoken);
                    /* add sub-token as new line */
                    source_codes.push_back(subtoken);
                    /* keep the rest of the token as the new token */
                    /**
                     * Note: We keep the `:` mark intentionally, since every code line in BASIC
                     * should start with a number; a line starting with `:` is then known
                     * to run together with the previous numbered statement.
                     * So we use the following `substr` call instead of `token.substr(index + 1)`.
                     */
                    token = token.substr(index);
                    /* normalize the new token */
                    boost::algorithm::trim(token);
                    /* restart at 0; the loop's `index++` then skips the kept `:` */
                    index = 0;
                    /* continue with other chars */
                    continue;
                }
            }
        }
        /* add the remaining token, if it is not empty, to our source code */
        if(!token.empty())
            source_codes.push_back(token);
        /* move pointer to next tokens' position */
        bottom = top + 1;
        /* clear the token buffer */
        token.clear();
        /* a fail safe for loop */
        continue;
    }
    /* We NOW have our source code split into lines and saved in a vector. */
    /* Note: a final line is only stored if buffer ends with '\n'. */
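
For the outer line-splitting step alone, a possible alternative is `std::getline`, which also returns a final line that has no trailing newline. This is only a sketch under the assumption that the file content is available as a `std::string`. The name `split_lines` is hypothetical and not part of the code above.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: split raw file content into its non-empty lines.
std::vector<std::string> split_lines(const std::string& content)
{
    std::vector<std::string> lines;
    std::istringstream input(content);
    std::string line;

    while (std::getline(input, line)) {
        // drop a trailing '\r' left over from Windows line endings
        if (!line.empty() && line.back() == '\r') line.pop_back();
        // skip blank lines, keep everything else as read
        if (!line.empty()) lines.push_back(line);
    }
    return lines;
}
```

Each returned line can then be scanned for `:` separators outside quotes, as in the inner loop above.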
