
Context:

I have a code/text editor that I'm trying to optimize. Currently, the bottleneck of the program is the language parser that scans all the keywords (there is more than one parser, but they're all written in much the same way).

On my computer, the editor starts to lag on files of around 1,000,000 lines of code. On lower-end computers, like a Raspberry Pi, the delay starts happening much sooner (I don't remember exactly, but I think around 10,000 lines of code). And although I've never really seen documents larger than 1,000,000 lines of code, I'm sure they're out there, and I want my program to be able to edit them.

Question:

This leads me to the question: what's the fastest way to scan for a list of words within a large, dynamic string?

Here's some information that may affect the design of the algorithm:

  1. the keywords
  2. qualifying characters allowed to be part of a keyword (I call them qualifiers)
  3. the large string

Bottleneck-solution:

This is (roughly) the method I'm currently using to parse strings:

// this is just an example, not an excerpt
// I haven't compiled this, I'm just writing it to
// illustrate how I'm currently parsing strings

struct tokens * scantokens (char * string, char ** tokens, int tcount){

    int result = 0;
    struct tokens * tks = tokens_init ();

    for (int i = 0; string[i]; i++){

        // qualifiers for C are: a-z, A-Z, 0-9, and underscore
        // if it isn't a qualifier, skip it

        while (string[i] && isnotqualifier (string[i])) i++;
        if (!string[i]) break; // stop at the end of the string

        for (int j = 0; j < tcount; j++){

            // returns 0 for no match
            // returns the length of the keyword if they match
            result = string_compare (&string[i], tokens[j]);

            if (result > 0){ // if the string matches
                token_push (tks, i, i + result); // add the token
                // token_push (data_struct, where_it_begins, where_it_ends)
                break;
            }
        }

        if (result > 0){
            i += result - 1; // -1: the loop's own i++ steps past the keyword
        } else {
            // skip to the next non-qualifier;
            // the i++ and the check at the top of the loop then skip
            // ahead to the beginning of the next qualifier

            /* ie, go from:
                'some_id + sizeof (int)'
                 ^

            to here:
                'some_id + sizeof (int)'
                           ^
            */
            while (string[i + 1] && !isnotqualifier (string[i + 1])) i++;
        }
    }

    if (!tks->len){
        free (tks);
        return 0;
    } else return tks;
}

Possible Solutions:


Contextual Solutions:

I'm considering the following:

  • Scan the large string once, and add a function to evaluate/adjust the token markers every time there is user input (instead of re-scanning the entire document over and over). I expect this to fix the bottleneck, because there is much less parsing involved. But it doesn't completely solve the problem, because the initial scan may still take a really long time.

  • Optimize token-scanning algorithm (see below)

I've also considered, but have rejected, these optimizations:

  • Scanning only the code that is on the screen. Although this would fix the bottleneck, it would prevent finding user-defined tokens (i.e. variable names, function names, macros) that appear earlier in the file than the part that is visible on screen.
  • Switching the text to a linked list (a node per line), rather than a monolithic array. This doesn't really help the bottleneck: although insertions/deletions would be quicker, the loss of indexed access slows down the parser. I also think a monolithic array is more likely to stay in cache than a broken-up list.
  • Hard-coding a scan-tokens function for every language. Although this could be the best optimization for performance, it doesn't seem practical from a software development point of view.

Architectural Solution:

With assembly language, a quicker way to parse these strings would be to load characters into registers and compare them 4 or 8 bytes at a time. There are some additional measures and precautions that would have to be taken into account, such as:

  • Does the architecture support unaligned memory access?
  • All strings would have to be of size s, where s % word-size == 0, to prevent reading violations
  • Others?

But these issues seem like they can be easily fixed. The only problem (other than the usual ones that come with writing in assembly language) is that it's not so much an algorithmic solution as it is a hardware solution.
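
That said, the word-at-a-time comparison doesn't strictly require assembly. Here's a portable C sketch of the idea (the function name and the assumption that keywords are stored padded to a multiple of 8 bytes are just for illustration); memcpy into a fixed-width integer lets the compiler emit a single word load on architectures that allow unaligned access and a safe byte-wise copy elsewhere:

#include <stdint.h>
#include <string.h>

/* compare the first `len` bytes of `text` against `keyword`, a word at a time;
 * assumes both buffers stay readable for at least `len` bytes (e.g. keywords
 * stored padded to a multiple of 8, as suggested above) */
static int prefix_equal_wordwise (const char * text, const char * keyword, size_t len)
{
    size_t i = 0;
    while (i + 8 <= len) {
        uint64_t a, b;
        memcpy (&a, text + i, 8);    /* becomes a single load where legal */
        memcpy (&b, keyword + i, 8);
        if (a != b) return 0;
        i += 8;
    }
    for (; i < len; i++)             /* compare any remaining tail bytes */
        if (text[i] != keyword[i]) return 0;
    return 1;
}

This keeps the scanning logic the same; only the comparison gets wider.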

Algorithmic Solution:

So far, I've considered having the program rearrange the list of keywords to make a binary search algorithm a little more feasible.

One way I've thought about rearranging them is by switching the dimensions of the list of keywords. Here's an example of that in C:

// some keywords for the C language

auto  // keywords[0]
break // keywords[1]
case char const continue // keywords[2] through keywords[5]
default do double
else enum extern
float for
goto
if int
long
register return
short signed sizeof static struct switch
typedef
union unsigned
void volatile
while

/* keywords[i] refers to the i-th keyword in the list
 *
 */

Switching the dimensions of the two-dimensional array would make it look like this:

    0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3
    1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2
  -----------------------------------------------------------------
1 | a b c c c c d d d e e e f f g i i l r r s s s s s s t u u v v w
2 | u r a h o o e o o l n x l o o f n o e e h i i t t w y n n o o h
3 | t e s a n n f   u s u t o r t   t n g t o g z a r i p i s i l i
4 | o a e r s t a   b e m e a   o     g i u r n e t u t e o i d a l
5 |   k       i u   l     r t           s r t e o i c c d n g   t e
6 |           n l   e     n             t n   d f c t h e   n   i
7 |           u t                       e               f   e   l
8 |           e                         r                   d   e

// note that, now, keywords[0] refers to the string "abccccdddeeeffgiilrrsssssstuuvvw"

This makes it more efficient to use a binary search algorithm (or even a plain brute-force algorithm). But it only works for the first character of each keyword; after that, nothing can be considered 'sorted'. This may help for a small set of words, like a programming language's keywords, but it wouldn't be enough for a larger set of words (like the entire English language).
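
For comparison, a plain binary search over the sorted keyword list doesn't need the transposed layout at all; here is a rough sketch using the standard bsearch (the helper names are just for illustration, and `word` is assumed to be the identifier already sliced out of the big string and NUL-terminated):

#include <stdlib.h>
#include <string.h>

static int cmp_keyword (const void * key, const void * elem)
{
    return strcmp ((const char *) key, *(const char * const *) elem);
}

/* returns the keyword's length if `word` is in the sorted table, 0 otherwise */
static int match_keyword (const char * word, const char * const * keywords, size_t count)
{
    const char * const * hit =
        bsearch (word, keywords, count, sizeof *keywords, cmp_keyword);
    return hit ? (int) strlen (*hit) : 0;
}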

Is there more that can be done to improve this algorithm?

Is there another approach that can be taken to increase performance?

Notes:

This question from SO doesn't help me. The Boyer-Moore-Horspool algorithm (as I understand it) is an algorithm for finding a single substring within a string. Since I'm parsing for multiple strings, I think there's much more room for optimization.

tay10r
  • If you want to do it fast then you don't loop with a string compare and a list of strings, but build a finite state machine based on each character of ALL of the strings that triggers when a keyword is found. The Lex utility does this. – Jiminion Aug 21 '13 at 03:23

4 Answers


Aho-Corasick is a very cool algorithm but it's not ideal for keyword matches, because keyword matches are aligned; you can't have overlapping matches because you only match a complete identifier.

For the basic identifier lookup, you just need to build a trie out of your keywords (see note below).

Your basic algorithm is fine: find the beginning of the identifier, and then see if it's a keyword. It's important to improve both parts. Unless you need to deal with multibyte characters, the fastest way to find the beginning of a keyword is to use a 256-entry table, with one entry for each possible character. There are three possibilities:

  1. The character can not appear in an identifier. (Continue the scan)

  2. The character can appear in an identifier but no keyword starts with the character. (Skip the identifier)

  3. The character can start a keyword. (Start walking the trie; if the walk cannot be continued, skip the identifier. If the walk finds a keyword and the next character cannot be in an identifier, skip the rest of the identifier; if it can be in an identifier, try continuing the walk if possible.)

Actually steps 2 and 3 are close enough together that you don't really need special logic.
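
A minimal sketch of the 256-entry table and the trie walk described above (the table contents, node layout, and names are illustrative assumptions, not tested code):

#include <stddef.h>

enum char_class { CC_OTHER, CC_IDENT, CC_KEYWORD_START };

static unsigned char cclass[256];        /* filled in once at startup */

struct trie_node {
    struct trie_node * child[128];       /* ASCII-only for the sketch */
    int keyword_len;                     /* > 0 if a keyword ends here */
};

static void init_cclass (void)
{
    for (int c = 'A'; c <= 'Z'; c++) cclass[c] = CC_IDENT;
    for (int c = '0'; c <= '9'; c++) cclass[c] = CC_IDENT;
    cclass['_'] = CC_IDENT;
    /* a real table would only mark the letters that actually start a
       keyword; the sketch marks all lowercase letters */
    for (int c = 'a'; c <= 'z'; c++) cclass[c] = CC_KEYWORD_START;
}

/* returns the keyword length if the identifier starting at `s` is exactly a
 * keyword, 0 otherwise; cases 2 and 3 share this code, as noted above */
static int match_at (const struct trie_node * root, const unsigned char * s)
{
    const struct trie_node * n = root;
    size_t i = 0;
    while (n && cclass[s[i]] != CC_OTHER)
        n = n->child[s[i++] & 0x7f];
    /* a hit only counts if the walk consumed the whole identifier */
    return (n && n->keyword_len && cclass[s[i]] == CC_OTHER) ? n->keyword_len : 0;
}

If the return value is 0, the caller just skips to the end of the identifier and resumes scanning, which covers the "skip the identifier" cases above.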

There is some imprecision with the above algorithm because there are many cases where you find something that looks like an identifier but which syntactically cannot be. The most common cases are comments and quoted strings, but most languages have other possibilities. For example, in C you can have hexadecimal floating point numbers; while no C keyword can be constructed just from [a-f], a user-supplied word might be:

0x1.deadbeef

On the other hand, C++ allows user-defined numeric suffixes, which you might well want to recognize as keywords if the user adds them to the list:

274_myType

Beyond all of the above, it's really impractical to parse a million lines of code every time the user types a character in an editor. You need to develop some way of caching tokenization, and the simplest and most common one is to cache by input line. Keep the input lines in a linked list, and with every input line also record the tokenizer state at the beginning of the line (i.e., whether you're in a multi-line quoted string; a multi-line comment, or some other special lexical state). Except in some very bizarre languages, edits cannot affect the token structure of lines preceding the edit, so for any edit you only need to retokenize the edited line and any subsequent lines whose tokenizer state has changed. (Beware of working too hard in the case of multi-line strings: it can create lots of visual noise to flip the entire display because the user types an unterminated quote.)
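
A sketch of that per-line cache (all of the type, field, and function names here are my own, and retokenize_line is assumed to exist: it scans a single line starting in the given state and returns the lexer state at the end of the line):

enum lex_state { LS_NORMAL, LS_IN_COMMENT, LS_IN_STRING };

struct line {
    struct line * next;
    char * text;
    struct tokens * tokens;         /* cached tokens for this line */
    enum lex_state state_at_start;  /* lexer state on entry to this line */
};

/* assumed helper: tokenizes one line, replaces ln->tokens,
 * and returns the lexer state at the end of the line */
enum lex_state retokenize_line (struct line * ln, enum lex_state in_state);

/* after an edit, retokenize the edited line and keep going only while the
 * recorded entry state of the following line actually changes */
void retokenize_from (struct line * edited)
{
    enum lex_state state = edited->state_at_start;
    for (struct line * ln = edited; ln; ln = ln->next) {
        state = retokenize_line (ln, state);
        if (!ln->next || ln->next->state_at_start == state)
            break;                  /* downstream lines are unaffected */
        ln->next->state_at_start = state;
    }
}

Typing an ordinary character then costs one call to retokenize_from on the edited line, which normally stops after that single line.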


Note: For smallish (hundreds) numbers of keywords, a full trie doesn't really take up that much space, but at some point you need to deal with bloated branches. One very reasonable data structure, which can be made to perform very well if you're careful about data layout, is a ternary search tree (although I'd call it a ternary search trie.)
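
For reference, the lookup half of a ternary search tree is only a few lines; in this sketch (node layout and function name are mine) it returns the keyword length, or 0, for a NUL-terminated identifier:

struct tst_node {
    char split;                     /* character tested at this node */
    struct tst_node * lo, * eq, * hi;
    int keyword_len;                /* > 0 if a keyword ends at this node */
};

static int tst_lookup (const struct tst_node * n, const char * word)
{
    while (n) {
        if (*word < n->split)       n = n->lo;
        else if (*word > n->split)  n = n->hi;
        else {
            if (word[1] == '\0')    return n->keyword_len;
            word++;
            n = n->eq;
        }
    }
    return 0;
}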

rici
  • trie looks like a very promising solution, thank you. About your second section, I do plan on implementing an evaluation function to adjust the tokens (see question: `ctrl + f Scan the`) without re-scanning. Also, typing in a `*/` could potentially affect the lines previous, but I see your point. – tay10r Aug 21 '13 at 04:06
  • @TaylorFlores: How could `*/` affect previous lines? It terminates the comment which was started in a possibly previous line, but it doesn't make the previous line suddenly a comment or not-a-comment. (Avoid detecting unterminated comments or quotes as errors and doing something with them. It also leads to unnecessary visual noise; 99.99% of the time, the user is just about to terminate them. A strategy I've used is to avoid recolouring following unedited lines even if the lex state changes until the user stops typing for some amount of time, in the hope that it won't be needed.) – rici Aug 21 '13 at 04:10
  • "keyword matches are aligned; you can't have overlapping matches because you only match a complete identifier" -- the question included mention of searching for English language words, which can certainly overlap. – Jim Balter Aug 21 '13 at 04:27
  • @JimBalter: They can't overlap *if they have to be full words*; in that sense, they are like keywords. I admit that I was interpreting the OP, but I think my interpretation is based on reasonable evidence. It's arguable that one should answer questions with the literally correct answer even if that is obviously contrary to the questioner's actual requirements; personally I prefer not to (or, in some cases, to provide both answers) but that's just me. (Would you expect an English word colourizer to colourize `bead` in `f0bead042`? -- no answer required.) – rici Aug 21 '13 at 04:34
  • You're making too much of this ... I simply pointed out a case not covered; I didn't say that you shouldn't answer the question or that your answer isn't a good one. – Jim Balter Aug 21 '13 at 04:51
  • @rici I forgot to mention that the trie solution is great. Here's a [review of the code](http://codereview.stackexchange.com/questions/30469/improving-trie-implementation) in codereview, if you're interested in seeing the performance. In some cases, depending on how many words are in the trie, it parses more than two times faster than the way I was using before. – tay10r Sep 19 '13 at 15:35
  • @TaylorFlores: I responded to your code review with a little suggestion which I think will improve execution time a bit, but I'm glad the idea worked well. The only problem with tries is the storage consumption, but there are some compression techniques that don't cost that much. One simple one is to use indices into a vector of nodes instead of node pointers; if you don't have too many keywords, the indices will fit into a uint16_t, which occupies 25% of the space of a pointer (on a 64-bit architecture). – rici Sep 19 '13 at 17:29

It will be hard to beat this code.

Suppose your keywords are "a", "ax", and "foo".

Take the list of keywords, sorted, and feed it into a program that prints out code like this:

switch(pc[0]){
  break; case 'a':{
    if (0){
    } else if (strncmp(pc, "a", 1)==0 && !alphanum(pc[1])){
      // push "a"
      pc += 1;
    } else if (strncmp(pc, "ax", 2)==0 && !alphanum(pc[2])){
      // push "ax"
      pc += 2;
    }
  }
  break; case 'f':{
    if (0){
    } else if (strncmp(pc, "foo", 3)==0 && !alphanum(pc[3])){
      // push "foo"
      pc += 3;
    }
    // etc. etc.
  }
  // etc. etc.
}

Then if you don't see a keyword, just increment pc and try again. The point is, by dispatching on the first character, you quickly get into the subset of keywords starting with that character. You might even want to go to two levels of dispatch.
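
Here is a sketch of the generator itself, assuming the keyword list arrives sorted and that the emitted code uses the prefix comparison plus the alphanum() check from the example above (the function name emit_matcher is mine):

#include <stdio.h>
#include <string.h>

static void emit_matcher (const char * const * kw, int n)
{
    puts ("switch(pc[0]){");
    for (int i = 0; i < n; ) {
        char first = kw[i][0];
        printf ("  break; case '%c':{\n    if (0){\n", first);
        for (; i < n && kw[i][0] == first; i++) {
            size_t len = strlen (kw[i]);
            printf ("    } else if (strncmp(pc, \"%s\", %zu)==0 && !alphanum(pc[%zu])){\n",
                    kw[i], len, len);
            printf ("      // push \"%s\"\n      pc += %zu;\n", kw[i], len);
        }
        puts ("    }\n  }");
    }
    puts ("}");
}

Re-running the generator on each language's keyword file regenerates its matcher, so supporting another language doesn't mean hand-writing another scanner.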

Of course, as always, take some stack samples to see what the time is being used for. Regardless, if you have data structure classes, you're going to find that consuming a large part of your time, so keep that to a minimum (throw religion to the wind :)

Mike Dunlavey
  • I see what you mean. I'm working on a method with hashes right now, that I haven't posted yet, and it may be pretty efficient as well. – tay10r Aug 25 '13 at 16:48
  • @Taylor: Hash coding will be a fun exercise, but in terms of instruction cycles per input character, the code here will be hard to beat unless you have millions of keywords. By the time you've generated the hash code of the input word, you've already spent more cycles. Where hash coding wins with strings is if they are stored on slow media, like a database. – Mike Dunlavey Aug 25 '13 at 18:35
  • after using the trie (which was fun to write), I actually ended up using this approach. While it took a while to write (1000 lines so far), it is **much** faster than either of the last approaches I used (including the trie). Thanks again! – tay10r Oct 14 '13 at 02:10
  • @Taylor: Great! And now you've got a whole new skill they don't teach in school - partial evaluation - how to write a program to write your program! – Mike Dunlavey Oct 14 '13 at 12:52

The fastest way to do it would be a finite state machine built from the word set. Use Lex to build the FSM.
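
For illustration, a minimal flex specification along these lines might look like the following; the keyword list is truncated and the printf action is just a placeholder (a real editor would record token positions instead):

%{
#include <stdio.h>
%}
%option noyywrap
%%
"auto"|"break"|"case"|"char"|"const"|"while"   { printf("keyword: %s\n", yytext); }
[a-zA-Z_][a-zA-Z0-9_]*                         { /* ordinary identifier: ignore */ }
.|\n                                           { /* anything else: ignore */ }
%%
int main (void) { return yylex(); }

Because flex uses longest-match with rule order as the tie-breaker, an identifier such as autos falls through to the identifier rule instead of matching the auto keyword, which is exactly the whole-identifier behaviour discussed in the other answers. Build with something like `flex scan.l && cc lex.yy.c`.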

Jiminion
  • True, but there are a whole lot of details missing there, and lots of ways to do it that *won't* be fastest. These issues have already been addressed by the big-name algorithms. Edit: lex is a good suggestion, but flex is preferable. – Jim Balter Aug 21 '13 at 03:11
  • I guess the only bugaboo is the dynamic string (like it's being edited?). In that case, note the edit areas and keep the old token stream up until you are some distance from the change areas, and re-tokenize. – Jiminion Aug 21 '13 at 03:21

The best algorithm for this problem is probably Aho-Corasick. There already exist C implementations, e.g.,

http://sourceforge.net/projects/multifast/
Jim Balter
  • thanks for the link. I'm still reading up on it. It looks like I can't dynamically add keywords to the list, which is kind of a drawback because then I can't add user-defined keywords (see question: `ctrl+f user-defined tokens`). Is this true? – tay10r Aug 21 '13 at 03:32