Context:
I have a code/text editor that I'm trying to optimize. Currently, the bottleneck of the program is the language parser that scans for all the keywords (there is more than one parser, but they're all written in generally the same way).
On my computer, the editor starts to lag on files of around 1,000,000 lines of code. On lower-end computers, like a Raspberry Pi, the lag starts much sooner (I don't remember exactly, but I think around 10,000 lines of code). And although I've never quite seen documents larger than 1,000,000 lines of code, I'm sure they're out there, and I want my program to be able to edit them.
Question:
This leads me to the question: what's the fastest way to scan for a list of words within a large, dynamic string?
Here's some information that may affect the design of the algorithm:
- the keywords
- the characters allowed to be part of a keyword (I call them qualifiers)
- the large string
Bottleneck-solution:
This is (roughly) the method I'm currently using to parse strings:
// this is just an example, not an excerpt
// I haven't compiled this, I'm just writing it to
// illustrate how I'm currently parsing strings
struct tokens * scantokens (char * string, char ** tokens, int tcount){
    struct tokens * tks = tokens_init ();
    for (int i = 0; string[i]; i++){
        // qualifiers for C are: a-z, A-Z, 0-9, and underscore
        // if it isn't a qualifier, skip it
        while (string[i] && isnotqualifier (string[i])) i++;
        if (!string[i]) break;
        int result = 0;
        for (int j = 0; j < tcount; j++){
            // returns 0 for no match
            // returns the length of the keyword if they match
            result = string_compare (&string[i], tokens[j]);
            if (result > 0){ // if the string matches
                token_push (tks, i, i + result); // add the token
                // token_push (data_struct, where_it_begins, where_it_ends)
                break;
            }
        }
        if (result > 0){
            i += result - 1; // -1 because the loop's i++ steps past the keyword
        } else {
            // no keyword matched: skip to the next non-qualifier;
            // the while at the top of the loop then skips ahead to
            // the beginning of the next qualifier
            /* ie, go from:
               'some_id + sizeof (int)'
                ^
               to here:
               'some_id + sizeof (int)'
                          ^
            */
            while (string[i + 1] && !isnotqualifier (string[i + 1])) i++;
        }
    }
    if (!tks->len){
        free (tks);
        return 0;
    } else return tks;
}
Possible Solutions:
Contextual Solutions:
I'm considering the following:
Scan the large string once, and add a function to evaluate/adjust the token markers every time there is user input (instead of re-scanning the entire document over and over); there is a rough sketch of what I mean just below this list. I expect that this will fix the bottleneck, because there is much less parsing involved. But it doesn't completely fix the problem, because the initial scan may still take a really long time.
Optimize the token-scanning algorithm (see below)
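Here is that sketch of the marker adjustment: rough and untested, and it assumes tokens are stored as start/end offsets in a flat array (my real structs are a bit different):
struct token  { int start, end; };
struct tokens { struct token * arr; int len; };

// after an edit at offset 'pos' that grew (delta > 0) or shrank
// (delta < 0) the text, shift every token that starts at or after
// 'pos', and return the index of the token that 'pos' falls inside
// (or -1); only that one token would need to be re-scanned
int tokens_adjust (struct tokens * tks, int pos, int delta){
    int dirty = -1;
    for (int k = 0; k < tks->len; k++){
        struct token * t = &tks->arr[k];
        if (t->end <= pos) continue;   // token ends before the edit
        if (t->start >= pos){          // token starts after the edit
            t->start += delta;
            t->end   += delta;
        } else {
            dirty = k;                 // the edit landed inside this token
        }
    }
    return dirty;
}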
I've also considered, but have rejected, these optimizations:
- Scanning only the code that is on the screen. Although this would fix the bottleneck, it would limit the ability to find user-defined tokens (i.e. variable names, function names, macros) that appear earlier in the file than where the screen starts.
- Switching the text to a linked list (one node per line) rather than a monolithic array. This doesn't really help the bottleneck: although insertions/deletions would be quicker, the loss of indexed access slows down the parser. I also think a monolithic array is more likely to stay cached than a broken-up list.
- Hard-coding a scan-tokens function for every language. Although this could be the best optimization for performance, it doesn't seem practical from a software-development point of view.
Architectural solution:
With assembly language, a quicker way to parse these strings would be to load characters into registers and compare them 4 or 8 bytes at a time. There are some additional measures and precautions that would have to be taken into account, such as:
- Does the architecture support unaligned memory access?
- All strings would have to be of size s, where s % word-size == 0, to prevent reading violations
- Others?
But these issues seem like they can be easily fixed. The only problem (other than the usual ones that come with writing in assembly language) is that it's not so much an algorithmic solution as it is a hardware solution.
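For reference, here is roughly what I would expect that word-at-a-time comparison to look like in C rather than assembly (untested, using 64-bit words; it assumes it is safe to read a full 8 bytes at the text position, e.g. because the buffers are padded as described above):
#include <stdint.h>
#include <string.h>

// compare the keyword (of length klen) against the text at p,
// reading 8 bytes at a time; returns klen on a match, 0 otherwise
size_t word_compare (const char * p, const char * keyword, size_t klen){
    size_t i = 0;
    while (i + sizeof (uint64_t) <= klen){
        uint64_t a, b;
        memcpy (&a, p + i, sizeof a);        // memcpy instead of an
        memcpy (&b, keyword + i, sizeof b);  // unaligned direct load
        if (a != b) return 0;
        i += sizeof (uint64_t);
    }
    for (; i < klen; i++)                    // leftover bytes (< 8)
        if (p[i] != keyword[i]) return 0;
    return klen;
}
The memcpy into a local variable lets the compiler emit a single word load on targets that allow unaligned access, without the program itself performing an unaligned pointer cast.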
Algorithmic Solution:
So far, I've considered having the program rearrange the list of keywords to make a binary search algorithm a little more feasible.
One way I've thought about rearranging them is by switching the dimensions of the list of keywords. Here's an example of that in C:
// some keywords for the C language
auto // keywords[0]
break // keywords[1]
case char const continue // keywords[2], keywords[3], keywords[4], keywords[5]
default do double
else enum extern
float for
goto
if int
long
register return
short signed sizeof static struct switch
typedef
union unsigned
void volatile
while
/* keywords[i] refers to the i-th keyword in the list */
Switching the dimensions of the two-dimensional array would make it look like this:
0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3
1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2
-----------------------------------------------------------------
1 | a b c c c c d d d e e e f f g i i l r r s s s s s s t u u v v w
2 | u r a h o o e o o l n x l o o f n o e e h i i t t w y n n o o h
3 | t e s a n n f u s u t o r t t n g t o g z a r i p i s i l i
4 | o a e r s t a b e m e a o g i u r n e t u t e o i d a l
5 | k i u l r t s r t e o i c c d n g t e
6 | n l e n t n d f c t h e n i
7 | u t e f e l
8 | e r d e
// note that, now, keywords[0] refers to the string "abccccdddeeeffgiilrrsssssstuuvvw"
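In code, building that transposed representation would be something like this (a rough, untested sketch; 'column' is assumed to be pre-allocated with one row per character position and one slot per keyword):
#include <string.h>

// column[r][k] holds the (r+1)-th character of keywords[k],
// or '\0' if that keyword is shorter than r+1 characters
void transpose_keywords (const char * const * keywords, int kcount,
                         char ** column, int maxlen){
    for (int r = 0; r < maxlen; r++)
        for (int k = 0; k < kcount; k++)
            column[r][k] = (strlen (keywords[k]) > (size_t) r)
                         ? keywords[k][r] : '\0';
}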
This makes it more efficient to use a binary search algorithm (or even a plain brute-force algorithm). But it only works for the first character of each keyword; after that, nothing can be considered 'sorted'. This may help with small sets of words, like the keywords of a programming language, but it wouldn't be enough for a larger set of words (like the entire English language).
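To make the binary-search idea concrete, here is roughly what I have in mind, although shown over a plain sorted keyword list rather than the transposed table (untested; it assumes the current identifier has already been isolated into 'word' with length 'wlen'):
#include <string.h>

// binary-search a sorted keyword list for the identifier word[0..wlen);
// returns the index of the matching keyword, or -1 if it isn't one
int find_keyword (const char * word, size_t wlen,
                  const char * const * keywords, int kcount){
    int lo = 0, hi = kcount - 1;
    while (lo <= hi){
        int mid = lo + (hi - lo) / 2;
        int cmp = strncmp (word, keywords[mid], wlen);
        if (cmp == 0 && keywords[mid][wlen] == '\0')
            return mid;      // exact match
        if (cmp <= 0)        // word sorts before keywords[mid]
            hi = mid - 1;    // (cmp == 0 here means word is a proper prefix)
        else
            lo = mid + 1;
    }
    return -1;
}
With the 32 C keywords, that is at most 6 string comparisons per identifier instead of a linear scan over the whole list.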
Is there more that can be done to improve this algorithm?
Is there another approach that can be taken to increase performance?
Notes:
This question from SO doesn't help me. The Boyer-Moore-Horspool algorithm (as I understand it) is an algorithm for finding a single sub-string within a string. Since I'm parsing for multiple strings at once, I think there's much more room for optimization.