20

I am creating a syntax highlighter, and I am using String.split to create tokens from an input string. The first issue is that String.split creates a huge amount of empty strings, which causes everything to be quite slower than it could otherwise be.

For example, "***".split(/(\*)/) -> ["", "*", "", "*", "", "*", ""]. Is there a way to avoid this?

Another issue is the expression precedence in the regular expression itself. Let's say I am trying to parse a C style multi-line comment. That is, /* comment */. Now let's assume the input string is "/****/". If I were to use the following regular expression, it would work, but produce a lot of extra tokens (and all those empty strings!).

/(\/\*|\*\/|\*)/

A better way is to read /*'s, */'s and then read all the rest of the *'s in one token. That is, the better result for the above string is ["/*", "**", "*/"]. However, when using the regular expression that should do this, I get bad results. The regular expression is like so: /(\/\*|\*\/|\*+)/.

The result of this expression is however this: ["/*", "***", "/"]. I am guessing this is because the last part is greedy so it steals the match from the other part.

The only solution I found was to make a negated lookahead expression, like this:

/(\/\*|\*\/|\*+(?!\/)/

This gives the expected result, but it is very slow compared to the other one, and this has an effect for big strings.

Is there a solution for either of these problems?

user2503048
  • 1,021
  • 1
  • 10
  • 22

2 Answers2

25

Use lookahed to avoid empty matches:

arr = "***".split(/(?=\*)/);
//=> ["*", "*", "*"]

OR use filter(Boolean) to discard empty matches:

arr = "***".split(/(\*)/).filter(Boolean);
//=> ["*", "*", "*"]
anubhava
  • 761,203
  • 64
  • 569
  • 643
  • 1
    The lookahead sadly doesn't let me to get multiple things in the same token, e.g. "***" as one token in the above example. I used filter() previously, but the additional array traversal actually made the whole code slower than it is when doing one iteration while skipping the empty strings. – user2503048 Nov 11 '13 at 23:54
  • 1
    Anu -- sorry to bother you with this, but can you explain or point me to a resource that explains the .filter(Boolean) concept. Are the empty strings returned as false etc.. I'm slightly confused by your usage. – james emanon Nov 12 '13 at 01:31
  • 3
    Empty strings evaluate as false in boolean expressions. – user2503048 Nov 12 '13 at 01:34
  • @user2503048: Can you provide some example inputs and expected matches in your question. I am sure lookahead can still be used with those captures. – anubhava Nov 12 '13 at 06:29
  • 1
    arr = "***".split(''); Can also be used to tokenize the individual characters of a string. – SridharKritha Jul 15 '20 at 07:27
  • 1
    @anubhava You are responding to an answer from 7 years ago :) I don't remember what I ended up doing at the time, but at the end of the day tokenizing with regular expressions is just not the best, better to make a specific tokenizer for whatever input you want (e.g. iterate character by character and know when tokens start/end), it is significantly faster and more efficient than regexes – user2503048 Jul 16 '20 at 11:59
  • Thanks for your comment. I just visited this answer due to some other reason and wondered whether it worked or not. Yes, writing your own parser may be the only option sometimes when input is not structured. – anubhava Jul 16 '20 at 13:55
4

Generally for tokenizing you use match, not split:

> str = "/****/"
"/****/"
> str.match(/(\/\*)(.*?)(\*\/)/)
["/****/", "/*", "**", "*/"]

Also note how the non-greedy modifier ? solves the second problem.

georg
  • 211,518
  • 52
  • 313
  • 390
  • I am tokenizing gigantic strings, the C style multiline comment was just one example, how would that regex fit with the rest? (the rest being {}()[]+-/&^=! and so on). In the mean time I just stopped grabbing all *'s with the greedy modifier, while the rest of the tokens do use it. The worst case is when there are many of *'s, but since there are usually not many of them, I guess I can live with that. For the record, this is currently my regex: `/[ \t]+|\[+|\]+|\{+|\}+|\(+|\)+|\++|\-+|\*|\/+|<+|>+|&+|\|+|=+|!+|\\/` – user2503048 Nov 12 '13 at 01:00
  • @user2503048: well, there are quite a few syntax highlighters in javascript, maybe you can start by studying their source code? Your regexp looks sub-optimal. – georg Nov 12 '13 at 01:23