0

This is a followup on Error: Failed to parse regular expression "": pattern too large - compile failed...

I am hitting this problem again.

Can anyone please please explain how BigQuery allocates resources for REGEXP functions? Is there a known issue (and ways to avoid it) of "too many"\"too complex" REGEXP expressions in a single BQ query?

Many Thanks

Community
  • 1
  • 1
N.N.
  • 3,094
  • 21
  • 41
  • This for doing splits? – Felipe Hoffa Apr 27 '14 at 22:58
  • No, the splits query was just a sample. I just have many different regexp expressions in my query. Only few of them are for doing splits. – N.N. Apr 28 '14 at 06:08
  • I personally don't have a specific answer to this specific question, but what you really want is better string handling functions? – Felipe Hoffa Apr 28 '14 at 18:46
  • Fh. Regex is not very common among SQL guys (like me) and we tend to look for alternatives. However, when you learn the dialect, RegEX functions are the most flexible string handling solution. what I miss is better understanding the limits of using it on BQ. Standard "Split" function will be useful, but i don't think its missing (if regex can handle splits). – N.N. Apr 29 '14 at 08:03

1 Answers1

2

Regular expression parsing has some unfortunate corner cases that can eat up a lot of processing cycles (see http://www.regular-expressions.info/catastrophic.html). The regular expression state machines can also eat up a lot of memory. Because of this, BigQuery has a budget for regular expressions. The current value is something like 2MB per query. This is a guideline, not a published guarantee, however, and is subject to change at any time.

BigQuery is documented to use Re2 (code here: https://code.google.com/p/re2/) for regular expression parsing; You can try running your regular expressions with this library using a budget of 2MB and check whether they succeed or not.

Jordan Tigani
  • 26,089
  • 4
  • 60
  • 63
  • Thanks Jordan. following your answer, few additional questions: Can you please elaborate how the 2MB are allocated (per regex function, per processed field, other... )? I would like to emphasis that the error is during compilation, before resources were consumed. How does BQ estimates the resource consumption? after optimizing my regex expressionsfor minimum resource consumption, do you have any suggestions for overcoming such limits? – N.N. Apr 29 '14 at 08:12
  • 1
    The 2MB are allocated per query. Note that this is for the data structures that hold the regular expressions, not for the actual parsing. The source code for re2 at the link above has more information about how memory budgets are used. Regarding how to optimize, I'd suggest using simpler regexps, replacing them with other string manipulation functions if possible, or breaking up the query into pieces. – Jordan Tigani Apr 29 '14 at 14:09
  • Many thanks. I now understand what type of measures i need to take to optimize my queries. May I suggest adding this info to the BQ Query Reference documentation. – N.N. Apr 29 '14 at 15:02