I'm using the regexp_extract SQL function in Spark 2.2, from a Jupyter (Scala) notebook, to match strings made up of a single character repeated 11 or more times.
Here's the regex:
^(.)\1{10,}$
Now, let's look at that pattern with the regexp_extract function. Here's how I've used it in my notebook:
spark.sql("SELECT REGEXP_EXTRACT('hhhhhhhhhhhhh', '^(.)\\1{10,}$', 1) as ExtractedChar").show()
+-------------+
|ExtractedChar|
+-------------+
| |
+-------------+
Odd, the extracted value is empty. Let's make sure my regex pattern is actually correct. Yep, looks right.
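(For reference, this is roughly the kind of check I mean, in plain Scala outside of Spark; a quick sketch rather than the exact notebook code.)
val pattern = "^(.)\\1{10,}$".r
// 13 repetitions (>= 11) should match and capture the repeated character.
println(pattern.findFirstMatchIn("h" * 13).map(_.group(1))) // Some(h)
// Fewer than 11 repetitions should not match.
println(pattern.findFirstMatchIn("h" * 5).map(_.group(1)))  // None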
You may be wondering why the regex pattern is written with "\\" rather than a single "\": the backslash is an escape character in Scala string literals, so it has to be doubled to produce one literal backslash. Here's some verification:
val string = "SELECT REGEXP_EXTRACT('hhhhhhhhhhhhhhhhhhhhh', '^(.)\\1{10,}$', 1) as ExtractedChar"
println(string)
SELECT REGEXP_EXTRACT('hhhhhhhhhhhhhhhhhhhhh', '^(.)\1{10,}$', 1) as ExtractedChar
Alright, let's make sure the regexp_extract function is working correctly:
spark.sqlContext.sql("SELECT REGEXP_EXTRACT('TESTING', '^.', 0) as test").show()
+----+
|test|
+----+
| T|
+----+
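For what it's worth, a capturing group on its own (with no backreference) should also be fine; a sketch like this ought to return "T" for group 1:
spark.sqlContext.sql("SELECT REGEXP_EXTRACT('TESTING', '^(.)', 1) as test").show()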
Okay, maybe the issue is the Jupyter notebook? I tried the same query in the plain Scala REPL and ran into exactly the same issue.
Any ideas why I'm unable to get this regex to successfully match?
Edit: Spark SQL is a requirement here. I could write my own UDF in Scala; however, UDFs are a black box to Spark's optimizer, meaning they will not be fully optimized.
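For illustration only, the kind of UDF I'd rather avoid would look roughly like this (a sketch with a made-up name, not code I plan to ship):
spark.udf.register("extract_repeated_char", (s: String) => {
  // Same backreference pattern, but evaluated on the JVM side instead of by Spark SQL.
  val pattern = "^(.)\\1{10,}$".r
  pattern.findFirstMatchIn(s).map(_.group(1)).getOrElse("")
})
spark.sql("SELECT extract_repeated_char('hhhhhhhhhhhhh') as ExtractedChar").show()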