I'm using the regexp_extract SQL function in Spark 2.2, from a Jupyter (Scala) notebook, to match strings made up of a single character repeated 11 or more times.
Here's the regex:
^(.)\1{10,}$
Now, let's look at that pattern with the regexp_extract function. Here's how I've used it in my notebook:
spark.sql("SELECT REGEXP_EXTRACT('hhhhhhhhhhhhh', '^(.)\\1{10,}$', 1) as ExtractedChar").show()
+-------------+
|ExtractedChar|
+-------------+
| |
+-------------+
Odd, the extracted value is empty. Let's make sure my regex pattern is actually correct. Yep, looks right.
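(For reference, this is roughly the kind of check I mean, in plain Scala outside of Spark; a quick sketch rather than the exact notebook code.)
val pattern = "^(.)\\1{10,}$".r
// 13 repetitions (>= 11) should match and capture the repeated character.
println(pattern.findFirstMatchIn("h" * 13).map(_.group(1))) // Some(h)
// Fewer than 11 repetitions should not match.
println(pattern.findFirstMatchIn("h" * 5).map(_.group(1)))  // None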
You may be wondering why the regex pattern is written with "\\" rather than a single "\": the backslash is an escape character in Scala string literals, so it has to be doubled to produce one literal backslash. Here's some verification:
val string = "SELECT REGEXP_EXTRACT('hhhhhhhhhhhhhhhhhhhhh', '^(.)\\1{10,}$', 1) as ExtractedChar"
println(string)
SELECT REGEXP_EXTRACT('hhhhhhhhhhhhhhhhhhhhh', '^(.)\1{10,}$', 1) as ExtractedChar
Alright, let's make sure the regexp_extract function is working correctly:
spark.sqlContext.sql("SELECT REGEXP_EXTRACT('TESTING', '^.', 0) as test").show()
+----+
|test|
+----+
| T|
+----+
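For what it's worth, a capturing group on its own (with no backreference) should also be fine; a sketch like this ought to return "T" for group 1:
spark.sqlContext.sql("SELECT REGEXP_EXTRACT('TESTING', '^(.)', 1) as test").show()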
Okay, maybe the issue is the Jupyter notebook? I tried the same query in the plain Scala REPL and ran into exactly the same issue.
Any ideas why I'm unable to get this regex to successfully match?
Edit: Spark SQL is a requirement here. I could write my own UDF in Scala; however, UDFs are a black box to Spark's optimizer, meaning they will not be fully optimized.
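For illustration only, the kind of UDF I'd rather avoid would look roughly like this (a sketch with a made-up name, not code I plan to ship):
spark.udf.register("extract_repeated_char", (s: String) => {
  // Same backreference pattern, but evaluated on the JVM side instead of by Spark SQL.
  val pattern = "^(.)\\1{10,}$".r
  pattern.findFirstMatchIn(s).map(_.group(1)).getOrElse("")
})
spark.sql("SELECT extract_repeated_char('hhhhhhhhhhhhh') as ExtractedChar").show()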