Regex in Spark SQL is resulting in wrong value but working fine on Amazon Athena

Question

I am trying to extract text that exists inside root level brackets from a string in Spark-SQL. I have used the function regexp_extract() on both Spark-SQL and Athena on the same string with the same regex.

On Athena, it's working fine.

But on Spark-SQL, it is not returning the value as expected.

Query is:

SELECT regexp_extract('Russia (Federal Service of Healthcare)', '.+\s\((.+)\)', 1) AS cl

Output On Athena:

Federal Service of Healthcare

Output on Spark-SQL:

ia (Federal Service of Healthcare)

I am bumping my head around but can't seem to find a solution around this.

maybe this? https://stackoverflow.com/a/58529615. so you need `'.+\\\s\\\((.+)\\\)'` in Spark SQL. — Emma, Jul 22 '22 at 17:20

score 1 · Accepted Answer · answered Jul 22 '22 at 18:38

This does the trick:

SELECT regexp_extract('Russia (Federal Service of Healthcare)', '.+\\\\s\\\\((.+)\\\\)', 1) AS cl

output:

+-----------------------------+
|cl                           |
+-----------------------------+
|Federal Service of Healthcare|
+-----------------------------+

The s is not being escaped in your example, that's why it falls as part of the group; you can also use the regexp_extract API directly which makes a cleaner solution:

.withColumn("cl", regexp_extract(col("name"), ".+\\s\\((.+)\\)", 1))

Good luck!

Regex in Spark SQL is resulting in wrong value but working fine on Amazon Athena

1 Answers1