2

I am trying to extract text that exists inside root level brackets from a string in Spark-SQL. I have used the function regexp_extract() on both Spark-SQL and Athena on the same string with the same regex.

On Athena, it's working fine.

But on Spark-SQL, it is not returning the value as expected.

Query is:

SELECT regexp_extract('Russia (Federal Service of Healthcare)', '.+\s\((.+)\)', 1) AS cl

Output On Athena:

Federal Service of Healthcare

Output on Spark-SQL:

ia (Federal Service of Healthcare)

I am bumping my head around but can't seem to find a solution around this.

Cyber Knight
  • 184
  • 2
  • 10

1 Answers1

1

This does the trick:

SELECT regexp_extract('Russia (Federal Service of Healthcare)', '.+\\\\s\\\\((.+)\\\\)', 1) AS cl

output:

+-----------------------------+
|cl                           |
+-----------------------------+
|Federal Service of Healthcare|
+-----------------------------+

The s is not being escaped in your example, that's why it falls as part of the group; you can also use the regexp_extract API directly which makes a cleaner solution:

.withColumn("cl", regexp_extract(col("name"), ".+\\s\\((.+)\\)", 1))

Good luck!