The regexp_extract function takes three parameters:
- Column value
- Regex pattern
- Group index
def regexp_extract(e: org.apache.spark.sql.Column, exp: String, groupIdx: Int): org.apache.spark.sql.Column
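To make the role of the group index concrete, here is a minimal sketch in plain Scala, without Spark. `extractGroup` is a hypothetical helper, not part of the Spark API. It assumes the documented behaviour of regexp_extract: Java regex semantics, first match only, and an empty string when nothing matches.

```scala
import java.util.regex.Pattern

// Hypothetical helper (not a Spark function): applies the pattern to one
// string and returns the requested group of the first match.
def extractGroup(input: String, pattern: String, groupIdx: Int): String = {
  val m = Pattern.compile(pattern).matcher(input)
  // Empty string if the pattern, or the requested group, did not match.
  if (m.find()) Option(m.group(groupIdx)).getOrElse("") else ""
}

val row = "prm_2020 P02 United Kingdom London 2 for 2"

val whole  = extractGroup(row, "(P[0-9]*)", 0) // group 0: the entire match
val first  = extractGroup(row, "(P[0-9]*)", 1) // group 1: the first parenthesised group
val digits = extractGroup(row, "P([0-9]+)", 1) // group 1 here holds only the digits
```

With the pattern `(P[0-9]*)` the parentheses wrap the whole expression, so index 0 and index 1 give the same text. The indexes differ only when the capture group covers part of the match, as in the last line.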
Your regexp_extract call is missing the last parameter, the group index.
See the code below.
scala> df.show(truncate = false)
+------------------------------------------+
|data                                      |
+------------------------------------------+
|prm_2020 P02 United Kingdom London 2 for 2|
|prm_2020 P2 United Kingdom London 2 for 2 |
|prm_2020 P10 United Kingdom London 2 for 2|
|prm_2020 P11 United Kingdom London 2 for 2|
+------------------------------------------+
df
  .withColumn("parsed_data", regexp_extract(col("data"), "(P[0-9]*)", 0))
  .show(truncate = false)
+------------------------------------------+-----------+
|data                                      |parsed_data|
+------------------------------------------+-----------+
|prm_2020 P02 United Kingdom London 2 for 2|P02        |
|prm_2020 P2 United Kingdom London 2 for 2 |P2         |
|prm_2020 P10 United Kingdom London 2 for 2|P10        |
|prm_2020 P11 United Kingdom London 2 for 2|P11        |
+------------------------------------------+-----------+
df.createTempView("tbl")
spark
  .sql("select data, regexp_extract(data, '(P[0-9]*)', 0) as parsed_data from tbl")
  .show(truncate = false)
+------------------------------------------+-----------+
|data                                      |parsed_data|
+------------------------------------------+-----------+
|prm_2020 P02 United Kingdom London 2 for 2|P02        |
|prm_2020 P2 United Kingdom London 2 for 2 |P2         |
|prm_2020 P10 United Kingdom London 2 for 2|P10        |
|prm_2020 P11 United Kingdom London 2 for 2|P11        |
+------------------------------------------+-----------+