Regular expression to find Specific character in a string

Question

I have these sample values

prm_2020 P02 United Kingdom London 2 for 2
prm_2020 P2 United Kingdom London 2 for 2
prm_2020 P10 United Kingdom London 2 for 2
prm_2020 P11 United Kingdom London 2 for 2

Need to find P2, P02, P11,p06,p05 like this, trying to use Regexp_extract function in databricks. struggling to find the correct expression. Once i find P10, p6 from string i need to put numbers in new column called ID

select distinct
    promo_name
   ,regexp_extract(promo_name, '(?<=p\d+\s+)P\d+') as regexp_id
from stock
where promo_name is not null


select distinct
    promo_name
   ,regexp_extract(promo_name, 'P[0-9]+') as regexp_id
from stock
where promo_name is not null

both generating errors

https://stackoverflow.com/questions/63017975/string-comparison-in-databricks-spark-sql/63024352#63024352 — user3061338, Jul 22 '20 at 12:10

score 0 · Answer 1 · answered Jul 22 '20 at 10:23

0

The expression would be:

select regexp_extract(col, 'P[0-9]+')

answered Jul 22 '20 at 10:23

Gordon Linoff

1,242,037
58
646
786

Error in SQL statement: SparkException: Job aborted due to stage failure: Task 6 in stage 179.0 failed 4 times, most recent failure: Lost task 6.3 in stage 179.0 (TID 12754, 10.92.191.203, executor 23): java.lang.IndexOutOfBoundsException: No group 1 – user3061338 Jul 22 '20 at 10:30
you should probably deal with strings that will return no match in your select statement, by using some sort of IF NULL construct. – Veverke Jul 22 '20 at 10:31
thanks veverke, there is no null value in my column as it has already been filtered – user3061338 Jul 22 '20 at 10:37
@user3061338 . . . I do not see how a `regexp_extract()` could cause an out-of-bounds error. Something else seems to be going on. – Gordon Linoff Jul 22 '20 at 10:56

score 0 · Answer 2 · answered Jul 22 '20 at 10:26

One regex could be (?<=prm_\d+\s+)P\d+ Besides searching for strings in the form of P* where * is a digit, it also checks that such strings are preceded by strings in the form prm_* where * is a digit.

Keep in mind case sensitivity. The solution above IS case sensitive (if your input comes as PRM, then your match will be discarded.) I am not familiar with apache-spark but I assume it supports parameters such as /i as other platforms to indicate the regex should be case insensitive.

Regexr.com demo

score 0 · Answer 3 · answered Jul 22 '20 at 11:40

0

Just select the group 0

regexp_extract(promo_name, 'P[0-9]+',0)

answered Jul 22 '20 at 11:40

Shubham Jain

5,327
2
15
38

score 0 · Answer 4 · answered Jul 22 '20 at 13:13

function regexp_extract will take 3 parameters.

Column value
Regex Pattern
Group Index

def regexp_extract(e: org.apache.spark.sql.Column,exp: String,groupIdx: Int): org.apache.spark.sql.Column

You are missing last parameter in regexp_extract function.

Check below code.

scala> df.show(truncate=False)
+------------------------------------------+
|data                                      |
+------------------------------------------+
|prm_2020 P02 United Kingdom London 2 for 2|
|prm_2020 P2 United Kingdom London 2 for 2 |
|prm_2020 P10 United Kingdom London 2 for 2|
|prm_2020 P11 United Kingdom London 2 for 2|
+------------------------------------------+

df
.withColumn("parsed_data",regexp_extract(col("data"),"(P[0-9]*)",0))
.show(truncate=False)
+------------------------------------------+-----------+
|data                                      |parsed_data|
+------------------------------------------+-----------+
|prm_2020 P02 United Kingdom London 2 for 2|P02        |
|prm_2020 P2 United Kingdom London 2 for 2 |P2         |
|prm_2020 P10 United Kingdom London 2 for 2|P10        |
|prm_2020 P11 United Kingdom London 2 for 2|P11        |
+------------------------------------------+-----------+

df.createTempView("tbl")
spark
.sql("select data,regexp_extract(data,'(P[0-9]*)',0) as parsed_data from tbl")
.show(truncate=False)
+------------------------------------------+-----------+
|data                                      |parsed_data|
+------------------------------------------+-----------+
|prm_2020 P02 United Kingdom London 2 for 2|P02        |
|prm_2020 P2 United Kingdom London 2 for 2 |P2         |
|prm_2020 P10 United Kingdom London 2 for 2|P10        |
|prm_2020 P11 United Kingdom London 2 for 2|P11        |
+------------------------------------------+-----------+

Regular expression to find Specific character in a string

4 Answers4

Linked