
I have a list containing some words, and I need to extract the matching words from a text line. I found the following, but it only extracts one word.

keys file content

this is a keyword

part_description file content

32015 this is a keyword hello world

Code

import pyspark.sql.functions as F
from pyspark.sql import Row

keywords = sc.textFile('file:///home/description_search/keys') #1
part_description = sc.textFile('file:///description_search/part_description') #2
keywords = keywords.map(lambda x: x.split(' ')) #3
keywords = keywords.collect()[0] #4
df = part_description.map(lambda r: Row(r)).toDF(['line']) #5
df.withColumn('extracted_word', F.regexp_extract(df['line'], '|'.join(keywords), 0)).show() #6

Outputs

+--------------------+--------------+
|                line|extracted_word|
+--------------------+--------------+
|32015   this is a...|          this|
+--------------------+--------------+

Expected output

+--------------------+-----------------+
|                line|   extracted_word|
+--------------------+-----------------+
|32015   this is a...|this,is,a,keyword|
+--------------------+-----------------+

I want to

  1. return all matching keywords and their count

  2. know whether step #4 is the most efficient way

Reproducible example:

keywords = ['this', 'is', 'a', 'keyword']
l = [('32015 this is a keyword hello world',),
     ('keyword this',),
     ('32015 this is a keyword hello world 32015 this is a keyword hello world',),
     ('keyword keyword',),
     ('is a',)]

columns = ['line']

df = spark.createDataFrame(l, columns)

3 Answers


In Spark 3.1+, regexp_extract_all is available:

regexp_extract_all(str, regexp[, idx]) - Extract all strings in the str that match the regexp expression and corresponding to the regex group index.

Your original question can now be solved like this:

# \b word boundaries ensure whole-word matches; the double backslash survives the
# Python fr-string and is then unescaped to \b by the Spark SQL string parser
re_pattern = '(' + '|'.join([fr'\\b{k}\\b' for k in keywords]) + ')'
df = df.withColumn('matched', F.expr(f"regexp_extract_all(line, '{re_pattern}', 1)"))
df = df.withColumn('count', F.size('matched'))

df.show()
#+--------------------+--------------------+-----+
#|                line|             matched|count|
#+--------------------+--------------------+-----+
#|32015 this is a k...|[this, is, a, key...|    4|
#|        keyword this|     [keyword, this]|    2|
#|32015 this is a k...|[this, is, a, key...|    8|
#|     keyword keyword|  [keyword, keyword]|    2|
#|                is a|             [is, a]|    2|
#+--------------------+--------------------+-----+
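If you also need the count per keyword (point 1 of the question), a minimal follow-up sketch, assuming the matched column built above, is to explode the array and aggregate:

# one row per matched keyword, then count the occurrences of each keyword
df.select(F.explode('matched').alias('keyword')) \
  .groupBy('keyword') \
  .count() \
  .show()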
  • Just can't figure out how many escapes are required. I tried to extract hashtags with "#[^ ,\n]+" but failed and fell back to a UDF – Philip Mok Mar 18 '22 at 13:38
  • I've just edited the answer – when working with regex patterns, it's better to use raw strings. Now I know that one can combine raw and f-strings using the prefix `fr`. When using raw strings, fewer backslashes are needed. – ZygD Jun 06 '22 at 13:26
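To illustrate the escaping question from these comments, a minimal sketch, assuming Spark 3.1+ and a DataFrame with a line column, that extracts hashtags without a UDF:

# two escape layers: the Python raw string keeps '\\n' intact, and the Spark SQL
# parser then unescapes it to '\n' before the pattern reaches the regex engine
df = df.withColumn('hashtags', F.expr(r"regexp_extract_all(line, '#[^ ,\\n]+', 0)"))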

I managed to solve it by using a UDF instead, as below:

import re

import pyspark.sql.functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType


def build_regex(keywords):
    # build '(\bkey1\b|\bkey2\b|...)' so only whole words match
    res = '('
    for key in keywords:
        res += '\\b' + key + '\\b|'
    res = res[0:len(res) - 1] + ')'  # drop the trailing '|'

    return res


def get_matching_string(line, regex):
    matches = re.findall(regex, line)
    return matches if matches else None


udf_func = udf(lambda line, regex: get_matching_string(line, regex),
               ArrayType(StringType()))

df = df.withColumn('matched', udf_func(df['line'], F.lit(build_regex(keywords)))) \
       .withColumn('count', F.size('matched'))

Result

+--------------------+--------------------+-----+
|                line|             matched|count|
+--------------------+--------------------+-----+
|32015    this is ...|[this, is, this, ...|    5|
|12832    Shb is a...|             [is, a]|    2|
|35015    this is ...|          [this, is]|    2|
+--------------------+--------------------+-----+
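Note that a Python UDF ships every row to a Python worker for evaluation, so on Spark 3.1+ the built-in regexp_extract_all shown in the answer above is generally the more efficient choice; the UDF approach remains useful on older Spark versions.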

PySpark REGEXP_EXTRACT_ALL with a temp view

Create a temp view:

df.select("user_id","line").createOrReplaceTempView("temp")

Select from the temp view and create a new temp view or dataset:

spark.sql("SELECT user_id,REGEXP_EXTRACT_ALL(line,'(#[a-zA-Z]+)',1) as MATCHED FROM temp").createOrReplaceTempView("temp2")

For this example I am using REGEXP_EXTRACT_ALL to extract hashtags.
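The same extraction also works without a temp view; a sketch of the DataFrame API equivalent, assuming Spark 3.1+:

import pyspark.sql.functions as F

# equivalent to the SQL above: extract every hashtag from `line`
df.select('user_id', F.expr("regexp_extract_all(line, '(#[a-zA-Z]+)', 1)").alias('MATCHED'))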
