
My dataframe:

+---+---+---+---+
|  a|  b|  c|  d|
+---+---+---+---+
|123| xx|  1|  0|
|abc| xx|  1|  0|
|w23| yy|  1|  0|
| 11| yy|  2|  0|
| 23| xx|  2|  0|
|abc| zz|  3|  1|
+---+---+---+---+

I have to check if the value in each column matches a certain rule. For example:

  • If column 'a' has a number, column 'b' has xx or yy, column 'c' has 1 or 2, and column 'd' has 0 -> then the output should be 'output1'
  • It is not necessary that all columns should have rules. If a rule does not exist then it should simply ignore it. E.g., for 'output3', it does not matter what is there in column 'c'.
  • If it does not match any rules, it should say 'no matches found'.

Since there are so many rules, I created a dictionary of regex rules as follows:

rules_dict = {
    'output1': {'a': '^[0-9]*$',
                'b': 'xx | yy',
                'c': '^[1-2]*$',
                'd': '0'},
    'output2': {'a': '^[a-z]+$',
                'b': 'xx | yy',
                'c': '1',
                'd': '0'},
    'output3': {'a': '^[a-zA-Z0-9_.-]*$',
                'b': 'xx | yy',
                'd': '0'},
    'output4': {'a': '^[0-9]*$',
                'b': 'xx | yy',
                'c': '^[1-2]*$',
                'd': '0'}
}

The expected output:

(image of the expected output: the dataframe above with an added 'output' column)

I used the following PySpark script:

from pyspark.sql import functions as F

for out in rules_dict.keys():
    for column, rule in rules_dict[out].items():
        output_df = df.withColumn('output', F.when(df[column].rlike(rule), out).otherwise('no matches found'))
output_df.show()

But the output is:

(image of the actual output: every row shows 'no matches found')

P.S.: I am doing this for a large dataset with a large number of rules; I have only created a small sample to simplify the question.


2 Answers


The thing is, you apply every rule to the column separately, and each `withColumn` call builds `output_df` from the original `df` again, so every pass overwrites the previous one. If the first pass yielded "output1", the second would overwrite it with "no matches found"; the third would overwrite both earlier results and return just "output3" or "no matches found"; and so on. Effectively you only see the result of the 4th rule, and since its pattern has an error, you only get "no matches found".
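
To make the overwrite concrete: since each `withColumn` starts from the original `df`, the whole double loop collapses to the single statement produced by its final iteration. A minimal sketch (`last_out`, `last_column` and `last_rule` are hypothetical names for whichever rule the loop happens to visit last):

output_df = df.withColumn(
    'output',
    # only this final when/otherwise survives; every earlier iteration's
    # result was discarded because output_df was rebuilt from df
    F.when(df[last_column].rlike(last_rule), last_out)
     .otherwise('no matches found')
)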

Input:

from pyspark.sql import functions as F
df = spark.createDataFrame(
    [('123', 'xx', '1', '0'),
     ('abc', 'xx', '1', '0'),
     ('w23', 'yy', '1', '0'),
     ( '11', 'yy', '2', '0'),
     ( '23', 'xx', '2', '0'),
     ('abc', 'zz', '3', '1')],
    ['a', 'b', 'c', 'd'])

rules_dict = {
    'output1': {'a': '^[0-9]*$',
                'b': 'xx|yy',
                'c': '^[1-2]*$',
                'd': '0'},
    'output2': {'a': '^[a-z]+$',
                'b': 'xx|yy',
                'c': '1',
                'd': '0'},
    'output3': {'a': '^[a-zA-Z0-9_.-]*$',
                'b': 'xx|yy',
                'd': '0'},
    'output4': {'a': '^[0-9]*$',
                'b': 'xx|yy',
                'c': '^[1-2]*$',
                'd': '0'}
}

Script:

conds = F  # start from the functions module: the very first .when below is F.when(...)
for out in rules_dict:
    ands = F.lit(True)  # AND together the regex checks of every column in this rule
    for c, p in rules_dict[out].items():
        ands &= F.col(c).rlike(p)
    conds = conds.when(ands, out)  # chain one .when per rule, in dict order
conds = conds.otherwise('no matches found')  # fallback when no rule matches

df = df.withColumn("output", conds)

df.show()
# +---+---+---+---+----------------+
# |  a|  b|  c|  d|          output|
# +---+---+---+---+----------------+
# |123| xx|  1|  0|         output1|
# |abc| xx|  1|  0|         output2|
# |w23| yy|  1|  0|         output3|
# | 11| yy|  2|  0|         output1|
# | 23| xx|  2|  0|         output1|
# |abc| zz|  3|  1|no matches found|
# +---+---+---+---+----------------+

I have fixed your regex patterns, but you still don't get "output4". That's because the first rule also matches that row. The first rule comes first in the `when` chain, so it returns "output1" and the remaining `when` conditions are never evaluated.
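
Written out by hand, the column expression the loop builds is roughly the following sketch, where `cond1` … `cond4` are hypothetical names for the AND-ed regex checks of each rule:

conds = (F.when(cond1, 'output1')   # evaluated first, so it wins ties
          .when(cond2, 'output2')
          .when(cond3, 'output3')
          .when(cond4, 'output4')
          .otherwise('no matches found'))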

– ZygD
  • Very elegant solution. Could you please help me understand: 1. Why, and what is it, that you do in `conds = F`? I take it `F` is `from pyspark.sql import functions as F`, right? But why is it necessary? 2. In the 6th line I tried `conds = F.when(ands, out)` (so I just substituted the first usage of `conds` with `F`), but this does not result in the desired output. Why is that, if `F` is just a reference to the functions? – Diego Aviles Oct 20 '22 at 16:36
  • Thank you. 1. Yes, F represents the imported functions. `conds = F` is necessary at the beginning, because later you do `conds = conds.when(ands, out)`, i.e. `F.when(...`. 2. That's probably because again you would be overwriting previous `F.when` conditions and ending up only with the last rule. That whole `for` loop creates just one column expression encompassing all the rules and regex patterns. – ZygD Oct 20 '22 at 17:22
  • Thank you for the answer, that's a really nice solution and thanks for explaining the conds in the comment! – K_Raikar Oct 20 '22 at 22:21

@ZygD's answer is very good, but you could also use comprehensions and `functools.reduce` so that you don't have to write an explicit for loop over the config. The other difference is that I am using `coalesce` to get the first non-null result:

import functools
from pyspark.sql import functions as F

df.withColumn("output", F.coalesce(
    # one F.when per rule; it yields NULL where the rule does not match,
    # so coalesce picks the first rule that does
    *[
        F.when(
            # AND together the regex checks of all columns in this rule
            functools.reduce(
                lambda x, y: x & y,
                [F.col(column).rlike(rule)
                 for column, rule in rules_dict[out].items()]
            ),
            F.lit(out)
        )
        for out in rules_dict
    ],
    F.lit('no matches found')  # fallback if every rule yielded NULL
)).show()
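
Since a `when` without an `otherwise` evaluates to NULL where its condition is false, `coalesce` returns the label of the first rule that matches; against the sample dataframe above this should give the same table as in the first answer.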
– botchniaque