
I'm struggling to create an efficient voting system on a Spark DataFrame that I can apply to certain columns in order to select the desired records/rows. Let's say my data looks like this:

+----------+-----------+-------------+----------+----------+
|   Names  |     A     |      B      |     C    |    D     |
+----------+-----------+-------------+----------+----------+
|        X1|       true|         true|      true|      true|
|        X5|      false|         true|      true|     false|
|        X2|      false|        false|     false|     false|
|        X3|       true|         true|      true|     false|
|        X4|      false|        false|     false|      true|
|        X5|      false|         true|      true|     false|
|        X3|       true|         true|      true|     false|
+----------+-----------+-------------+----------+----------+

I want to create a new column named majority, which counts the number of true values in each row and labels the row based on whether that count is:

  • more than 50%: label it abnormal
  • exactly 50%: label it 50-50
  • less than 50%: label it normal

In the absence of DataFrame.mode for Spark DataFrames, I tried to take inspiration from this post and wrote a def get_mode(df) function as below, using a window partition as they offered here & here. I also tried to use a regex to read true/false and count them, but it is clumsy:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

def get_mode(df):
    # count occurrences of each (Names, A, B, C, D) combination
    counts = df.groupBy(['Names', 'A', 'B', 'C', 'D']).count().alias('count')
    #counts.show()
    # rank each combination by how often it occurs within its (A, B, C, D) partition
    win = Window().partitionBy('A', 'B', 'C', 'D').orderBy(F.col('count').desc())
    result = (counts
              .withColumn('majority', F.rank().over(win))
              #.where(F.col('majority') == 1)
              #.select('x', 'y')
             )
    #result.show()

    return result

df = get_mode(df)

def voting_sys(df):
    partition_columns = df.columns[1:]
    vote_results = (df
        .withColumn('majority', F
            .over(Window.partitionBy(partition_columns))
            .when(F.isnull('majority'), '-')
            .when(F.regexp_extract('majority', '(?i)^true', 0) >  F.regexp_extract('majority', '(?i)^false', 0), 'abnormal')
            .when(F.regexp_extract('majority', '(?i)^true', 0) == F.regexp_extract('majority', '(?i)^false', 0), '50-50')
            .when(F.regexp_extract('majority', '(?i)^true', 0) <  F.regexp_extract('majority', '(?i)^false', 0), 'normal')
            #.otherwise('normal')
        )
        #.show()
    )

    return vote_results

Note: I'm not interested in hacking this with df.toPandas().

Mario
  • You could try converting the `bools` to `int`, then taking an average of the four columns and storing it in a fourth column. This would give you a number between 0 and 1 – whege Oct 14 '21 at 19:10
  • @LiamFiddler You mean to encode the A, B, C, D columns? That might be expensive (encoding and scaling by vectorizing), but thanks for your input. – Mario Oct 14 '21 at 19:12
  • No, you can simply cast the dtype from Boolean to Integer. No encoding necessary. – whege Oct 14 '21 at 19:15
  • I'm assuming the true / false are stored as Booleans rather than strings; are they stored as strings? – whege Oct 14 '21 at 19:19
  • Short question: in general, is sorting with *priority* using `orderBy()` more expensive than using `filter()` in a Spark DataFrame if I want to sort the records/rows? – Mario Oct 14 '21 at 21:06

2 Answers


I love @LiamFiddler's idea of casting Boolean to Integer. However, I don't recommend using a UDF here; it's unnecessary.

First, you can convert from string to Boolean, then from Boolean to Integer, like so: F.col(c).cast('boolean').cast('int'). I believe this transformation is not as expensive as it might sound.
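
For illustration, the double cast on a single column might look like this (my own sketch using column A from the sample data, not part of the original answer):

from pyspark.sql import functions as F

# string -> boolean -> int, so 'true' becomes 1 and 'false' becomes 0
df.select('A', F.col('A').cast('boolean').cast('int').alias('A_as_int')).show()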

Second, you don't have to hardcode the columns here (A, B, C, D); you can sum across them with sum([F.col(c) for c in cols]).

This is my working code:

from pyspark.sql import functions as F

cols = df.columns[1:]
# ['A', 'B', 'C', 'D']

(df
    .withColumn('sum', sum([F.col(c).cast('boolean').cast('int') for c in cols]))
    .withColumn('majority', F
        .when(F.col('sum')  > len(cols) / 2, 'abnormal')
        .when(F.col('sum') == len(cols) / 2, '50-50')
        .when(F.col('sum')  < len(cols) / 2, 'normal')
    )
    
    # order by abnormal, 50-50, normal
    .orderBy(F
        .when(F.col('majority') == 'abnormal', 1)
        .when(F.col('majority') == '50-50', 2)
        .when(F.col('majority') == 'normal', 3)
    )
    .show()
)

# Output
# +-----+-----+-----+-----+-----+---+--------+
# |Names|    A|    B|    C|    D|sum|majority|
# +-----+-----+-----+-----+-----+---+--------+
# |   X3| true| true| true|false|  3|abnormal|
# |   X3| true| true| true|false|  3|abnormal|
# |   X1| true| true| true| true|  4|abnormal|
# |   X5|false| true| true|false|  2|   50-50|
# |   X5|false| true| true|false|  2|   50-50|
# |   X4|false|false|false| true|  1|  normal|
# |   X2|false|false|false|false|  0|  normal|
# +-----+-----+-----+-----+-----+---+--------+
pltc
  • Honestly, I was thinking of using `XOR` or `XNOR` to resolve this issue, but you nailed it. So with this, I don't need to calculate the DataFrame **mode** using `groupBy()` to get `count()` (in general, no need for the 1st function `def get_mode(df)`). Is it OK if you shape your answer as `def voting_sys(df):` to have a solid answer? Maybe sort the rows by `abnormal`, then `50-50` & `normal`. – Mario Oct 14 '21 at 20:58
  • Well, I'm not a big fan of wrapping transformations in functions like that. You can consider using the [transform](http://spark.apache.org/docs/3.0.1/api/python/pyspark.sql.html#pyspark.sql.DataFrame.transform) function instead of a plain Python function; a sketch of that follows after this comment thread. – pltc Oct 14 '21 at 22:53
  • And I updated my answer to sort the df as well. – pltc Oct 14 '21 at 22:58
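
For reference, a minimal sketch of that transform-based approach could look like the following, assuming Spark 3.0+ where DataFrame.transform is available (the label_majority helper name is mine, not from the answer):

from pyspark.sql import DataFrame, functions as F

def label_majority(df: DataFrame) -> DataFrame:
    # hypothetical helper: count true values per row and label the row
    cols = df.columns[1:]
    return (df
            .withColumn('sum', sum([F.col(c).cast('boolean').cast('int') for c in cols]))
            .withColumn('majority', F
                .when(F.col('sum')  > len(cols) / 2, 'abnormal')
                .when(F.col('sum') == len(cols) / 2, '50-50')
                .otherwise('normal')))

# transform keeps the method-chaining style instead of calling a plain Python wrapper
result = df.transform(label_majority)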

My recommendation would be casting the Booleans as Integers, then taking an average of the four columns, which gives you a number between 0 and 1.

# if the values are stored as strings, you'll need a UDF to convert them to ints:
from pyspark.sql.functions import udf, col
from pyspark.sql.types import IntegerType

def true_false(s):
    # map the string 'true' (case-insensitive) to 1, anything else to 0
    return int(s.lower() == 'true')

tf_udf = udf(lambda x: true_false(x), IntegerType())

# withColumn returns a new DataFrame, so assign it back
for c in ['A', 'B', 'C', 'D']:
    df = df.withColumn(c, tf_udf(col(c)))

# average of the four 0/1 columns: a value between 0 and 1
df = df.withColumn('majority', (df.A + df.B + df.C + df.D) / 4)
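
To get the labels the question asks for from that 0-1 average, one possible follow-up would be a when/otherwise mapping like the sketch below (my addition, not part of this answer; it assumes the majority column now holds the average):

from pyspark.sql.functions import when, col

# map the average onto the requested labels
df = df.withColumn('majority',
                   when(col('majority') > 0.5, 'abnormal')
                   .when(col('majority') == 0.5, '50-50')
                   .otherwise('normal'))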
whege
  • I prefer to avoid using UDFs if possible. A UDF is like a black box to PySpark, so Spark cannot efficiently apply optimizations to it. However, I just noticed that your idea about converting `boolean` values into `int` using *casting* (not encoding) is not expensive and makes sense. – Mario Oct 14 '21 at 20:45
  • Cool, wasn't sure if the values were already stored as bools or not, but if so, then casting to ints is quite easy. – whege Oct 14 '21 at 20:51