I'm struggling to create an efficient voting system over certain columns of a Spark DataFrame, in order to select the desired records/rows. Let's say my data looks as below:
+-----+-----+-----+-----+-----+
|Names|    A|    B|    C|    D|
+-----+-----+-----+-----+-----+
|   X1| true| true| true| true|
|   X5|false| true| true|false|
|   X2|false|false|false|false|
|   X3| true| true| true|false|
|   X4|false|false|false| true|
|   X5|false| true| true|false|
|   X3| true| true| true|false|
+-----+-----+-----+-----+-----+
I want to create a new column named majority, which counts the number of true values in each row and labels the row:
- more than 50% true: abnormal
- exactly 50% true: 50-50
- less than 50% true: normal
Since Spark DataFrames have no DataFrame.mode, I took inspiration from this post and wrote a get_mode(df) function using a window partition, as suggested here & here. I also tried using regex to match true/false and count them, but that approach is clumsy:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

def get_mode(df):
    counts = df.groupBy(['Names', 'A', 'B', 'C', 'D']).count().alias('count')
    # counts.show()
    win = Window().partitionBy('A', 'B', 'C', 'D').orderBy(F.col('count').desc())
    result = (counts
              .withColumn('majority', F.rank().over(win))
              # .where(F.col('majority') == 1)
              # .select('x', 'y')
              )
    # result.show()
    return result

df = get_mode(df)
def voting_sys(df):
    partition_columns = df.columns[1:]
    vote_results = (df
                    .withColumn('majority', F
                                # .over(Window.partitionBy(partition_columns))  # misplaced: F has no .over
                                .when(F.isnull('majority'), '-')
                                .when(F.regexp_extract('majority', '(?i)^true', 0) > F.regexp_extract('majority', '(?i)^false', 0), 'abnormal')
                                .when(F.regexp_extract('majority', '(?i)^true', 0) == F.regexp_extract('majority', '(?i)^false', 0), '50-50')
                                .when(F.regexp_extract('majority', '(?i)^true', 0) < F.regexp_extract('majority', '(?i)^false', 0), 'normal')
                                # .otherwise('normal')
                                )
                    )
    # vote_results.show()
    return vote_results
Note: I'm not interested in hacking around this with df.toPandas().