I have a dataset with a column of array type. From this column, I need to create another column that holds the list of unique elements together with their counts, sorted by count in descending order. A map from element to count would also be fine.

For example, `[a, b, e, b]` should produce `[[b, a, e], [2, 1, 1]]`.

I wrote a UDF for this (see below), but it is very slow, so I would like to do this with PySpark built-in functions instead.
| id | col_a | collected_col_a |
|----|-------|-----------------|
| 1  | a     | [a, b, e, b]    |
| 1  | b     | [a, b, e, b]    |
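
For reference, a minimal snippet to reproduce the sample data above (the SparkSession setup and the `collect_list` origin of `collected_col_a` are my assumptions):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data matching the table above: each row of an id carries the
# same collected array (e.g. from a prior collect_list over col_a)
df = spark.createDataFrame(
    [(1, "a", ["a", "b", "e", "b"]),
     (1, "b", ["a", "b", "e", "b"])],
    ["id", "col_a", "collected_col_a"],
)
```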
import numpy as np

from pyspark.sql.functions import udf
from pyspark.sql.types import (ArrayType, IntegerType, StringType,
                               StructField, StructType)

# Return type: a struct of two parallel arrays (elements and their counts)
struct_schema1 = StructType([
    StructField('elements', ArrayType(StringType()), nullable=True),
    StructField('count', ArrayType(IntegerType()), nullable=True)
])

# udf
@udf(returnType=struct_schema1)
def func1(x, top=10):
    # y: unique elements, z: their occurrence counts
    y, z = np.unique(x, return_counts=True)
    # sort the elements by their counts, descending
    z_y = zip(z.tolist(), y.tolist())
    y = [i for _, i in sorted(z_y, reverse=True)]
    z = sorted(z.tolist(), reverse=True)
    # keep only the `top` most frequent elements
    if len(y) > top:
        return {'elements': y[:top], 'count': z[:top]}
    else:
        return {'elements': y, 'count': z}
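
For illustration, here is a rough, untested sketch of the kind of built-in approach I am hoping for. It is per-row like `func1`, and it assumes Spark 2.4+ for the higher-order functions `transform`, `filter`, and `array_sort`, plus the `collected_col_a` column from the table above:

```python
from pyspark.sql import functions as F

# Build a (cnt, elem) struct per distinct element; array_sort orders
# structs field by field (ascending), so reverse gives descending-by-count,
# and slice keeps the 10 most frequent, mirroring the UDF's `top`.
pairs = F.expr("""
    slice(
        reverse(array_sort(
            transform(array_distinct(collected_col_a),
                x -> struct(size(filter(collected_col_a, y -> y = x)) AS cnt,
                            x AS elem)))),
        1, 10)
""")

result = (df
          .withColumn("pairs", pairs)
          .withColumn("elements", F.col("pairs.elem"))  # array of elements
          .withColumn("count", F.col("pairs.cnt"))      # parallel array of counts
          .drop("pairs"))
```

An `explode` + `groupBy` + `collect_list` variant would also avoid the Python UDF, but the per-row higher-order form above matches what `func1` does most closely. Is something along these lines the right way to do it?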