
Are there any recommended methods for implementing custom sort ordering for categorical data in PySpark? I'm ideally looking for the functionality that the pandas categorical data type offers.

So, given a dataset with a Speed column, the possible options are ["Super Fast", "Fast", "Medium", "Slow"]. I want to implement custom sorting that will fit the context.

If I use the default sorting, the categories are sorted alphabetically. Pandas allows you to change the column's data type to categorical, and part of that definition specifies a custom sort order: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Categorical.html
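
For reference, here is a minimal sketch of the pandas behaviour I mean (the sample DataFrame is made up; only the Speed categories come from above):

import pandas as pd

# Made-up sample data; only the Speed categories come from this question.
pdf = pd.DataFrame({"Speed": ["Slow", "Super Fast", "Medium", "Fast"]})

# Declaring Speed as an ordered categorical makes sorting follow the listed
# order instead of the alphabetical one.
pdf["Speed"] = pd.Categorical(
    pdf["Speed"],
    categories=["Super Fast", "Fast", "Medium", "Slow"],
    ordered=True,
)

print(pdf.sort_values("Speed"))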

  • You won't get a general solution like the one you have in pandas. For PySpark you can order by numbers or strings, so using your Speed column we could create a new column with "Super Fast" as 1, "Fast" as 2, "Medium" as 3, and "Slow" as 4, and then sort on that. If you could provide sample data with a Speed column, I'd be happy to provide code. – murtihash Mar 05 '20 at 01:08
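
A minimal sketch of that suggestion, assuming a DataFrame df with a Speed column (the "rank" column name is just illustrative):

from pyspark.sql.functions import col, when

# Map each category to a number in a new column, then sort on it.
# df and its Speed column are assumed to exist; "rank" is an illustrative name.
ranked = df.withColumn(
    "rank",
    when(col("Speed") == "Super Fast", 1)
    .when(col("Speed") == "Fast", 2)
    .when(col("Speed") == "Medium", 3)
    .when(col("Speed") == "Slow", 4),
)
ranked.orderBy("rank").show()

The answer below does the same mapping directly inside orderBy, without materializing the extra column.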

1 Answer


You can use orderBy and define your custom ordering using when:

from pyspark.sql.functions import col, when

df.orderBy(when(col("Speed") == "Super Fast", 1)
           .when(col("Speed") == "Fast", 2)
           .when(col("Speed") == "Medium", 3)
           .when(col("Speed") == "Slow", 4)
           )
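
For example, with a small made-up DataFrame (only the Speed values come from the question), the expression sorts rows from Super Fast to Slow:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when

spark = SparkSession.builder.getOrCreate()

# Made-up sample rows; only the Speed categories come from the question.
df = spark.createDataFrame(
    [("a", "Slow"), ("b", "Super Fast"), ("c", "Medium"), ("d", "Fast")],
    ["id", "Speed"],
)

# The same when-chain as above, kept as a reusable column expression.
speed_rank = (when(col("Speed") == "Super Fast", 1)
              .when(col("Speed") == "Fast", 2)
              .when(col("Speed") == "Medium", 3)
              .when(col("Speed") == "Slow", 4))

df.orderBy(speed_rank).show()

Note that values outside the four listed categories map to null, and Spark sorts nulls first in ascending order by default; adding an .otherwise(...) clause gives them an explicit rank.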