
Are there any recommended methods for implementing custom sort ordering for categorical data in PySpark? I'm ideally looking for the functionality that the pandas categorical data type offers.

So, given a dataset with a Speed column, the possible options are ["Super Fast", "Fast", "Medium", "Slow"]. I want to implement custom sorting that will fit the context.

If I use the default sorting, the categories are sorted alphabetically. Pandas allows you to change the column's data type to categorical, and part of that definition specifies a custom sort order: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Categorical.html
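
For reference, here is a minimal sketch of the pandas behaviour I mean (the sample DataFrame is made up; only the Speed categories come from above):

import pandas as pd

# Made-up sample data; only the Speed categories come from this question.
pdf = pd.DataFrame({"Speed": ["Slow", "Super Fast", "Medium", "Fast"]})

# Declaring Speed as an ordered categorical makes sorting follow the listed
# order instead of the alphabetical one.
pdf["Speed"] = pd.Categorical(
    pdf["Speed"],
    categories=["Super Fast", "Fast", "Medium", "Slow"],
    ordered=True,
)

print(pdf.sort_values("Speed"))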

  • You won't get a general solution like the one you have in pandas. For PySpark you can order by numbers or strings, so using your Speed column we could create a new column with "Super Fast" as 1, "Fast" as 2, "Medium" as 3, and "Slow" as 4, and then sort on that. If you could provide sample data with a Speed column, I'd be happy to provide code. – murtihash Mar 05 '20 at 01:08
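
A minimal sketch of that suggestion, assuming a DataFrame df with a Speed column (the "rank" column name is just illustrative):

from pyspark.sql.functions import col, when

# Map each category to a number in a new column, then sort on it.
# df and its Speed column are assumed to exist; "rank" is an illustrative name.
ranked = df.withColumn(
    "rank",
    when(col("Speed") == "Super Fast", 1)
    .when(col("Speed") == "Fast", 2)
    .when(col("Speed") == "Medium", 3)
    .when(col("Speed") == "Slow", 4),
)
ranked.orderBy("rank").show()

The answer below does the same mapping directly inside orderBy, without materializing the extra column.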

1 Answer


You can use orderBy and define your custom ordering using when:

from pyspark.sql.functions import col, when

df.orderBy(when(col("Speed") == "Super Fast", 1)
           .when(col("Speed") == "Fast", 2)
           .when(col("Speed") == "Medium", 3)
           .when(col("Speed") == "Slow", 4)
           )
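
For example, with a small made-up DataFrame (only the Speed values come from the question), the expression sorts rows from Super Fast to Slow:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when

spark = SparkSession.builder.getOrCreate()

# Made-up sample rows; only the Speed categories come from the question.
df = spark.createDataFrame(
    [("a", "Slow"), ("b", "Super Fast"), ("c", "Medium"), ("d", "Fast")],
    ["id", "Speed"],
)

# The same when-chain as above, kept as a reusable column expression.
speed_rank = (when(col("Speed") == "Super Fast", 1)
              .when(col("Speed") == "Fast", 2)
              .when(col("Speed") == "Medium", 3)
              .when(col("Speed") == "Slow", 4))

df.orderBy(speed_rank).show()

Note that values outside the four listed categories map to null, and Spark sorts nulls first in ascending order by default; adding an .otherwise(...) clause gives them an explicit rank.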