
I have the following Spark DataFrame:

id weekly_sale
1    40000
2    120000
3    135000
4    211000
5    215000
6    331000
7    337000

I need to see into which of the following intervals the values in the weekly_sale column fall:

under 100000
between 100000 and 200000
between 200000 and 300000
more than 300000

So my desired output would look like:

id weekly_sale  label
1    40000       under 100000    
2    120000      between 100000 and 200000
3    135000      between 100000 and 200000
4    211000      between 200000 and 300000
5    215000      between 200000 and 300000
6    331000      more than 300000
7    337000      more than 300000

Any PySpark, Spark SQL, or HiveContext implementation would help me.


1 Answer


Assuming ranges and labels are defined as follows:

splits = [float("-inf"), 100000.0, 200000.0, 300000.0, float("inf")]
labels = [
    "under 100000", "between 100000 and 200000", 
    "between 200000 and 300000", "more than 300000"]

df = sc.parallelize([
    (1, 40000.0), (2, 120000.0), (3, 135000.0),
    (4, 211000.0), (5, 215000.0), (6, 331000.0),
    (7, 337000.0)
]).toDF(["id", "weekly_sale"])

one possible approach is to use Bucketizer:

from pyspark.ml.feature import Bucketizer
from pyspark.sql.functions import array, col, lit

bucketizer = Bucketizer(
    splits=splits, inputCol="weekly_sale", outputCol="split"
)

with_split = bucketizer.transform(df)

and attach labels later:

label_array = array(*(lit(label) for label in labels))

with_split.withColumn(
    "label", label_array.getItem(col("split").cast("integer"))
).show(10, False)

## +---+-----------+-----+-------------------------+
## |id |weekly_sale|split|label                    |
## +---+-----------+-----+-------------------------+
## |1  |40000.0    |0.0  |under 100000             |
## |2  |120000.0   |1.0  |between 100000 and 200000|
## |3  |135000.0   |1.0  |between 100000 and 200000|
## |4  |211000.0   |2.0  |between 200000 and 300000|
## |5  |215000.0   |2.0  |between 200000 and 300000|
## |6  |331000.0   |3.0  |more than 300000         |
## |7  |337000.0   |3.0  |more than 300000         |
## +---+-----------+-----+-------------------------+
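Note that Bucketizer buckets are left-closed and right-open (except the last one), so a weekly_sale of exactly 200000.0 would get the "between 200000 and 300000" label. If you don't need the intermediate bucket index, you can drop it once the label is attached; a minimal follow-up sketch using the names defined above:

labeled = with_split.withColumn(
    "label", label_array.getItem(col("split").cast("integer"))
).drop("split")  # keep only id, weekly_sale and label

labeled.show(10, False)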

There are, of course, different ways to achieve the same goal. For example, you can create a lookup table:

from toolz import sliding_window
from pyspark.sql.functions import broadcast

mapping = [
    (lower, upper, label) for ((lower, upper), label)
    in zip(sliding_window(2, splits), labels)
]

lookup_df = sc.parallelize(mapping).toDF(["lower", "upper", "label"])

df.join(
    broadcast(lookup_df),
    (col("weekly_sale") >= col("lower")) & (col("weekly_sale") < col("upper"))
).drop("lower").drop("upper")
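Here broadcast marks lookup_df as small enough to be replicated to every executor, which keeps this non-equi join cheap. A possible usage sketch (the result is assigned to a variable; orderBy is only for readability, since row order after a join is not guaranteed):

labeled = df.join(
    broadcast(lookup_df),
    (col("weekly_sale") >= col("lower")) & (col("weekly_sale") < col("upper"))
).drop("lower").drop("upper")

labeled.orderBy("id").show(10, False)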

or generate a lookup expression:

from functools import reduce
from pyspark.sql.functions import when

def in_range(c):
    def in_range_(acc, x):        
        lower, upper, label = x
        return when(
            (c >= lit(lower)) & (c < lit(upper)), lit(label)
        ).otherwise(acc)
    return in_range_

label = reduce(in_range(col("weekly_sale")), mapping, lit(None))

df.withColumn("label", label)
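Since the question also mentions Spark SQL and Hive, the same logic can be written as a CASE WHEN expression. A sketch assuming a Spark 2.x SparkSession named spark (on 1.x you would use df.registerTempTable and sqlContext.sql instead); the view name weekly_sales is arbitrary:

df.createOrReplaceTempView("weekly_sales")

spark.sql("""
    SELECT id, weekly_sale,
           CASE
               WHEN weekly_sale < 100000 THEN 'under 100000'
               WHEN weekly_sale < 200000 THEN 'between 100000 and 200000'
               WHEN weekly_sale < 300000 THEN 'between 200000 and 300000'
               ELSE 'more than 300000'
           END AS label
    FROM weekly_sales
""").show(10, False)

The same CASE WHEN works verbatim as a Hive query.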

The least efficient approach is a UDF.
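For completeness, such a UDF could look roughly like this (a sketch; in_range_label is a made-up helper name, and it assumes weekly_sale is never null). It reuses splits and labels from above, but every row has to round-trip through the Python interpreter, which is why it is slower than the expression-based variants:

from bisect import bisect_right
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

def in_range_label(sale):
    # bisect_right locates the bucket for `sale` within the splits defined above
    return labels[bisect_right(splits, sale) - 1]

in_range_label_udf = udf(in_range_label, StringType())

df.withColumn("label", in_range_label_udf(col("weekly_sale")))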

  • I am going to test it, thanks a lot. I was trying the when/otherwise function, but it seems that when only takes one condition; how can I give it several conditions? df_label = quantile_filtered.withColumn("label", func.when(quantile_filtered.total_sale < 100000, "under 100000").otherwise(0)) – chessosapiens Sep 23 '16 at 09:23
  • How do you evaluate this one: df_label = quantile_filtered.withColumn("label", func.when(quantile_filtered.total_sale < 3500, "under 3500").when((quantile_filtered.total_sale >= 3500) & (quantile_filtered.total_sale < 3800), "3500 to 3800").when((quantile_filtered.total_sale >= 3800) & (quantile_filtered.total_sale < 6000), "3800 to 6000").when(quantile_filtered.total_sale >= 6000, "above 6000")) – chessosapiens Sep 23 '16 at 10:44
  • No action as you see – chessosapiens Sep 25 '16 at 00:39
  • So there is nothing to evaluate. I am not sure if I understand what you want here. – zero323 Sep 25 '16 at 11:41
  • @zero323 Maybe I used the wrong word; what I meant was: how would you compare my solution with your solutions? – chessosapiens Sep 25 '16 at 20:46