Suppose I have a dataframe (df) (Pandas) or RDD (Spark) with the following two columns:

timestamp, data
12345.0    10 
12346.0    12

In Pandas, I can create a binned histogram of different bin lengths pretty easily. For example, to create a histogram over 1 hr, I do the following:

df = df[['timestamp', 'data']].set_index('timestamp')
df.resample('1H').sum().dropna()
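
For resample to work here the index has to be a DatetimeIndex, so with float-second timestamps like the ones above I first convert them (a minimal sketch, assuming the values are seconds since the epoch):

import pandas as pd

df.index = pd.to_datetime(df.index, unit='s')  # float seconds -> datetimes
df.resample('1H').sum().dropna()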

Moving to Pandas df from Spark RDD is pretty expensive for me (considering the dataset). Consequently, I prefer to stay within the Spark domain as much as possible.

Is there a way to do the equivalent in Spark RDD or dataframes?

Neel
  • Sadly, Spark RDDs and DataFrames do not have an index, and Spark does not provide low-level operations the way pandas does, so there is definitely no resampling of time series. – WoodChopper Dec 29 '15 at 06:53
  • There is a recent Cloudera Spark package on time series, which also has Python docs [here](http://cloudera.github.io/spark-timeseries/0.1.0/index.html). I don't know if it is what you are looking for, but it does say it offers pandas-like functionality for time series. – KrisP Dec 29 '15 at 14:47
  • @WoodChopper: What do you mean by 'do not have an index'? Are you referring to the 'set_index' functionality available in Pandas? – Neel Dec 29 '15 at 16:19

2 Answers

Spark >= 2.0

You can use the window function. Using the example df defined in the Spark < 2.0 section below:

from pyspark.sql.functions import window

(df
    .groupBy(window("timestamp", "3 minute").alias("ts"))
    .sum()
    .orderBy("ts")
    .show())
## +--------------------+---------+
## |                  ts|sum(data)|
## +--------------------+---------+
## |{2000-01-01 00:00...|        3|
## |{2000-01-01 00:03...|       12|
## |{2000-01-01 00:06...|       21|
## +--------------------+---------+

(df
    .groupBy(window("timestamp", "1 hour").alias("ts"))
    .sum()
    .orderBy("ts")
    .show())

## +--------------------+---------+
## |                  ts|sum(data)|
## +--------------------+---------+
## |{2000-01-01 00:00...|       36|
## +--------------------+---------+
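
The grouping column produced by window is a struct with start and end fields; if you prefer a plain timestamp column, you can select out the window start. A small sketch, assuming the same df (total is just an illustrative alias):

from pyspark.sql.functions import col, window

binned = (df
    .groupBy(window("timestamp", "3 minute"))
    .sum("data")
    .select(col("window.start").alias("ts"), col("sum(data)").alias("total"))
    .orderBy("ts"))
binned.show()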

Spark < 2.0

In this particular case all you need are Unix timestamps and basic arithmetic:

from pyspark.sql.functions import floor, timestamp_seconds, unix_timestamp

def resample_to_minute(c, interval=1):
    t = 60 * interval
    # For Spark < 3.1 
    # return (floor(c / t) * t).cast("timestamp")
    return timestamp_seconds(floor(c / t) * t)

def resample_to_hour(c, interval=1):
    return resample_to_minute(c, 60 * interval)

df = sc.parallelize([
    ("2000-01-01 00:00:00", 0), ("2000-01-01 00:01:00", 1),
    ("2000-01-01 00:02:00", 2), ("2000-01-01 00:03:00", 3),
    ("2000-01-01 00:04:00", 4), ("2000-01-01 00:05:00", 5),
    ("2000-01-01 00:06:00", 6), ("2000-01-01 00:07:00", 7),
    ("2000-01-01 00:08:00", 8)
]).toDF(["timestamp", "data"])

(df.groupBy(resample_to_minute(unix_timestamp("timestamp"), 3).alias("ts"))
    .sum().orderBy("ts").show(3, False))

## +---------------------+---------+
## |ts                   |sum(data)|
## +---------------------+---------+
## |2000-01-01 00:00:00.0|3        |
## |2000-01-01 00:03:00.0|12       |
## |2000-01-01 00:06:00.0|21       |
## +---------------------+---------+

(df.groupBy(resample_to_hour(unix_timestamp("timestamp")).alias("ts"))
    .sum().orderBy("ts").show(3, False))
## +---------------------+---------+
## |ts                   |sum(data)|
## +---------------------+---------+
## |2000-01-01 00:00:00.0|36       |
## +---------------------+---------+

Example data from pandas.DataFrame.resample documentation.

For the general case, see Making histogram with Spark DataFrame column.
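
For a plain value histogram over a numeric column (rather than time binning), one option is the RDD histogram method. A minimal sketch, assuming the same df and five equal-width bins:

# Returns the bin boundaries and the count of values in each bin
buckets, counts = (df
    .select("data")
    .rdd
    .map(lambda row: row[0])
    .histogram(5))
print(buckets)
print(counts)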

zero323

Here is an answer using RDDs rather than DataFrames:

# Generating some data to test with
import random
from datetime import datetime

startTS = 12345.0
array = [(startTS + 60 * k, random.randrange(10, 20)) for k in range(150)]

# Initializing an RDD
rdd = sc.parallelize(array)

# First map the timestamps to datetime objects so that datetime.replace
# can be used to round the times
formattedRDD = (rdd
                .map(lambda pair: (datetime.fromtimestamp(int(pair[0])), pair[1]))
                .cache())

# Zeroing the minute and second fields of the datetime objects is
# exactly like rounding down to the hour. reduceByKey then aggregates the bins
# (each record counts as 1 here; use pair[1] instead of 1 to sum the data values).
hourlyRDD = (formattedRDD
             .map(lambda pair: (pair[0].replace(minute=0, second=0), 1))
             .reduceByKey(lambda a, b: a + b))

hourlyHisto = hourlyRDD.collect()
print(hourlyHisto)
> [(datetime.datetime(1970, 1, 1, 4, 0), 60), (datetime.datetime(1970, 1, 1, 5, 0), 55), (datetime.datetime(1970, 1, 1, 3, 0), 35)]

To do daily aggregates you can use time.date() instead of time.replace(...). Also, to bin per hour starting from a non-round datetime, you can shift the original time by its delta to the nearest round hour.
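
A minimal sketch of that daily variant, assuming the formattedRDD defined above:

# Same approach as before, but keyed on the calendar date (this counts records
# per day; use pair[1] instead of 1 to sum the data values)
dailyRDD = (formattedRDD
            .map(lambda pair: (pair[0].date(), 1))
            .reduceByKey(lambda a, b: a + b))
print(dailyRDD.collect())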

Ben. B.