Suppose I have a dataframe (df) (Pandas) or RDD (Spark) with the following two columns:

timestamp, data
12345.0    10 
12346.0    12

In Pandas, I can create a binned histogram of different bin lengths pretty easily. For example, to create a histogram over 1 hr, I do the following:

df = df[['timestamp', 'data']].set_index('timestamp')
df.resample('1H').sum().dropna()
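
For resample to work here the index has to be a DatetimeIndex, so with float-second timestamps like the ones above I first convert them (a minimal sketch, assuming the values are seconds since the epoch):

import pandas as pd

df.index = pd.to_datetime(df.index, unit='s')  # float seconds -> datetimes
df.resample('1H').sum().dropna()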

Moving to Pandas df from Spark RDD is pretty expensive for me (considering the dataset). Consequently, I prefer to stay within the Spark domain as much as possible.

Is there a way to do the equivalent in Spark RDD or dataframes?

Neel
  • Sadly, Spark RDDs and DataFrames do not have an index, and Spark does not provide low-level operations the way pandas does, so there is definitely no resampling of time series. – WoodChopper Dec 29 '15 at 06:53
  • There is a recent Cloudera Spark package on time series, which also has Python docs [here](http://cloudera.github.io/spark-timeseries/0.1.0/index.html). I don't know if it is what you are looking for, but it does say it offers pandas-like functionality for time series. – KrisP Dec 29 '15 at 14:47
  • @WoodChopper: What do you mean by 'do not have an index'? Are you referring to the 'set_index' functionality available in Pandas? – Neel Dec 29 '15 at 16:19

2 Answers

Spark >= 2.0

You can use the window function. Using the example df defined in the Spark < 2.0 section below:

from pyspark.sql.functions import window

(df
    .groupBy(window("timestamp", "3 minute").alias("ts"))
    .sum()
    .orderBy("ts")
    .show())
## +--------------------+---------+
## |                  ts|sum(data)|
## +--------------------+---------+
## |{2000-01-01 00:00...|        3|
## |{2000-01-01 00:03...|       12|
## |{2000-01-01 00:06...|       21|
## +--------------------+---------+

(df
    .groupBy(window("timestamp", "1 hour").alias("ts"))
    .sum()
    .orderBy("ts")
    .show())

## +--------------------+---------+
## |                  ts|sum(data)|
## +--------------------+---------+
## |{2000-01-01 00:00...|       36|
## +--------------------+---------+
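
The grouping column produced by window is a struct with start and end fields; if you prefer a plain timestamp column, you can select out the window start. A small sketch, assuming the same df (total is just an illustrative alias):

from pyspark.sql.functions import col, window

binned = (df
    .groupBy(window("timestamp", "3 minute"))
    .sum("data")
    .select(col("window.start").alias("ts"), col("sum(data)").alias("total"))
    .orderBy("ts"))
binned.show()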

Spark < 2.0

In this particular case all you need are Unix timestamps and basic arithmetic:

from pyspark.sql.functions import floor, timestamp_seconds, unix_timestamp

def resample_to_minute(c, interval=1):
    t = 60 * interval
    # For Spark < 3.1 
    # return (floor(c / t) * t).cast("timestamp")
    return timestamp_seconds(floor(c / t) * t)

def resample_to_hour(c, interval=1):
    return resample_to_minute(c, 60 * interval)

df = sc.parallelize([
    ("2000-01-01 00:00:00", 0), ("2000-01-01 00:01:00", 1),
    ("2000-01-01 00:02:00", 2), ("2000-01-01 00:03:00", 3),
    ("2000-01-01 00:04:00", 4), ("2000-01-01 00:05:00", 5),
    ("2000-01-01 00:06:00", 6), ("2000-01-01 00:07:00", 7),
    ("2000-01-01 00:08:00", 8)
]).toDF(["timestamp", "data"])

(df.groupBy(resample_to_minute(unix_timestamp("timestamp"), 3).alias("ts"))
    .sum().orderBy("ts").show(3, False))

## +---------------------+---------+
## |ts                   |sum(data)|
## +---------------------+---------+
## |2000-01-01 00:00:00.0|3        |
## |2000-01-01 00:03:00.0|12       |
## |2000-01-01 00:06:00.0|21       |
## +---------------------+---------+

(df.groupBy(resample_to_hour(unix_timestamp("timestamp")).alias("ts"))
    .sum().orderBy("ts").show(3, False))
## +---------------------+---------+
## |ts                   |sum(data)|
## +---------------------+---------+
## |2000-01-01 00:00:00.0|36       |
## +---------------------+---------+

Example data from pandas.DataFrame.resample documentation.

For the general case, see Making histogram with Spark DataFrame column.
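
For a plain value histogram over a numeric column (rather than time binning), one option is the RDD histogram method. A minimal sketch, assuming the same df and five equal-width bins:

# Returns the bin boundaries and the count of values in each bin
buckets, counts = (df
    .select("data")
    .rdd
    .map(lambda row: row[0])
    .histogram(5))
print(buckets)
print(counts)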

zero323

Here is an answer using RDDs rather than DataFrames:

# Generating some data to test with
import random
from datetime import datetime

startTS = 12345.0
array = [(startTS + 60 * k, random.randrange(10, 20)) for k in range(150)]

# Initializing an RDD
rdd = sc.parallelize(array)

# First map the timestamps to datetime objects so that datetime.replace
# can be used to round the times
formattedRDD = (rdd
                .map(lambda pair: (datetime.fromtimestamp(int(pair[0])), pair[1]))
                .cache())

# Zeroing the minute and second fields of the datetime objects is
# exactly like rounding down to the hour. reduceByKey then aggregates the bins
# (each record counts as 1 here; use pair[1] instead of 1 to sum the data values).
hourlyRDD = (formattedRDD
             .map(lambda pair: (pair[0].replace(minute=0, second=0), 1))
             .reduceByKey(lambda a, b: a + b))

hourlyHisto = hourlyRDD.collect()
print(hourlyHisto)
> [(datetime.datetime(1970, 1, 1, 4, 0), 60), (datetime.datetime(1970, 1, 1, 5, 0), 55), (datetime.datetime(1970, 1, 1, 3, 0), 35)]

To do daily aggregates you can use time.date() instead of time.replace(...). Also, to bin per hour starting from a non-round datetime, you can shift the original time by its delta to the nearest round hour.
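
A minimal sketch of that daily variant, assuming the formattedRDD defined above:

# Same approach as before, but keyed on the calendar date (this counts records
# per day; use pair[1] instead of 1 to sum the data values)
dailyRDD = (formattedRDD
            .map(lambda pair: (pair[0].date(), 1))
            .reduceByKey(lambda a, b: a + b))
print(dailyRDD.collect())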

Ben. B.