I ran into this exact error trying to use prophet on an AWS EMR Spark cluster (through a Jupyter notebook interface). After much troubleshooting, we realized the cause: Spark expects a particular data format back (I believe a JSON with particular fields), but prophet returns a pandas dataframe.
I fixed this issue by writing a user-defined function (udf) in pyspark that allows me to use prophet on a Spark data frame and specify what data will be returned from this Spark function.
I based my own solution on the pandas_udf functions for prophet on Spark in this example and this example.
Below is a generalized version of the function I wrote. For context, I was fitting a time series model to detect outliers in my data, which is why I fit and then predict on the same data. You'll also need to make sure pyarrow is installed so that Spark can handle the pandas_udf properly:
# Import relevant packages
import pyspark.sql.functions as F
import pyspark.sql.types as types
import prophet
# Define output schema of prophet model
output_schema = types.StructType([
    types.StructField('id', types.IntegerType(), True), #args: name (string), data type, nullable (boolean)
    types.StructField('ds', types.TimestampType(), True),
    types.StructField('yhat', types.DoubleType(), True),
    types.StructField('yhat_lower', types.DoubleType(), True),
    types.StructField('yhat_upper', types.DoubleType(), True)
])
# Function to fit Prophet timeseries model
@F.pandas_udf(output_schema, F.PandasUDFType.GROUPED_MAP)
def fit_prophet_model(df):
    """
    :param df: pandas dataframe holding the rows of one group (one 'id') that we want to model.
    :return: pandas dataframe following the output_schema; Spark combines the per-group results.
    """
    # Prep the dataframe for use in Prophet
    formatted_df = df[['timestamp', 'value_of_interest']] \
        .rename(columns = {'timestamp': 'ds', 'value_of_interest': 'y'}) \
        .sort_values(by = ['ds'])
    # Instantiate model
    model = prophet.Prophet(interval_width = 0.99,
                            growth = 'linear',
                            daily_seasonality = True,
                            weekly_seasonality = True,
                            yearly_seasonality = True,
                            seasonality_mode = 'multiplicative')
    # Fit model and get fitted values
    model.fit(formatted_df)
    model_results = model.predict(formatted_df)[['ds', 'yhat', 'yhat_lower', 'yhat_upper']] \
        .sort_values(by = ['ds'])
    # formatted_df has no 'id' column, so take the id from the input group (all rows share it)
    model_results['id'] = df['id'].iloc[0] #add grouping id
    model_results = model_results[['id', 'ds', 'yhat', 'yhat_lower', 'yhat_upper']] #get columns in correct order
    return model_results
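Since a missing pyarrow only shows up once the UDF actually runs, a quick import check beforehand can save a failed job. This is a minimal sketch using only the standard library; missing_packages is a hypothetical helper name, and it only checks the Python environment it runs in (on a cluster the worker nodes need the packages too):

```python
import importlib.util

def missing_packages(names):
    """Return the names that cannot be imported in this Python environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Packages the grouped pandas UDF above relies on
missing = missing_packages(["pyarrow", "pandas", "prophet"])
if missing:
    print("Install before running the UDF:", ", ".join(missing))
```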
Then, to run fit_prophet_model on your data, group by id and apply it:
results = (my_data.groupBy('id')
           .apply(fit_prophet_model))
results.show(10) #show first ten rows of the fitted model results
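One note if you are on Spark 3.0 or later: the GROUPED_MAP pandas_udf type with groupBy().apply() is deprecated there in favor of applyInPandas, which takes a plain Python function plus the output schema. Below is a rough sketch of the same idea in that style, not a drop-in replacement. It assumes the same placeholder column names as above (id, timestamp, value_of_interest) and uses a trimmed-down Prophet configuration; fit_group and forecast_per_group are names made up for illustration.

```python
# Output schema as a DDL string; applyInPandas also accepts a StructType.
OUTPUT_COLUMNS = ["id", "ds", "yhat", "yhat_lower", "yhat_upper"]
OUTPUT_SCHEMA = "id int, ds timestamp, yhat double, yhat_lower double, yhat_upper double"

def fit_group(pdf):
    """Fit Prophet on one group's pandas DataFrame and return its fitted values."""
    from prophet import Prophet  # imported inside the function so it resolves on the workers

    # Assumed input columns: 'id', 'timestamp', 'value_of_interest'
    history = (
        pdf[["timestamp", "value_of_interest"]]
        .rename(columns={"timestamp": "ds", "value_of_interest": "y"})
        .sort_values("ds")
    )
    model = Prophet(interval_width=0.99)
    model.fit(history)
    fitted = model.predict(history)[["ds", "yhat", "yhat_lower", "yhat_upper"]]
    # Every row in a group shares one id, so broadcast the first value
    fitted = fitted.assign(id=pdf["id"].iloc[0])
    return fitted[OUTPUT_COLUMNS]

def forecast_per_group(sdf):
    """Apply fit_group to each 'id' group of a Spark DataFrame."""
    return sdf.groupBy("id").applyInPandas(fit_group, schema=OUTPUT_SCHEMA)
```

Usage would then be results = forecast_per_group(my_data), followed by results.show(10) as before.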