I am working with PySpark DataFrames.
I have a list of dates (as strings):
date_list = ['2018-01-19', '2018-01-20', '2018-01-17']
I also have a DataFrame (mean_df) that has a single column (mean):
+----+
|mean|
+----+
|  67|
|  78|
|  98|
+----+
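For a minimal reproduction, mean_df can be recreated like this (illustrative only; in my real code it is built by the snippet at the end of this post):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# one-column DataFrame with the values shown above
mean_df = spark.createDataFrame([(67,), (78,), (98,)], ['mean'])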
Now I want to turn date_list into a column and join it with mean_df.
Expected output:
+----------+----+
|     dates|mean|
+----------+----+
|2018-01-19|  67|
|2018-01-20|  78|
|2018-01-17|  98|
+----------+----+
I tried converting the list to a DataFrame (date_df):
date_df = spark.createDataFrame([(l,) for l in date_list], ['dates'])
and then added a new column idx to both date_df and mean_df using monotonically_increasing_id(), and joined on it:
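Roughly, the idx step looked like this (a sketch reconstructed from my description above):
from pyspark.sql.functions import monotonically_increasing_id

# add a surrogate row id to each DataFrame so they can be joined positionally
date_df = date_df.withColumn("idx", monotonically_increasing_id())
mean_df = mean_df.withColumn("idx", monotonically_increasing_id())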
date_df = mean_df.join(date_df, mean_df.idx == date_df.idx).drop("idx")
This gave me a timeout-exceeded error, so I increased the default broadcastTimeout from 300s to 6000s:
spark.conf.set("spark.sql.broadcastTimeout", 6000)
But it did not help at all. Note that I am currently working with a really small sample of data; the actual data is much larger.
Snippet of the code that produces mean_df:
from functools import reduce
from pyspark.sql import DataFrame
from pyspark.sql.functions import col, mean as _mean

date_list = ['2018-01-19', '2018-01-20', '2018-01-17']
mean_list = []
for d in date_list:
    h2_df1, h2_df2 = hypo_2(h2_df, d, 2)  # hypo_2 is a helper defined elsewhere
    # one-row DataFrame holding the mean of count_before for this date
    mean1 = h2_df1.select(_mean(col('count_before')).alias('mean_before'))
    mean_list.append(mean1)
# stack the one-row DataFrames into mean_df
mean_df = reduce(DataFrame.unionAll, mean_list)