
I am working with PySpark dataframes.

I have a list of date-type values:

date_list = ['2018-01-19', '2018-01-20', '2018-01-17']

I also have a dataframe (mean_df) with a single column (mean):

+----+
|mean|
+----+
|67  |
|78  |
|98  |
+----+

Now I want to turn date_list into a column and join it with mean_df.

Expected output:

+------------+----+
|dates       |mean|
+------------+----+
|2018-01-19  |  67|
|2018-01-20  |  78|
|2018-01-17  |  98|
+------------+----+

I tried converting the list to a dataframe (date_df):

date_df = spark.createDataFrame([(l,) for l in date_list], ['dates'])

and then added an index column "idx" to both date_df and mean_df with monotonically_increasing_id(), and joined on it:

from pyspark.sql.functions import monotonically_increasing_id

date_df = date_df.withColumn("idx", monotonically_increasing_id())
mean_df = mean_df.withColumn("idx", monotonically_increasing_id())
date_df = mean_df.join(date_df, mean_df.idx == date_df.idx).drop("idx")

I got a timeout-exceeded error, so I increased the default broadcastTimeout from 300s to 6000s:

spark.conf.set("spark.sql.broadcastTimeout", 6000)

But it did not help at all. Note that I am working with a really small sample of data right now; the actual data is much larger.
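
(For reference, another standard knob sometimes suggested for broadcast-join timeouts is to disable automatic broadcasting entirely; this is a stock Spark SQL setting, and whether it would help here is a separate question, since the join approach itself is the real problem:)

# Disables automatic broadcast joins; -1 means "never broadcast".
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)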

Snippet of code:

from functools import reduce
from pyspark.sql import DataFrame
from pyspark.sql.functions import mean as _mean, col

date_list = ['2018-01-19', '2018-01-20', '2018-01-17']
mean_list = []

for d in date_list:
    # hypo_2 and h2_df are defined elsewhere in my code
    h2_df1, h2_df2 = hypo_2(h2_df, d, 2)

    mean1 = h2_df1.select(_mean(col('count_before')).alias('mean_before'))

    mean_list.append(mean1)

mean_df = reduce(DataFrame.unionAll, mean_list)
  • Spark dataframes are unordered. How could you know which row in the mean_df dataframe corresponds to which date? – mck Jan 27 '21 at 13:57
  • @mck The mean values are calculated using a for loop, and that loop takes dates from date_list, so the three means correspond to the sequence of dates in the list. – Samiksha Jan 27 '21 at 14:00
  • The means were collected in a list first, then converted to a dataframe mean_df. I want the date and mean together in a dataframe for further analysis. – Samiksha Jan 27 '21 at 14:06
  • when you convert it to a dataframe, you're losing the order in the list, and so assigning the date becomes impossible. You need to include the date when you collect the mean into a list. Could you show the code for that? – mck Jan 27 '21 at 14:10
  • @mck sure, I just added the snippet in the post. – Samiksha Jan 27 '21 at 14:23
  • what is `_mean`? – mck Jan 27 '21 at 14:28
  • I read that it is a built-in function to get the mean of an entire column of a pyspark dataframe (https://stackoverflow.com/questions/47995188/how-to-calculate-mean-and-standard-deviation-given-a-pyspark-dataframe/47995478) – Samiksha Jan 27 '21 at 14:30
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/227899/discussion-between-sameeksha-sohal-and-mck). – Samiksha Jan 27 '21 at 14:31
  • I prefer using `F.mean` instead of `_mean`. See my answer below. – mck Jan 27 '21 at 14:34
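
(For reference, the _mean in the question is just an aliased import of the same function mck mentions; a minimal check, assuming the import style used in the linked answer:)

import pyspark.sql.functions as F
from pyspark.sql.functions import mean as _mean

assert _mean is F.mean  # same function, two spellings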

1 Answer


You can use withColumn and lit to add the date to the dataframe:

import pyspark.sql.functions as F
from functools import reduce
from pyspark.sql import DataFrame

date_list = ['2018-01-19', '2018-01-20', '2018-01-17']
mean_list = []

for d in date_list:
    h2_df1, h2_df2 = hypo_2(h2_df, d, 2)

    # stamp the loop's date onto the one-row aggregate with lit()
    mean1 = h2_df1.select(
        F.mean(F.col('count_before')).alias('mean_before')
    ).withColumn('date', F.lit(d))

    mean_list.append(mean1)

mean_df = reduce(DataFrame.unionAll, mean_list)
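
With the date stamped onto each one-row dataframe before the union, mean_df carries both columns; a quick sanity check (schema only, and assuming count_before is numeric; the values depend on your data):

mean_df.printSchema()
# root
#  |-- mean_before: double (nullable = true)
#  |-- date: string (nullable = false)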