
I have a PySpark DataFrame, shown in the picture below:

[image: input DataFrame]

It has four columns: year, word, count, and frequency. The year ranges from 2000 to 2015.

I would like to transform this (PySpark) DataFrame so that the result has the format shown in the following picture:

[image: desired output DataFrame]

The columns of the new DataFrame should be: word, frequency_2000, frequency_2001, frequency_2002, ..., frequency_2015.

The frequency of each word in each year should come from the original DataFrame.

Any advice on how I could write efficient code for this?

Also, please rename the title if you can come up with something more informative.

XYZ

2 Answers


After some research, I found a solution:

[image: solution]
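One common way to get this shape in PySpark is `groupBy` followed by `pivot`. The sketch below is an assumption about the approach, not necessarily the solution in the picture. It assumes the columns are named `word`, `year` (integer typed) and `frequency`, and that each (word, year) pair appears at most once. The names `YEARS`, `frequency_column_names` and `pivot_frequencies` are hypothetical helpers introduced for illustration.

```python
# Hypothetical helpers; assumes columns "word", "year" (integer) and "frequency".
YEARS = list(range(2000, 2016))  # 2000..2015, the range given in the question


def frequency_column_names(years=YEARS):
    """Return the target column names, e.g. 'frequency_2000'."""
    return [f"frequency_{y}" for y in years]


def pivot_frequencies(df, years=YEARS):
    """Reshape a long (year, word, count, frequency) DataFrame to one row per word."""
    # Imported here so this module still loads where PySpark is not installed.
    from pyspark.sql import functions as F

    # Passing the list of years explicitly spares Spark an extra job
    # to discover the distinct pivot values.
    wide = df.groupBy("word").pivot("year", years).agg(F.first("frequency"))

    # With a single aggregation, pivot names each column after the year value,
    # so rename "2000" -> "frequency_2000", and so on.
    for year, name in zip(years, frequency_column_names(years)):
        wide = wide.withColumnRenamed(str(year), name)
    return wide
```

Usage would look like `wide = pivot_frequencies(df)`. A word with no row for a given year gets a null in that year's column.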

XYZ

The crosstab function can now produce this output directly:

topw_ys.crosstab("word", "year").toPandas()

Results:

   word_year  2000  2015
0     mining    10     6
1     system    11    12
...
mountrix