PySpark DataFrame: crosstab or another method to turn row labels into new columns

Question

I have a PySpark DataFrame like the one in the picture:

That is, I have four columns: year, word, count, frequency. The years run from 2000 to 2015.

I would like to transform the (PySpark) DataFrame so that the result has the format shown in the following picture:

The new DataFrame's columns should be: word, frequency_2000, frequency_2001, frequency_2002, ..., frequency_2015.

The frequency of each word in each year comes from the original DataFrame.

Any advice on how I could write efficient code for this?

Also, please rename the title if you can come up with something more informative.

score 8 · Answer 1 · answered Dec 10 '18 at 22:25

After some research, I found a solution:

XYZ

to facilitate the copy/paste : topw_yes.groupBy("word").pivot("year").agg(first("count")) – Arnaud Hureaux Feb 28 '22 at 13:17

score 0 · Answer 2 · answered Feb 03 '23 at 21:51

The crosstab function can now produce this output directly:

topw_ys.crosstab("word", "year").toPandas()

Results:

 word_year  2000    2015
 0  mining  10      6
 1  system  11      12
 ...

mountrix
