
I want to pivot a Spark DataFrame. I referred to the PySpark documentation, and based on the pivot function, the approach seems to be .groupBy('name').pivot('name', values=None). Here's my dataset:

In [75]: spDF.show()
Out [75]:

+-----------+-----------+
|customer_id|       name|
+-----------+-----------+
|      25620| MCDonnalds|
|      25620|  STARBUCKS|
|      25620|        nan|
|      25620|        nan|
|      25620| MCDonnalds|
|      25620|        nan|
|      25620| MCDonnalds|
|      25620|DUNKINDONUT|
|      25620|   LOTTERIA|
|      25620|        nan|
|      25620| MCDonnalds|
|      25620|DUNKINDONUT|
|      25620|DUNKINDONUT|
|      25620|        nan|
|      25620|        nan|
|      25620|        nan|
|      25620|        nan|
|      25620|   LOTTERIA|
|      25620|   LOTTERIA|
|      25620|  STARBUCKS|
+-----------+-----------+
only showing top 20 rows

And then I try to pivot the table on name:

In [96]:
spDF.groupBy('name').pivot('name', values=None)
Out[96]:
<pyspark.sql.group.GroupedData at 0x7f0ad03750f0>

And when I try to show it:

In [98]:
spDF.groupBy('name').pivot('name', values=None).show()
Out [98]:

---------------------------------------------------------------------------
AttributeError                       Traceback (most recent call last)
<ipython-input-98-94354082e956> in <module>()
----> 1 spDF.groupBy('name').pivot('name', values=None).show()
AttributeError: 'GroupedData' object has no attribute 'show'

I don't know why a 'GroupedData' object can't be shown. What should I do to solve the issue?

Nabih Bawazir

2 Answers


The pivot() method returns a GroupedData object, just like groupBy(). You cannot call show() on a GroupedData object; you have to apply an aggregate function (such as sum() or even count()) to it first.
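For example, a minimal sketch that completes the call from the question with count() (assuming spDF is the DataFrame shown above):

# count() aggregates the pivoted GroupedData and returns a DataFrame,
# which does have a show() method
spDF.groupBy('name').pivot('name', values=None).count().show()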

See this article for more information

ech0
  • Thanks for the explanation, upvoted, still waiting for best answer – Nabih Bawazir Aug 13 '18 at 11:15
  • So what is the best answer that you are looking for? This is the only answer as to why you can't show it. The object that is being returned in fact does not have the attribute `show`, and that is why the error is raised. Think of it this way: let's say that you create a class and create a method for that class. Let's call this method `getName`. Now, try calling that method on a different type of object, for example, a string. It won't work, because the string doesn't have the method. That IS the only and best explanation. – ech0 Aug 13 '18 at 11:15
  • Or were you hoping for me to type the code that you can just copy-paste in and get it working? – ech0 Aug 13 '18 at 11:20
  • working code is the best answer, still reading the article, basically I only give best answer when my problem is completely solved – Nabih Bawazir Aug 13 '18 at 11:22
  • That is actually incorrect. The best answer points you in the right direction. The easiest answer will give you the code off the bat. Now that you know what the problem is (HINT: you have to use an aggregate function) you can learn by solving this and not ever get this problem again. If I give you the functioning code off the bat you have no idea what's going on and will keep on tripping over this, as this is quite a common thing. – ech0 Aug 13 '18 at 11:24
  • @cromaerror, I agree with your point, the best answer points you in the direction of writing the working code; a best answer based on a clue will be given if the questioner can apply the advice. Still trying the code from the article. – Nabih Bawazir Aug 13 '18 at 11:52
  • Could you please add an example? Thanks – Yamur Aug 15 '19 at 11:25
  • Do a .agg function on the grouped data and take what you want from that df. For example, .agg(max("userid")).select("userId").show(). – Leyth G Apr 30 '20 at 16:37
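A runnable sketch of that .agg pattern against the question's spDF (it has no userId column, so max('name') stands in here; the alias name is purely illustrative):

from pyspark.sql import functions as F

# agg() collapses the GroupedData into a DataFrame, which does support show()
spDF.groupBy('customer_id').agg(F.max('name').alias('max_name')).select('max_name').show()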

Let's create some test data that resembles your dataset:

data = [
    ("123", "McDonalds"),
    ("123", "Starbucks"),
    ("123", "McDonalds"),
    ("777", "McDonalds"),
    ("777", "McDonalds"),
    ("777", "Dunkin")
]
df = spark.createDataFrame(data, ["customer_id", "name"])
df.show()
+-----------+---------+
|customer_id|     name|
+-----------+---------+
|        123|McDonalds|
|        123|Starbucks|
|        123|McDonalds|
|        777|McDonalds|
|        777|McDonalds|
|        777|   Dunkin|
+-----------+---------+

Let's pivot the dataset so the customer_ids are columns:

df.groupBy("name").pivot("customer_id").count().show()

+---------+----+----+
|     name| 123| 777|
+---------+----+----+
|McDonalds|   2|   2|
|Starbucks|   1|null|
|   Dunkin|null|   1|
+---------+----+----+

Now let's pivot the DataFrame so the restaurant names are columns:

df.groupBy("customer_id").pivot("name").count().show()

+-----------+------+---------+---------+
|customer_id|Dunkin|McDonalds|Starbucks|
+-----------+------+---------+---------+
|        777|     1|        2|     null|
|        123|  null|        2|        1|
+-----------+------+---------+---------+

Code like df.groupBy("name").show() errors out with the AttributeError: 'GroupedData' object has no attribute 'show' message. You can only call methods defined in the pyspark.sql.GroupedData class on instances of GroupedData; show() is defined on DataFrame, so you need an aggregation like count() or agg() to get a DataFrame back first.
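For instance, count() is one of those GroupedData methods, and it returns a DataFrame that does support show():

# count() is defined on GroupedData and hands back a DataFrame
df.groupBy("name").count().show()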

Powers