Whait does Pyspark rdd command do?

Question

I'm just getting started with pyspark and I have a question (maybe it's too easy but I can't see it) I have a dataframe of animal species with columns: 'category', 'name' and 'status' and I'm using this command to obtain some info on the category columns:

df.groupBy('category').count().show()

yielding:

+-----------------+-----+
|         category|count|
+-----------------+-----+
|   Vascular Plant| 4470|
|             Bird|  521|
|           Mammal|  214|
|        Amphibian|   80|
|Nonvascular Plant|  333|
|             Fish|  127|
|          Reptile|   79|
+-----------------+-----+

then I used this line:

df.select('category').rdd.countByValue()

and got this:

defaultdict(int,
        {Row(category='Bird'): 521,
         Row(category='Reptile'): 79,
         Row(category='Fish'): 127,
         Row(category='Vascular Plant'): 4470,
         Row(category='Nonvascular Plant'): 333,
         Row(category='Amphibian'): 80,
         Row(category='Mammal'): 214})

So my question is: what does the 'rdd' part add to the code?

score 1 · Accepted Answer · answered Dec 02 '20 at 05:08

1

RDD is the logical representation of dataset in spark. It is stored across multiple machines could be servers too in case of a cluster. These are immutable and can be recovered in case of a failure.

A dataset is a data externally loaded by the user. Could come from any source be it a database a simple text file.

Please refer the following link:

Spark Notes

answered Dec 02 '20 at 05:08

Huzefa Sadikot

561
1
7
22

ok, so by adding 'rdd' I'm 'saving' the resulting set of information in the multiple machine system that I'm working with. am I correct? – lealvcon Dec 02 '20 at 05:15
Reference of that values – Huzefa Sadikot Dec 02 '20 at 05:16
https://stackoverflow.com/questions/34433027/what-is-rdd-in-spark – Huzefa Sadikot Dec 02 '20 at 05:17

score 0 · Answer 2 · answered Dec 02 '20 at 05:06

0

I believe you're converting the spark dataframe into an rdd object by invoking the .rdd method. This is why you get a defaultdict back (a subclass of dict) instead of a table.

See this SO post for more details on the function doing the converting.

answered Dec 02 '20 at 05:06

willwrighteng

1,411
11
25

Whait does Pyspark rdd command do?

2 Answers2