I'm just getting started with PySpark, and I have a question (maybe it's too easy, but I can't see the answer). I have a dataframe of animal species with the columns 'category', 'name', and 'status', and I'm using this command to get some information about the 'category' column:
df.groupBy('category').count().show()
yielding:
+-----------------+-----+
| category|count|
+-----------------+-----+
| Vascular Plant| 4470|
| Bird| 521|
| Mammal| 214|
| Amphibian| 80|
|Nonvascular Plant| 333|
| Fish| 127|
| Reptile| 79|
+-----------------+-----+
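As a side note, I found that collecting instead of showing gives me the same counts as a plain Python dict, if that helps frame the comparison:

counts = {r['category']: r['count'] for r in df.groupBy('category').count().collect()}
# e.g. {'Vascular Plant': 4470, 'Bird': 521, ...}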
Then I used this line:
df.select('category').rdd.countByValue()
and got this:
defaultdict(int,
{Row(category='Bird'): 521,
Row(category='Reptile'): 79,
Row(category='Fish'): 127,
Row(category='Vascular Plant'): 4470,
Row(category='Nonvascular Plant'): 333,
Row(category='Amphibian'): 80,
Row(category='Mammal'): 214})
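I also noticed that if I map each Row to its single value before counting, the keys become plain strings instead of Row objects. Here is a minimal runnable sketch, with made-up toy data standing in for my real dataframe:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# toy data standing in for my real species dataframe
df = spark.createDataFrame(
    [('Bird', 'Sparrow', 'LC'),
     ('Bird', 'Eagle', 'LC'),
     ('Mammal', 'Wolf', 'LC')],
    ['category', 'name', 'status'])

# mapping each Row to its value first gives plain string keys
print(df.select('category').rdd.map(lambda r: r[0]).countByValue())
# defaultdict(<class 'int'>, {'Bird': 2, 'Mammal': 1})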
So my question is: since both give the same counts, what does the .rdd part add to the code?
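The only clue I've found so far is that the types differ before and after .rdd (continuing with the toy df from the sketch above):

print(type(df.select('category')))      # <class 'pyspark.sql.dataframe.DataFrame'>
print(type(df.select('category').rdd))  # <class 'pyspark.rdd.RDD'>

# without .rdd there is nothing to call, since countByValue is an RDD
# action rather than a DataFrame method:
df.select('category').countByValue()    # raises AttributeError

Is that all .rdd is doing here, or is there more to it?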