-1

I am learning pySpark. I am trying to aggregate some data. Below is the code I tried.

from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.appName("learning").master("local[*]").getOrCreate()

path = "deliveries.csv"

text_df = spark.read.csv(path,sep=",", header=True)

temp_df = text_df.withColumn("runs", text_df["batsman_runs"].cast(IntegerType()))
temp_df.show()
temp_df.cache()

print(temp_df.describe())
print(temp_df.dtypes)

temp_df.groupby('batsman').agg(sum('runs')).show()

Below is the data from file with added column 'runs'.

+--------+------+-------------------+--------------------+----+----+------------+------------+-----------+-------------+---------+--------+-----------+-----------+------------+------------+----------+----------+----------------+--------------+-------------+----+
|match_id|inning|       batting_team|        bowling_team|over|ball|     batsman| non_striker|     bowler|is_super_over|wide_runs|bye_runs|legbye_runs|noball_runs|penalty_runs|batsman_runs|extra_runs|total_runs|player_dismissed|dismissal_kind|      fielder|runs|
+--------+------+-------------------+--------------------+----+----+------------+------------+-----------+-------------+---------+--------+-----------+-----------+------------+------------+----------+----------+----------------+--------------+-------------+----+
|       1|     1|Sunrisers Hyderabad|Royal Challengers...|   1|   1|   DA Warner|    S Dhawan|   TS Mills|            0|        0|       0|          0|          0|           0|           0|         0|         0|            null|          null|         null|   0|
|       1|     1|Sunrisers Hyderabad|Royal Challengers...|   1|   2|   DA Warner|    S Dhawan|   TS Mills|            0|        0|       0|          0|          0|           0|           0|         0|         0|            null|          null|         null|   0|
|       1|     1|Sunrisers Hyderabad|Royal Challengers...|   1|   3|   DA Warner|    S Dhawan|   TS Mills|            0|        0|       0|          0|          0|           0|           4|         0|         4|            null|          null|         null|   4|
|       1|     1|Sunrisers Hyderabad|Royal Challengers...|   1|   4|   DA Warner|    S Dhawan|   TS Mills|            0|        0|       0|          0|          0|           0|           0|         0|         0|            null|          null|         null|   0|
|       1|     1|Sunrisers Hyderabad|Royal Challengers...|   1|   5|   DA Warner|    S Dhawan|   TS Mills|            0|        2|       0|          0|          0|           0|           0|         2|         2|            null|          null|         null|   0|
|       1|     1|Sunrisers Hyderabad|Royal Challengers...|   1|   6|    S Dhawan|   DA Warner|   TS Mills|            0|        0|       0|          0|          0|           0|           0|         0|         0|            null|          null|         null|   0|
|       1|     1|Sunrisers Hyderabad|Royal Challengers...|   1|   7|    S Dhawan|   DA Warner|   TS Mills|            0|        0|       0|          1|          0|           0|           0|         1|         1|            null|          null|         null|   0|
|       1|     1|Sunrisers Hyderabad|Royal Challengers...|   2|   1|    S Dhawan|   DA Warner|A Choudhary|            0|        0|       0|          0|          0|           0|           1|         0|         1|            null|          null|         null|   1|
|       1|     1|Sunrisers Hyderabad|Royal Challengers...|   2|   2|   DA Warner|    S Dhawan|A Choudhary|            0|        0|       0|          0|          0|           0|           4|         0|         4|            null|          null|         null|   4|
|       1|     1|Sunrisers Hyderabad|Royal Challengers...|   2|   3|   DA Warner|    S Dhawan|A Choudhary|            0|        0|       0|          0|          1|           0|           0|         1|         1|            null|          null|         null|   0|
|       1|     1|Sunrisers Hyderabad|Royal Challengers...|   2|   4|   DA Warner|    S Dhawan|A Choudhary|            0|        0|       0|          0|          0|           0|           6|         0|         6|            null|          null|         null|   6|
|       1|     1|Sunrisers Hyderabad|Royal Challengers...|   2|   5|   DA Warner|    S Dhawan|A Choudhary|            0|        0|       0|          0|          0|           0|           0|         0|         0|       DA Warner|        caught|Mandeep Singh|   0|
|       1|     1|Sunrisers Hyderabad|Royal Challengers...|   2|   6|MC Henriques|    S Dhawan|A Choudhary|            0|        0|       0|          0|          0|           0|           0|         0|         0|            null|          null|         null|   0|
|       1|     1|Sunrisers Hyderabad|Royal Challengers...|   2|   7|MC Henriques|    S Dhawan|A Choudhary|            0|        0|       0|          0|          0|           0|           4|         0|         4|            null|          null|         null|   4|
|       1|     1|Sunrisers Hyderabad|Royal Challengers...|   3|   1|    S Dhawan|MC Henriques|   TS Mills|            0|        0|       0|          0|          0|           0|           1|         0|         1|            null|          null|         null|   1|
|       1|     1|Sunrisers Hyderabad|Royal Challengers...|   3|   2|MC Henriques|    S Dhawan|   TS Mills|            0|        0|       0|          0|          0|           0|           0|         0|         0|            null|          null|         null|   0|
|       1|     1|Sunrisers Hyderabad|Royal Challengers...|   3|   3|MC Henriques|    S Dhawan|   TS Mills|            0|        0|       0|          0|          0|           0|           0|         0|         0|            null|          null|         null|   0|
|       1|     1|Sunrisers Hyderabad|Royal Challengers...|   3|   4|MC Henriques|    S Dhawan|   TS Mills|            0|        0|       0|          0|          0|           0|           3|         0|         3|            null|          null|         null|   3|
|       1|     1|Sunrisers Hyderabad|Royal Challengers...|   3|   5|    S Dhawan|MC Henriques|   TS Mills|            0|        0|       0|          0|          0|           0|           1|         0|         1|            null|          null|         null|   1|
|       1|     1|Sunrisers Hyderabad|Royal Challengers...|   3|   6|MC Henriques|    S Dhawan|   TS Mills|            0|        0|       0|          0|          0|           0|           1|         0|         1|            null|          null|         null|   1|
+--------+------+-------------------+--------------------+----+----+------------+------------+-----------+-------------+---------+--------+-----------+-----------+------------+------------+----------+----------+----------------+--------------+-------------+----+

I tried to get the sum of runs grouping by batsman. But, I got below error.

Traceback (most recent call last):
  File "ipl.py", line 19, in <module>
    temp_df.groupby('batsman').agg(sum('runs')).show()
TypeError: unsupported operand type(s) for +: 'int' and 'str'

As it showing like datatype conversion from string to Int in column runs, I checked the dataframe columns(describe and dtypes).

Both showing different datatypes. Notice last column.

print(temp_df.describe())

DataFrame[summary: string, match_id: string, inning: string, batting_team: string, bowling_team: string, over: string, ball: string, batsman: string, non_striker: string, bowler: string, is_super_over: string, wide_runs: string, bye_runs: string, legbye_runs: string, noball_runs: string, penalty_runs: string, batsman_runs: string, extra_runs: string, total_runs: string, player_dismissed: string, dismissal_kind: string, fielder: string, runs: string]

print(temp_df.dtypes)

[('match_id', 'string'), ('inning', 'string'), ('batting_team', 'string'), ('bowling_team', 'string'), ('over', 'string'), ('ball', 'string'), ('batsman', 'string'), ('non_striker', 'string'), ('bowler', 'string'), ('is_super_over', 'string'), ('wide_runs', 'string'), ('bye_runs', 'string'), ('legbye_runs', 'string'), ('noball_runs', 'string'), ('penalty_runs', 'string'), ('batsman_runs', 'string'), ('extra_runs', 'string'), ('total_runs', 'string'), ('player_dismissed', 'string'), ('dismissal_kind', 'string'), ('fielder', 'string'), ('runs', 'int')]

Why the datatypes are not converting after cast? Why describe and dtypes showing different?

Praveen L
  • 937
  • 6
  • 13
  • how about using InferSchema true option, if not then attached your file as a downloadable form – Mahesh Gupta Dec 06 '19 at 11:33
  • @Mahesh Gupta I tried your suggession, but, still the problem is same. In Dtypes, datatypes converted as `int`. But still the aggregation failing with same error. – Praveen L Dec 06 '19 at 11:42
  • Possible duplicate of [Pyspark 'NoneType' object has no attribute '\_jvm' error](https://stackoverflow.com/questions/49481363/pyspark-nonetype-object-has-no-attribute-jvm-error) – pault Dec 09 '19 at 22:35

1 Answers1

0

The reason for error is you have not imported sum function. Import sum function and try, above error will not come.

from pyspark.sql.functions import sum
temp_df.groupby('batsman').agg(sum('runs')).show()
Karthik
  • 1,143
  • 7
  • 12
  • It is working now without even `cast`. Is it auto converting from string to Int? – Praveen L Dec 06 '19 at 12:28
  • 2
    I would caution that if you do `from pyspark.sql.functions import sum`, you're shadowing the builtin `sum`. See [this related post](https://stackoverflow.com/q/49481363/5858851) for what can happen if you do this. – pault Dec 06 '19 at 17:31
  • 1
    Thanks @pault. This clearly explains the context of the question and answer. The `sum` function in python is used in execution which caused the error. After the import, `sum` in spark functions was used and it resolves the issue. – Praveen L Dec 09 '19 at 07:09
  • This solution solves the issue. But, why describe() and dtypes on the dataframe showing different datatypes? – Praveen L Dec 10 '19 at 06:51