
I am facing a problem with a PySpark DataFrame loaded from a CSV file, where my numeric columns have empty values, like below:

+-------------+------------+-----------+-----------+
|  Player_Name|Test_Matches|ODI_Matches|T20_Matches|
+-------------+------------+-----------+-----------+
|   Aaron, V R|           9|          9|           |
|  Abid Ali, S|          29|          5|           |
|Adhikari, H R|          21|           |           |
| Agarkar, A B|          26|        191|          4|
+-------------+------------+-----------+-----------+

I cast those columns to integer, and all the empty values became null:

from pyspark.sql.types import IntegerType

df_data_csv_casted = df_data_csv.select(
    df_data_csv['Country'],
    df_data_csv['Player_Name'],
    df_data_csv['Test_Matches'].cast(IntegerType()).alias('Test_Matches'),
    df_data_csv['ODI_Matches'].cast(IntegerType()).alias('ODI_Matches'),
    df_data_csv['T20_Matches'].cast(IntegerType()).alias('T20_Matches'))


+-------------+------------+-----------+-----------+
|  Player_Name|Test_Matches|ODI_Matches|T20_Matches|
+-------------+------------+-----------+-----------+
|   Aaron, V R|           9|          9|       null|
|  Abid Ali, S|          29|          5|       null|
|Adhikari, H R|          21|       null|       null|
| Agarkar, A B|          26|        191|          4|
+-------------+------------+-----------+-----------+

Then I am taking a total, but if one of the values is null, the result is also null. How can I solve this?

df_data_csv_withTotalCol = df_data_csv_casted.withColumn(
    'Total_Matches',
    df_data_csv_casted['Test_Matches'] + df_data_csv_casted['ODI_Matches'] + df_data_csv_casted['T20_Matches'])

+-------------+------------+-----------+-----------+-------------+
|Player_Name  |Test_Matches|ODI_Matches|T20_Matches|Total_Matches|
+-------------+------------+-----------+-----------+-------------+
| Aaron, V R  |           9|          9|       null|         null|
|Abid Ali, S  |          29|          5|       null|         null|
|Adhikari, H R|          21|       null|       null|         null|
|Agarkar, A B |          26|        191|          4|          221|
+-------------+------------+-----------+-----------+-------------+
vinu.m.19
    Fill the null values with 0. [How to replace null values](https://stackoverflow.com/questions/42312042/how-to-replace-all-null-values-of-a-dataframe-in-pyspark) – pault Apr 06 '19 at 20:47

1 Answer


You can fix this by using the coalesce function. For example, let's create some sample data:

from pyspark.sql.functions import coalesce, lit

cDf = spark.createDataFrame([(None, None), (1, None), (None, 2)], ("a", "b"))
cDf.show()

+----+----+
|   a|   b|
+----+----+
|null|null|
|   1|null|
|null|   2|
+----+----+

When I do a simple sum, as you did:

cDf.withColumn('Total', cDf.a + cDf.b).show()

I get the total as null, the same as you described:

+----+----+-----+
|   a|   b|Total|
+----+----+-----+
|null|null| null|
|   1|null| null|
|null|   2| null|
+----+----+-----+

To fix this, use coalesce along with the lit function, which replaces the null values with zeroes:

cDf.withColumn('Total', coalesce(cDf.a, lit(0)) + coalesce(cDf.b, lit(0))).show()

This gives the correct results:

+----+----+-----+
|   a|   b|Total|
+----+----+-----+
+----+----+-----+
|null|null|    0|
|   1|null|    1|
|null|   2|    2|
+----+----+-----+
Pushkr
  • Thanks Pushkr for the feedback. However, it's not working when the DataFrame is created by loading a file, but it does work if we create the data programmatically as you did above. – vinu.m.19 Apr 16 '19 at 19:52