
I have a PySpark DataFrame with a column of numbers (amount). I need to sum that column (amount) and have the result returned as an int in a Python variable.

Dataset looks like this:

date,p1id,p2id,amount,p3id
1/1/2013,U2_P1,p@c.com,100,P_P

I group by date and aggregate by sum:

groupby_revenue = df.groupby(['date']).sum()
print(groupby_revenue)

output:

DataFrame[visitdate: string]

Expected output:

1/1/2013    13000
1/1/2014    16090
1/1/2015    17330
user10389226

2 Answers


You need to collect it:

from pyspark.sql import functions as F

# collect() returns a list of Rows; [0][1] picks the sum from the first group
# (index 0 would be the date)
groupby_revenue = df.groupby(['date']).agg(F.sum('amount')).collect()[0][1]
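
If what you actually want is the grand total of the amount column as a Python int (rather than one sum per date), a minimal sketch, assuming the column is numeric or castable to a number, would be:

from pyspark.sql import functions as F

# Aggregate over the whole DataFrame (no groupBy), collect the single Row,
# and cast the value to a plain Python int
total = int(df.agg(F.sum('amount')).collect()[0][0])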
hamza tuna

When working with Spark you need to understand its execution model and its programming API (pyspark - http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html). It is totally different from pandas/Python execution. Spark relies on lazy evaluation: whenever you need to look at the data you have to trigger an action such as show, first, collect or take. Without one of these actions, you only get a DataFrame back with its schema (which is what happened in your question).

Let me walk you through a few things with an example:

process_df = sqlContext.createDataFrame([
    ['2013-01-01','U2_P1','p@c.com','100','P_P'],
    ['2013-01-01','U2_P2','p@c.com','100','P_P1'],
    ['2014-01-01','U2_P1','p@c.com','100','P_P'],
    ['2014-01-01','U2_P2','p@c.com','100','P_P1'],
    ['2015-01-01','U2_P1','p@c.com','100','P_P'],
    ['2015-01-01','U2_P2','p@c.com','100','P_P1']
], ['date','p1id','p2id','amount','p3id'])

# This prints the schema instead of the data
print(process_df)
DataFrame[date: string, p1id: string, p2id: string, amount: string, p3id: string]

#This prints data instead of schema
process_df.show()
+----------+-----+-------+------+----+
|      date| p1id|   p2id|amount|p3id|
+----------+-----+-------+------+----+
|2013-01-01|U2_P1|p@c.com|   100| P_P|
|2013-01-01|U2_P2|p@c.com|   100|P_P1|
|2014-01-01|U2_P1|p@c.com|   100| P_P|
|2014-01-01|U2_P2|p@c.com|   100|P_P1|
|2015-01-01|U2_P1|p@c.com|   100| P_P|
|2015-01-01|U2_P2|p@c.com|   100|P_P1|
+----------+-----+-------+------+----+

agg_data = process_df.groupby(['date']).agg({'amount':'sum'})

# This prints the schema instead of the data
print(agg_data)
DataFrame[date: string, sum(amount): double]

from pyspark.sql import functions as F

#This prints data instead of schema
agg_data.show()
+----------+-----------+
|      date|sum(amount)|
+----------+-----------+
|2015-01-01|      200.0|
|2014-01-01|      200.0|
|2013-01-01|      200.0|
+----------+-----------+

from pyspark.sql import functions as F
agg_data.select('date', F.col('sum(amount)').alias('sum')).show()
+----------+-----+
|      date|  sum|
+----------+-----+
|2015-01-01|200.0|
|2014-01-01|200.0|
|2013-01-01|200.0|
+----------+-----+
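
As a side note, you can also apply the alias inside the aggregation itself, so the result column never carries the sum(amount) name (a small sketch using the same process_df as above):

from pyspark.sql import functions as F

# Alias the aggregate directly, so the output column is already named 'sum'
process_df.groupby('date').agg(F.sum('amount').alias('sum')).show()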

The above only prints the data. If you need to get this data into Python, you need to use collect, take, first, or head. Here are a few examples:

print(agg_data.collect())
[Row(date=u'2015-01-01', sum(amount)=200.0),
 Row(date=u'2014-01-01', sum(amount)=200.0),
 Row(date=u'2013-01-01', sum(amount)=200.0)]
print(agg_data.first())
Row(date=u'2015-01-01', sum(amount)=200.0)
print(agg_data.take(1))
[Row(date=u'2015-01-01', sum(amount)=200.0)]
print(agg_data.head())
Row(date=u'2015-01-01', sum(amount)=200.0)

This is how you can bring data into Python and wrangle it there.
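
To tie this back to the original question, here is one way (a sketch, assuming the aggregated column is still named sum(amount)) to turn one of those Rows into a plain Python int:

first_row = agg_data.first()             # Row(date=u'2015-01-01', sum(amount)=200.0)
revenue = int(first_row['sum(amount)'])  # index the Row by column name, cast to int
print(revenue)                           # 200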

Hope this helps.
Rakesh Kumar
  • How do I store the output of this data into a SQL db from Spark? – user10389226 Sep 20 '18 at 08:53
  • @user10389226 I found this link for saving data into a SQL db: https://stackoverflow.com/questions/46552161/write-dataframe-to-mysql-table-using-pyspark. Basically, you need a connection string and options (a rough sketch follows below these comments). – Rakesh Kumar Sep 20 '18 at 09:03
  • How do I install the JDBC driver on Windows? While executing, it shows sc not defined (how do I import sqlContext)? – user10389226 Sep 20 '18 at 09:21
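
For the comment above about saving the result to a database, a rough sketch along the lines of the linked answer, assuming a MySQL JDBC URL, table name and credentials (all placeholders here) and the matching JDBC driver jar on Spark's classpath:

# All connection details below are placeholders - replace with your own
agg_data.write.jdbc(
    url='jdbc:mysql://localhost:3306/mydb',
    table='daily_revenue',
    mode='overwrite',
    properties={'user': 'myuser',
                'password': 'mypassword',
                'driver': 'com.mysql.jdbc.Driver'})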