
I have a PySpark DataFrame with a column of numbers (amount). I need to sum that column (amount) and have the result returned as an int in a Python variable.

Dataset looks like this:

date,p1id,p2id,amount,p3id
1/1/2013,U2_P1,p@c.com,100,P_P

I group by date and aggregate by sum:

groupby_revenue = df.groupby(['date']).sum()
print(groupby_revenue)

output:

DataFrame[visitdate: string]

Expected output:

1/1/2013    13000
1/1/2014    16090
1/1/2015    17330
user10389226

2 Answers


You need to collect it:

from pyspark.sql import functions as F

# collect() returns a list of Rows; [0][1] picks the sum from the first group
# (index 0 would be the date)
groupby_revenue = df.groupby(['date']).agg(F.sum('amount')).collect()[0][1]
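
If what you actually want is the grand total of the amount column as a Python int (rather than one sum per date), a minimal sketch, assuming the column is numeric or castable to a number, would be:

from pyspark.sql import functions as F

# Aggregate over the whole DataFrame (no groupBy), collect the single Row,
# and cast the value to a plain Python int
total = int(df.agg(F.sum('amount')).collect()[0][0])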
hamza tuna

When working with Spark you need to understand its execution model and its programming API (pyspark - http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html). It is totally different from pandas/Python execution. Spark relies on lazy evaluation: whenever you need to look at the data you have to trigger an action such as show, first, collect or take. Without one of these actions, you only get a DataFrame back with its schema (which is what happened in your question).

Let me walk you through a few things with an example:

process_df = sqlContext.createDataFrame([
    ['2013-01-01','U2_P1','p@c.com','100','P_P'],
    ['2013-01-01','U2_P2','p@c.com','100','P_P1'],
    ['2014-01-01','U2_P1','p@c.com','100','P_P'],
    ['2014-01-01','U2_P2','p@c.com','100','P_P1'],
    ['2015-01-01','U2_P1','p@c.com','100','P_P'],
    ['2015-01-01','U2_P2','p@c.com','100','P_P1']
], ['date','p1id','p2id','amount','p3id'])

# This prints the schema instead of the data
print(process_df)
DataFrame[date: string, p1id: string, p2id: string, amount: string, p3id: string]

#This prints data instead of schema
process_df.show()
+----------+-----+-------+------+----+
|      date| p1id|   p2id|amount|p3id|
+----------+-----+-------+------+----+
|2013-01-01|U2_P1|p@c.com|   100| P_P|
|2013-01-01|U2_P2|p@c.com|   100|P_P1|
|2014-01-01|U2_P1|p@c.com|   100| P_P|
|2014-01-01|U2_P2|p@c.com|   100|P_P1|
|2015-01-01|U2_P1|p@c.com|   100| P_P|
|2015-01-01|U2_P2|p@c.com|   100|P_P1|
+----------+-----+-------+------+----+

agg_data = process_df.groupby(['date']).agg({'amount':'sum'})

# This prints the schema instead of the data
print(agg_data)
DataFrame[date: string, sum(amount): double]

from pyspark.sql import functions as F

#This prints data instead of schema
agg_data.show()
+----------+-----------+
|      date|sum(amount)|
+----------+-----------+
|2015-01-01|      200.0|
|2014-01-01|      200.0|
|2013-01-01|      200.0|
+----------+-----------+

from pyspark.sql import functions as F
agg_data.select('date', F.col('sum(amount)').alias('sum')).show()
+----------+-----+
|      date|  sum|
+----------+-----+
|2015-01-01|200.0|
|2014-01-01|200.0|
|2013-01-01|200.0|
+----------+-----+
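
As a side note, you can also apply the alias inside the aggregation itself, so the result column never carries the sum(amount) name (a small sketch using the same process_df as above):

from pyspark.sql import functions as F

# Alias the aggregate directly, so the output column is already named 'sum'
process_df.groupby('date').agg(F.sum('amount').alias('sum')).show()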

The above only prints the data. If you need to get this data into Python, you need to use collect, take, first, or head. Here are a few examples:

print(agg_data.collect())
[Row(date=u'2015-01-01', sum(amount)=200.0),
 Row(date=u'2014-01-01', sum(amount)=200.0),
 Row(date=u'2013-01-01', sum(amount)=200.0)]
print(agg_data.first())
Row(date=u'2015-01-01', sum(amount)=200.0)
print(agg_data.take(1))
[Row(date=u'2015-01-01', sum(amount)=200.0)]
print(agg_data.head())
Row(date=u'2015-01-01', sum(amount)=200.0)

This is how you can bring data into Python and wrangle it there.
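
To tie this back to the original question, here is one way (a sketch, assuming the aggregated column is still named sum(amount)) to turn one of those Rows into a plain Python int:

first_row = agg_data.first()             # Row(date=u'2015-01-01', sum(amount)=200.0)
revenue = int(first_row['sum(amount)'])  # index the Row by column name, cast to int
print(revenue)                           # 200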

Hope this helps.
Rakesh Kumar
  • How do I store the output of this data into a SQL db from Spark? – user10389226 Sep 20 '18 at 08:53
  • @user10389226 I found this link for saving data into a SQL db: https://stackoverflow.com/questions/46552161/write-dataframe-to-mysql-table-using-pyspark. Basically, you need a connection string and options (a rough sketch follows below these comments). – Rakesh Kumar Sep 20 '18 at 09:03
  • How do I install the JDBC driver on Windows? While executing, it shows sc not defined (how do I import sqlContext)? – user10389226 Sep 20 '18 at 09:21
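
For the comment above about saving the result to a database, a rough sketch along the lines of the linked answer, assuming a MySQL JDBC URL, table name and credentials (all placeholders here) and the matching JDBC driver jar on Spark's classpath:

# All connection details below are placeholders - replace with your own
agg_data.write.jdbc(
    url='jdbc:mysql://localhost:3306/mydb',
    table='daily_revenue',
    mode='overwrite',
    properties={'user': 'myuser',
                'password': 'mypassword',
                'driver': 'com.mysql.jdbc.Driver'})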