In my application, I am creating separate DataFrames from data in different locations on S3 and then merging them into a single DataFrame. Right now I am using a for loop for this, but I have a feeling it could be done much more efficiently with map and reduce functions in PySpark. Here's my code:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext, GroupedData
import pandas as pd
from datetime import datetime
sparkConf = SparkConf().setAppName('myTestApp')
sc = SparkContext(conf=sparkConf)
sqlContext = SQLContext(sc)
filepath = 's3n://my-s3-bucket/report_date='
date_from = pd.to_datetime('2016-08-01',format='%Y-%m-%d')
date_to = pd.to_datetime('2016-08-22',format='%Y-%m-%d')
datelist = pd.date_range(date_from, date_to)
First = True
#THIS is the for-loop I want to get rid of
for dt in datelist:
    date_string = datetime.strftime(dt, '%Y-%m-%d')
    print('Running the pyspark - Data read for the date - ' + date_string)
    df = sqlContext.read.format("com.databricks.spark.csv").options(header="false", inferschema="true", delimiter="\t").load(filepath + date_string + '/*.gz')
    if First:
        First = False
        df_Full = df
    else:
        df_Full = df_Full.unionAll(df)
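For reference, something along these lines is what I had in mind, using functools.reduce over a list of per-date DataFrames with the same reader options as above (I'm not sure whether it would actually be any faster than the loop, since unionAll is lazy either way):

from functools import reduce

# Build one DataFrame per date, then fold them together with unionAll.
dataframes = [
    sqlContext.read.format("com.databricks.spark.csv")
              .options(header="false", inferschema="true", delimiter="\t")
              .load(filepath + datetime.strftime(dt, '%Y-%m-%d') + '/*.gz')
    for dt in datelist
]
df_Full = reduce(lambda left, right: left.unionAll(right), dataframes)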