
I'm trying to read data from a specific folder in my S3 bucket. The data is in Parquet format. To do that I'm using awswrangler:

import awswrangler as wr

# read data
data = wr.s3.read_parquet("s3://bucket-name/folder/with/parquet/files/", dataset=True)

This returns a pandas dataframe:

client_id   center  client_lat  client_lng  inserted_at  matrix_updated
0700292081   BFDR    -23.6077    -46.6617   2021-04-19     2021-04-19   
7100067781   BFDR    -23.6077    -46.6617   2021-04-19     2021-04-19   
7100067787   BFDR    -23.6077    -46.6617   2021-04-19     2021-04-19     

However, instead of a pandas DataFrame I would like to store the data retrieved from my S3 bucket in a Spark DataFrame. I've tried doing this (which is my own question), but it doesn't seem to work correctly.

I was wondering if there is any way I could store this data in a Spark DataFrame using awswrangler. Or, if you have an alternative, I would like to read about it.

  • Why not use the native `spark.read.parquet(PATH)` method? Another option is doing spark.createDataFrame(data). – Assaf Segev Jun 09 '21 at 18:10
  • I think `spark.read.parquet(PATH)` is for local files, and `spark.createDataFrame(data)` is not the best approach since the idea is to completely avoid using pandas dataframes. That is why I'm looking for a solution where I can directly store my data in a spark dataframe. – brenda Jun 09 '21 at 18:15
  • Not sure I understand. What do you mean "local files"? I read from s3 every day while files are in cloud. – Assaf Segev Jun 09 '21 at 18:17
  • I often use `spark.read.parquet(PATH)` to read files from my machine. I'm not sure how to use that code to read files from s3. – brenda Jun 09 '21 at 18:22
  • What I do is - `spark.read.parquet("s3://bucket-name/folder/with/parquet/files/")`. If the parquet files are there, it should work. – Assaf Segev Jun 09 '21 at 18:25
  • I get this error: `An error occurred while calling o33.parquet. : java.io.IOException: No FileSystem for scheme: s3`. – brenda Jun 09 '21 at 18:29
  • in that case, I think a library installation is needed. See this for example https://stackoverflow.com/questions/44629156/how-to-read-parquet-data-from-s3-to-spark-dataframe-python – Assaf Segev Jun 09 '21 at 18:32
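
Putting the comment thread together, here is a minimal sketch of reading the files directly into a Spark DataFrame over the `s3a://` connector, which avoids pandas entirely. The `hadoop-aws` version and the credential handling below are assumptions and must match your own Spark/Hadoop build:

from pyspark.sql import SparkSession

# Sketch only: pull in the S3 connector and pass credentials explicitly.
# The package version is an assumption -- use the one matching your Hadoop build.
spark = (
    SparkSession.builder
    .appName("read-s3-parquet")
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.2.0")
    .config("spark.hadoop.fs.s3a.access.key", "your key")
    .config("spark.hadoop.fs.s3a.secret.key", "your key")
    .getOrCreate()
)

# s3a:// is the scheme provided by hadoop-aws; a plain s3:// path raises
# "No FileSystem for scheme: s3" when no implementation is registered.
df = spark.read.parquet("s3a://bucket-name/folder/with/parquet/files/")
df.show()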

1 Answer


In the end I didn't use awswrangler. Instead, I used the following code, which I found on GitHub:

myAccessKey = 'your key'
mySecretKey = 'your key'

import os

# Pull in the S3 connector jars when the local PySpark shell starts
# (this must be set before the SparkContext is created)
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.10.34,org.apache.hadoop:hadoop-aws:2.6.0 pyspark-shell'

import pyspark
sc = pyspark.SparkContext("local[*]")

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

# Point the s3:// scheme at the native S3 filesystem and pass the credentials
hadoopConf = sc._jsc.hadoopConfiguration()
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3.awsAccessKeyId", myAccessKey)
hadoopConf.set("fs.s3.awsSecretAccessKey", mySecretKey)

# Read the parquet files from the bucket straight into a Spark DataFrame
df = sqlContext.read.parquet("s3://bucket-name/path/")
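
Note that this snippet targets an older Hadoop 2.x setup: `fs.s3.impl` is mapped to `NativeS3FileSystem` so that plain `s3://` paths resolve. On newer Spark/Hadoop builds the `s3a://` connector is the supported route, and `SQLContext` has been superseded by `SparkSession`, so the same read would typically be written as in the sketch after the comments above.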