
I am trying to access a Parquet file in an S3 bucket using PySpark locally via PyCharm. I have the AWS Toolkit configured in PyCharm and the access key and secret key added to ~/.aws/credentials, yet the credentials are not getting picked up, and I get the error "Unable to load AWS credentials from any provider in the chain".

import os
from pyspark.sql import SparkSession

# pull the S3A connector and AWS SDK jars when the JVM is launched
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell'

spark = SparkSession.builder \
    .appName('Pyspark').getOrCreate()

my_df = spark.read \
    .parquet("s3a://<parquet_file_location>")  # using "s3://" instead fails with a "No FileSystem for scheme" error

my_df.printSchema()

Is there an alternative approach to run PySpark locally and access AWS resources?

Also, I expected to be able to use s3 in the Parquet path, but that throws a "file system not found" error. Does any dependency or jar file need to be added to run PySpark locally?

DataWrangler

1 Answer


If you set the secrets in the AWS_ environment variables (AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY), they will be picked up and then propagated with the job. Otherwise you can set them in spark-defaults.conf with the appropriate spark.hadoop.fs.s3a.access.key and spark.hadoop.fs.s3a.secret.key entries.
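For illustration, a minimal sketch of the second route, setting the same spark.hadoop.fs.s3a.* keys programmatically on the session builder instead of in spark-defaults.conf; the package versions are carried over from the question, and the key values are placeholders:

import os
from pyspark.sql import SparkSession

# pull the S3A connector and AWS SDK onto the classpath before the JVM starts
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell'

spark = (
    SparkSession.builder
    .appName('Pyspark')
    # same keys the answer names for spark-defaults.conf; values below are placeholders
    .config('spark.hadoop.fs.s3a.access.key', '<your_access_key>')
    .config('spark.hadoop.fs.s3a.secret.key', '<your_secret_key>')
    .getOrCreate()
)

my_df = spark.read.parquet("s3a://<parquet_file_location>")
my_df.printSchema()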

stevel
  • Had tried the environment variables previously, and also tried passing the configuration settings during Spark session creation; didn't help, same issue as before. I am trying to run Spark locally, there is no Hadoop setup. – DataWrangler Sep 10 '20 at 15:15
  • Spark uses the Hadoop jars; you have them. That is where the S3A connector is implemented. If you don't have hadoop-aws.jar on the classpath and a consistent version of the AWS SDK, you aren't going to get anywhere. – stevel Sep 11 '20 at 14:31
  • The jars are pulled using the following code `os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell'`. What I meant by not having a Hadoop setup is that I am trying to run this as a local instance (Mac/Windows). – DataWrangler Sep 16 '20 at 08:09
  • 1
    joby, I understand.Just remember that spark uses the hadoop libraries to talk to filesystems and object stores. – stevel Sep 17 '20 at 17:20
  • Ah, I thought my statements were misleading; sorry I had to stress the local instance run again :) I am still getting the same issue, having tried most of the possible ways to pass the credentials to Spark... – DataWrangler Sep 20 '20 at 13:04
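One further sketch of how the keys can reach the Hadoop configuration that the S3A connector actually reads, in case the builder options are not being propagated: set them on the existing session's Hadoop configuration. This is not the answer's stated method, and it relies on the private `_jsc` gateway, so treat it as a workaround rather than a public API; the key values are placeholders.

# assumes `spark` was created with hadoop-aws and the AWS SDK on the classpath, as in the question
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set('fs.s3a.access.key', '<your_access_key>')   # placeholder
hadoop_conf.set('fs.s3a.secret.key', '<your_secret_key>')   # placeholder

my_df = spark.read.parquet("s3a://<parquet_file_location>")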