
I am new to Spark and have not been able to find this anywhere. I have a lot of parquet files uploaded into S3 at this location:

s3://a-dps/d-l/sco/alpha/20160930/parquet/

The total size of this folder is 20+ GB. How can I chunk these files and load them all into a dataframe?

The memory allocated to the Spark cluster is 6 GB.

    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    from pyspark import SparkConf
    from pyspark.sql import SparkSession
    import pandas
    # SparkConf().set("spark.jars.packages","org.apache.hadoop:hadoop-aws:3.0.0-alpha3")
    sc = SparkContext.getOrCreate()

    sc._jsc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", 'A')
    sc._jsc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey", 's')

    sqlContext = SQLContext(sc)
    df2 = sqlContext.read.parquet("s3://sm/data/scor/alpha/2016/parquet/*")

Error:


    Py4JJavaError: An error occurred while calling o33.parquet.
    : java.io.IOException: No FileSystem for scheme: s3
        at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:372)
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:370)
        at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
        at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
        at scala.collection.immutable.List.foreach(List.scala:381)
        at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
        at scala.collection.immutable.List.flatMap(List.scala:344)

 
Viv
  • also needed to add the packages in spark folder : org.apache.hadoop:hadoop-aws:3.0.0-alpha3, org.apache.httpcomponents:httpclient:4.3.6, org.apache.httpcomponents:httpcore:4.3.3, com.amazonaws:aws-java-sdk-core:1.10.27, com.amazonaws:aws-java-sdk-s3:1.10.27, com.amazonaws:aws-java-sdk-sts:1.10.27 – Viv Jun 20 '17 at 10:08
  • Maybe this gist can help you: https://gist.github.com/asmaier/5768c7cda3620901440a62248614bbd0 – asmaier Sep 08 '17 at 16:38
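
For reference, the packages listed in the first comment above can also be pulled in through spark.jars.packages rather than copied into the Spark folder. A minimal sketch, reusing the versions from that comment (they still have to match the Hadoop build your Spark ships with):

    from pyspark.sql import SparkSession

    # Maven coordinates copied from the comment above; adjust the versions
    # so that hadoop-aws matches the Hadoop bundled with your Spark.
    packages = ",".join([
        "org.apache.hadoop:hadoop-aws:3.0.0-alpha3",
        "org.apache.httpcomponents:httpclient:4.3.6",
        "org.apache.httpcomponents:httpcore:4.3.3",
        "com.amazonaws:aws-java-sdk-core:1.10.27",
        "com.amazonaws:aws-java-sdk-s3:1.10.27",
        "com.amazonaws:aws-java-sdk-sts:1.10.27",
    ])

    spark = (SparkSession.builder
             .appName("s3-parquet")
             .config("spark.jars.packages", packages)
             .getOrCreate())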

2 Answers

28

You have to use SparkSession instead of sqlContext since Spark 2.0:

    spark = SparkSession.builder \
                        .master("local") \
                        .appName("app name") \
                        .config("spark.some.config.option", True) \
                        .getOrCreate()

    df = spark.read.parquet("s3://path/to/parquet/file.parquet")
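
As a usage note, read.parquet can be pointed at the folder itself rather than a single file, and Spark will pick up every parquet part file underneath it. A minimal sketch with a placeholder s3a path:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("app name").getOrCreate()

    # Placeholder path: pointing read.parquet at the folder loads every
    # parquet part file under it into one dataframe.
    df = spark.read.parquet("s3a://bucket-name/path/to/parquet/")
    df.printSchema()  # schema comes from the parquet footers, no full scan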
Artem
25

The file scheme (s3) that you are using is not correct. You'll need to use the s3n scheme, or s3a (for larger S3 objects):

    // use sqlContext instead for Spark < 2
    val df = spark.read
                  .load("s3n://bucket-name/object-path")

I suggest that you read more about the Hadoop-AWS module: Integration with Amazon Web Services Overview.
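
A minimal PySpark sketch of the s3a route, assuming the hadoop-aws connector is on the classpath; the fs.s3a key names are the standard ones and the credential values are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("s3a-read").getOrCreate()

    # Placeholder credentials; an fs.s3a.endpoint setting can be added as
    # well if the bucket's region requires it (see the comments below).
    conf = spark._jsc.hadoopConfiguration()
    conf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
    conf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")

    df = spark.read.parquet("s3a://bucket-name/object-path")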

eliasah
    does this mean i should change something in AWS S3 to get the url to s3n instead of s3? OR can i blindly use s3n in the code in-place of s3 – Viv Jun 20 '17 at 07:49
  • if you have provided the credentials, s3n would be enough. Sometimes it might need you to provide an endpoint like I have described here https://stackoverflow.com/questions/44589563/unable-to-read-from-s3-bucket-using-spark/44590124#44590124 – eliasah Jun 20 '17 at 07:50
  • also how to chunk it? – Viv Jun 20 '17 at 07:50
  • what do you mean ? – eliasah Jun 20 '17 at 07:51
  • there are many .parquet files in that folder. which is total of 20+ gb, but my spark has 6 gb space only. So, it has to read into a df means i need to read in 4 chunks? – Viv Jun 20 '17 at 07:54
  • well it's not just about partitions, it also depends on what you need to do with that data. 6GB is too small compared to your dataset, there will be more than just partitioning to be able to perform a job... what is it you want to do with ? – eliasah Jun 20 '17 at 08:14
  • I want to load it into a df, do some changes in the columns and update the dataframe and finally push back to s3. – Viv Jun 20 '17 at 08:18
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/147134/discussion-between-eliasah-and-viv). – eliasah Jun 20 '17 at 08:21
  • I'm also trying to do the same thing where I need to retrieve subset of parquet files since my spark cluster has not enough memory. Did you find a solution? – haneulkim Nov 09 '22 at 01:31
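
To close the loop on the memory discussion above: as long as nothing like collect() or toPandas() pulls the whole dataset to the driver, Spark processes the parquet files partition by partition, so a 20+ GB input does not have to fit in 6 GB of memory at once for a simple column transform. A minimal sketch of the read / transform / write-back flow, with placeholder paths and a hypothetical column name:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("transform-parquet").getOrCreate()

    # Placeholder paths and a hypothetical "amount" column, for illustration.
    df = spark.read.parquet("s3a://bucket-name/input/parquet/")

    # Column changes are lazy and applied partition by partition, so the
    # full dataset never has to be resident in executor memory at once.
    updated = df.withColumn("amount", F.col("amount") * 2)

    updated.write.mode("overwrite").parquet("s3a://bucket-name/output/parquet/")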