Is there a way to load partitioned RC files stored in S3 into a PySpark DataFrame (Spark 2.0.0)?

This is one of the columnar file formats used to store data; it performs better than the CSV format. – braj Jan 06 '17 at 06:56

1 Answer


I have figured out a way to load RCFiles (from S3) into PySpark.

from pyspark.sql import SparkSession, HiveContext

# Build a session with Hive support (needed to create and query Hive tables)
spark = SparkSession.builder.master("yarn").appName("elevateDailyJob").enableHiveSupport().getOrCreate()
sc = spark.sparkContext
sqlContext = HiveContext(sc)

# Declare an external table over the partitioned RCFile data in S3
sqlContext.sql("CREATE EXTERNAL TABLE table1 (col1 string, col2 string, col3 string) "
               "PARTITIONED BY (DAYSERIAL_NUMERIC string) STORED AS RCFILE "
               "LOCATION 's3://my-databucket/my_file_rc/'")
df = sqlContext.sql("select * from table1")
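
One caveat worth noting: for an external partitioned table, the Hive metastore does not automatically discover the partition directories under the S3 location, so the select above can come back empty until the partitions are registered. A minimal sketch of one way to handle that, assuming the table1 definition above and a metastore that supports MSCK REPAIR TABLE:

# Scan the table's LOCATION and register any partition directories
# that are not yet known to the metastore
sqlContext.sql("MSCK REPAIR TABLE table1")

# The partition column can then be used for pruning
# ('20170106' is just an example partition value, not from the original post)
df = sqlContext.sql("select * from table1 where DAYSERIAL_NUMERIC = '20170106'")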

The above can be run using spark-submit. Note: you need to enable Hive support for EMR version 5.x onwards (as I have done when building the SparkSession above).
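
As a side note, on Spark 2.0 the separate HiveContext is not strictly required: once the session is created with enableHiveSupport(), the SparkSession itself can run the same Hive statements. A sketch of the equivalent query using the session directly (my assumption, not part of the original answer):

# spark was built with enableHiveSupport(), so it can query Hive tables directly
df = spark.sql("select * from table1")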
