Is there a way to load partitioned RC files stored in S3 into a PySpark DataFrame (Spark 2.0.0)?

This is one of the columnar file formats used to store data; it performs better than the CSV format. – braj Jan 06 '17 at 06:56

1 Answer


I have figured out a way to load RCFiles (from S3) into PySpark.

from pyspark.sql import SparkSession, HiveContext

# Build a session with Hive support (needed to create and query Hive tables)
spark = SparkSession.builder.master("yarn").appName("elevateDailyJob").enableHiveSupport().getOrCreate()
sc = spark.sparkContext
sqlContext = HiveContext(sc)

# Declare an external table over the partitioned RCFile data in S3
sqlContext.sql("CREATE EXTERNAL TABLE table1 (col1 string, col2 string, col3 string) "
               "PARTITIONED BY (DAYSERIAL_NUMERIC string) STORED AS RCFILE "
               "LOCATION 's3://my-databucket/my_file_rc/'")
df = sqlContext.sql("select * from table1")
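
One caveat worth noting: for an external partitioned table, the Hive metastore does not automatically discover the partition directories under the S3 location, so the select above can come back empty until the partitions are registered. A minimal sketch of one way to handle that, assuming the table1 definition above and a metastore that supports MSCK REPAIR TABLE:

# Scan the table's LOCATION and register any partition directories
# that are not yet known to the metastore
sqlContext.sql("MSCK REPAIR TABLE table1")

# The partition column can then be used for pruning
# ('20170106' is just an example partition value, not from the original post)
df = sqlContext.sql("select * from table1 where DAYSERIAL_NUMERIC = '20170106'")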

The above can be run using spark-submit. Note: you need to enable Hive support for EMR version 5.x onwards (as I have done when building the SparkSession above).
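
As a side note, on Spark 2.0 the separate HiveContext is not strictly required: once the session is created with enableHiveSupport(), the SparkSession itself can run the same Hive statements. A sketch of the equivalent query using the session directly (my assumption, not part of the original answer):

# spark was built with enableHiveSupport(), so it can query Hive tables directly
df = spark.sql("select * from table1")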
