
I have saved a Spark DataFrame to AWS S3 in Parquet format, partitioned by the column "channel_name". The code below is how I saved it to S3:

df.write.option("header",True) \
        .partitionBy("channel_name") \
        .mode("overwrite") \
        .parquet("s3://path/")
channel_name  start_timestamp      value  Outlier
TEMP          2021-07-19 07:27:51  21     false
TEMP          2021-07-19 08:21:05  24     false
Vel           2021-07-19 08:20:18  22     false
Vel           2021-07-19 08:21:54  26     false
TEMP          2021-07-19 08:21:23  25     false
TEMP          2021-07-16 08:22:41  88     false
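
Note that with partitionBy, Spark does not store "channel_name" inside the Parquet files themselves; it encodes each value in the directory names (Hive-style partitioning), so the layout on S3 looks roughly like this:

s3://path/channel_name=TEMP/part-00000-....snappy.parquet
s3://path/channel_name=Vel/part-00001-....snappy.parquet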

Since the data was partitioned by "channel_name", that column is now missing when I read the same data back from S3. Below is my code, for PySpark and for Python.

df = spark.read.parquet("s3://Path/") #spark
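
Reading the dataset root path should normally let Spark rediscover "channel_name" via partition discovery. As a sketch (assuming the same root path as above), printing the schema confirms whether discovery worked, and Spark's basePath option restores the partition column even when reading a single partition directory:

df.printSchema()  # channel_name should be listed here if discovery worked

# When reading one partition directory directly, set basePath so Spark
# still derives channel_name from the directory name:
df_temp = (spark.read
           .option("basePath", "s3://Path/")
           .parquet("s3://Path/channel_name=TEMP/"))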

For Python I am using AWS Wrangler:

import awswrangler as wr

df = wr.s3.read_parquet(path="s3://Path/")
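
By default, read_parquet reads the files as plain Parquet and ignores the partition directories. A minimal sketch, assuming the same path: passing dataset=True tells AWS Wrangler to treat the prefix as a Hive-partitioned dataset, which restores "channel_name" as a column:

import awswrangler as wr

# dataset=True parses the Hive-style partition directories
# (channel_name=...) and adds them back as DataFrame columns.
df = wr.s3.read_parquet(path="s3://Path/", dataset=True)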

This is how df looks without the column "channel_name":

start_timestamp      value  Outlier
2021-07-19 07:27:51  21     false
2021-07-19 08:21:05  24     false
2021-07-19 08:20:18  22     false
2021-07-19 08:21:54  26     false
2021-07-19 08:21:23  25     false
2021-07-16 08:22:41  88     false

How can I read the complete data, including the partition column? Please let me know if there is an alternative.
