I have saved a Spark DataFrame to AWS S3 in Parquet format, partitioned by the column "channel_name". Below is the code I used to write to S3:
df.write.option("header",True) \
.partitionBy("channel_name") \
.mode("overwrite") \
.parquet("s3://path/")
Here is a sample of the DataFrame that was written:

| channel_name | start_timestamp | value | Outlier |
| --- | --- | --- | --- |
| TEMP | 2021-07-19 07:27:51 | 21 | false |
| TEMP | 2021-07-19 08:21:05 | 24 | false |
| Vel | 2021-07-19 08:20:18 | 22 | false |
| Vel | 2021-07-19 08:21:54 | 26 | false |
| TEMP | 2021-07-19 08:21:23 | 25 | false |
| TEMP | 2021-07-16 08:22:41 | 88 | false |
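For reference, partitionBy stores the partition values in Hive-style directory names rather than inside the Parquet files themselves, so the layout on S3 looks roughly like the sketch below (the file names are illustrative, not taken from my bucket):

s3://path/channel_name=TEMP/part-00000-xxxx.snappy.parquet
s3://path/channel_name=Vel/part-00000-xxxx.snappy.parquet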
Since the data was partitioned by "channel_name", that column is missing when I read the same data back from S3. Below is my code for PySpark and for Python.
df = spark.read.parquet("s3://Path/") #spark
For Python I am using AWS Data Wrangler:
import awswrangler as wr
df = wr.s3.read_parquet(path="s3://Path/")
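For context, here is a minimal sketch of the awswrangler read I would expect to bring the partition column back; the dataset=True flag is an assumption on my part based on the awswrangler documentation, not something I have verified against this data:

import awswrangler as wr

# dataset=True treats the path as a partitioned (Hive-style) dataset and
# adds the partition columns back to the returned DataFrame
df = wr.s3.read_parquet(path="s3://Path/", dataset=True)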
This is what df looks like without the column "channel_name":
| start_timestamp | value | Outlier |
| --- | --- | --- |
| 2021-07-19 07:27:51 | 21 | false |
| 2021-07-19 08:21:05 | 24 | false |
| 2021-07-19 08:20:18 | 22 | false |
| 2021-07-19 08:21:54 | 26 | false |
| 2021-07-19 08:21:23 | 25 | false |
| 2021-07-16 08:22:41 | 88 | false |
How can I read the complete data, including the partition column? Please let me know if there is an alternative.