I have saved a Spark DataFrame to AWS S3 in Parquet format, partitioned by the column "channel_name". Below is the code I used to write to S3:
df.write.option("header",True) \
.partitionBy("channel_name") \
.mode("overwrite") \
.parquet("s3://path/")
Here is a sample of the DataFrame that was written:

| channel_name | start_timestamp | value | Outlier |
| --- | --- | --- | --- |
| TEMP | 2021-07-19 07:27:51 | 21 | false |
| TEMP | 2021-07-19 08:21:05 | 24 | false |
| Vel | 2021-07-19 08:20:18 | 22 | false |
| Vel | 2021-07-19 08:21:54 | 26 | false |
| TEMP | 2021-07-19 08:21:23 | 25 | false |
| TEMP | 2021-07-16 08:22:41 | 88 | false |
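For reference, partitionBy stores the partition values in Hive-style directory names rather than inside the Parquet files themselves, so the layout on S3 looks roughly like the sketch below (the file names are illustrative, not taken from my bucket):

s3://path/channel_name=TEMP/part-00000-xxxx.snappy.parquet
s3://path/channel_name=Vel/part-00000-xxxx.snappy.parquet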
Since the data was partitioned by "channel_name", that column is missing when I read the same data back from S3. Below is my code for PySpark and for Python.
df = spark.read.parquet("s3://Path/") #spark
For Python I am using AWS Data Wrangler:
import awswrangler as wr
df = wr.s3.read_parquet(path="s3://Path/")
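For context, here is a minimal sketch of the awswrangler read I would expect to bring the partition column back; the dataset=True flag is an assumption on my part based on the awswrangler documentation, not something I have verified against this data:

import awswrangler as wr

# dataset=True treats the path as a partitioned (Hive-style) dataset and
# adds the partition columns back to the returned DataFrame
df = wr.s3.read_parquet(path="s3://Path/", dataset=True)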
This is what df looks like without the column "channel_name":
| start_timestamp | value | Outlier |
| --- | --- | --- |
| 2021-07-19 07:27:51 | 21 | false |
| 2021-07-19 08:21:05 | 24 | false |
| 2021-07-19 08:20:18 | 22 | false |
| 2021-07-19 08:21:54 | 26 | false |
| 2021-07-19 08:21:23 | 25 | false |
| 2021-07-16 08:22:41 | 88 | false |
How can I read the complete data, including the partition column? Please let me know if there is an alternative.