How to read a Parquet file partitioned by date folder into a dataframe from S3 using Python?

Question

Using Python, I need to navigate to the cwp folder, go into a date folder, and read the Parquet file. I have the following folder structure in S3.

Sample s3 path:

bucket name = lla.analytics.dev

path = bigdata/dna/fixed/cwp/dt=YYYY-MM-DD/file.parquet

s3://lla.analytics.dev/bigdata/dna/fixed/cwp/dt=2021-11-24/file.parquet
                                             dt=2021-11-25/file.parquet
                                             dt=2021-11-26/file.parquet
                                             ........................
                                             ........................
                                             dt=YYYY-MM-DD/file.parquet

I need to access the most recent date folder and read its files from S3 into a dataframe.

Try adding these two lines before the call to `fp_obj.to_pandas()`: `import pandas` and `print(pandas.__version__)` – Dec 07 '21 at 18:25

score 2 · Answer 1 · answered Dec 07 '21 at 23:20

I see you have pyarrow tagged. If you would like to use pyarrow (disclaimer, I work with pyarrow), you should be able to do:

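import pyarrow.fs as fs
import pyarrow.dataset as ds

# Resolve the S3 filesystem and the bucket-relative path from the URI
s3, path = fs.FileSystem.from_uri("s3://lla.analytics.dev/bigdata/dna/fixed/cwp")
# Discover the dt=YYYY-MM-DD folders as hive-style partitions
dataset = ds.dataset(path, partitioning='hive', filesystem=s3, format='parquet')
# Read every partition into a single pyarrow Table
table = dataset.to_table()
# Convert to a pandas DataFrame (requires pandas to be installed)
df = table.to_pandas()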

There are many more details in pyarrow's filesystem docs and tabular dataset docs. There are also recipes for this in the pyarrow cookbook.

Pace


I'm getting this error: When getting information for key 'Bigdata/DNA/fixed/cwp' in bucket 'lla.analytics.dev': AWS Error [code 15]: No response body. – Pavithra Kannan Dec 08 '21 at 06:13
The dataset line raises an error; the code above is not working. – Pavithra Kannan Dec 08 '21 at 06:14
My best guess would be a permissions issue or a region issue. By default that code will use your region and secret key from the same configuration files the [AWS CLI uses](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html). Are you able to list the files in that bucket using the AWS CLI? – Pace Dec 08 '21 at 06:28
"Unable to import module 'lambda_function': No module named 'pyarrow._dataset'", – Pavithra Kannan Dec 08 '21 at 07:18
I'm trying to execute the above code in Lambda but getting this error – Pavithra Kannan Dec 08 '21 at 07:18
That error would suggest that pyarrow is not properly installed on the lambda image. It's difficult to say what could be the cause without knowing a lot more about how the image is created. – Pace Dec 08 '21 at 07:45
Is there any other approach to reading date-partitioned Parquet files from S3 into a dataframe? – Pavithra Kannan Dec 08 '21 at 08:15
I need to access the most recent date folder in S3 and read its Parquet files into a dataframe. Is there a better way to do this? – Pavithra Kannan Dec 08 '21 at 08:32
There are a number of tools out there that can read partitioned parquet data but I'm not familiar with all of them and couldn't begin to say which is the best. – Pace Dec 08 '21 at 19:25
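As one possible alternative for the "most recent date folder" part, here is a minimal sketch that finds the newest `dt=` prefix through the S3 `list_objects_v2` API. It is not taken from the thread. The function name `latest_partition_prefix` is hypothetical, the client is passed in by the caller (for example one created with `boto3.client("s3")`), and the folder names are again assumed to be ISO dates so that string order matches date order. Note that S3 keys are case-sensitive: the error quoted above mentions 'Bigdata/DNA/fixed/cwp', while the question shows 'bigdata/dna/fixed/cwp', so the prefix should match the bucket exactly.

```python
def latest_partition_prefix(client, bucket, base_prefix, key="dt"):
    """Return the newest `key=...` prefix under base_prefix.

    Hypothetical helper; `client` is expected to behave like a boto3 S3 client.
    """
    if not base_prefix.endswith("/"):
        base_prefix += "/"

    # With Delimiter="/", S3 groups keys by "folder" and reports each group
    # once under CommonPrefixes instead of listing every object.
    paginator = client.get_paginator("list_objects_v2")
    pages = paginator.paginate(
        Bucket=bucket, Prefix=f"{base_prefix}{key}=", Delimiter="/"
    )

    prefixes = []
    for page in pages:
        for common in page.get("CommonPrefixes", []):
            prefixes.append(common["Prefix"])
    if not prefixes:
        raise FileNotFoundError(
            f"no {key}=... prefixes under s3://{bucket}/{base_prefix}"
        )

    # All prefixes share the same leading part, so the largest string
    # corresponds to the latest ISO date.
    return max(prefixes)
```

The returned prefix can then be handed to whichever Parquet reader is available. For instance, `pandas.read_parquet(f"s3://{bucket}/{prefix}")` should work if s3fs and a Parquet engine such as pyarrow or fastparquet are installed, which is an assumption about the environment.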
