Using Python, I need to navigate to the cwp folder, go into a date folder, and read the parquet file. I have the following folder structure in S3.

Sample s3 path:

bucket name = lla.analytics.dev

path = bigdata/dna/fixed/cwp/dt=YYYY-MM-DD/file.parquet

s3://lla.analytics.dev/bigdata/dna/fixed/cwp/dt=2021-11-24/file.parquet
                                             dt=2021-11-25/file.parquet
                                             dt=2021-11-26/file.parquet
                                             ........................
                                             ........................
                                             dt=YYYY-MM-DD/file.parquet

I need to access the most recent date folder and read its files from S3 into a dataframe.

1 Answer

I see you have pyarrow tagged. If you would like to use pyarrow (disclaimer, I work with pyarrow), you should be able to do:

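import pyarrow.fs as fs
import pyarrow.dataset as ds

# Split the URI into an S3 filesystem object and the path inside the bucket.
s3, path = fs.FileSystem.from_uri("s3://lla.analytics.dev/bigdata/dna/fixed/cwp")
# 'hive' partitioning parses the dt=YYYY-MM-DD folder names into a 'dt' column.
dataset = ds.dataset(path, partitioning='hive', filesystem=s3, format='parquet')
# Note: this loads every dt= partition, not only the most recent one.
table = dataset.to_table()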

There are a lot more details in pyarrow's filesystem docs and tabular dataset docs. There are also recipes for this in the pyarrow cookbook.
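The snippet above reads all date folders, while the question asks for only the most recent one. A minimal sketch of one way to narrow it down follows, assuming the folders are always named dt=YYYY-MM-DD so that the names sort chronologically as plain strings. The helper names latest_partition and read_latest are hypothetical and not part of pyarrow, and read_latest additionally assumes pandas is installed and that the S3 credentials and region resolve correctly.

```python
import re

# Matches hive-style partition folder names such as "dt=2021-11-26".
_DT_DIR = re.compile(r"^dt=(\d{4}-\d{2}-\d{2})$")


def latest_partition(dir_paths):
    """Return the path whose trailing dt=YYYY-MM-DD folder is newest, or None."""
    dated = []
    for path in dir_paths:
        # Look only at the last path component, e.g. "dt=2021-11-26".
        base = path.rstrip("/").rsplit("/", 1)[-1]
        match = _DT_DIR.match(base)
        if match:
            # ISO dates compare chronologically as plain strings.
            dated.append((match.group(1), path))
    return max(dated)[1] if dated else None


def read_latest(uri):
    """Read the parquet files under the newest dt= folder into a DataFrame."""
    # Imported here so latest_partition stays usable without pyarrow.
    import pyarrow.dataset as ds
    import pyarrow.fs as fs

    s3, base = fs.FileSystem.from_uri(uri)
    # List only the immediate children of the cwp folder.
    infos = s3.get_file_info(fs.FileSelector(base, recursive=False))
    dirs = [info.path for info in infos if info.type == fs.FileType.Directory]
    newest = latest_partition(dirs)
    if newest is None:
        raise FileNotFoundError(f"no dt=YYYY-MM-DD folders under {uri}")
    # Read just that one folder; the dt value is not added as a column here.
    return ds.dataset(newest, filesystem=s3, format="parquet").to_table().to_pandas()
```

With working credentials, the call would look like read_latest("s3://lla.analytics.dev/bigdata/dna/fixed/cwp"). Listing one level of folders first avoids scanning the metadata of every partition just to find the newest date.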

Pace
  • I'm getting this error: When getting information for key 'Bigdata/DNA/fixed/cwp' in bucket 'lla.analytics.dev': AWS Error [code 15]: No response body. – Pavithra Kannan Dec 08 '21 at 06:13
  • The dataset line is raising an error; the code above is not working. – Pavithra Kannan Dec 08 '21 at 06:14
  • My best guess would be a permissions issue or a region issue. By default that code will use your region and secret key from the same configuration files the [AWS CLI uses](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html). Are you able to list the files in that bucket using the AWS CLI? – Pace Dec 08 '21 at 06:28
  • "Unable to import module 'lambda_function': No module named 'pyarrow._dataset'" – Pavithra Kannan Dec 08 '21 at 07:18
  • I'm trying to execute the above code in Lambda but I'm getting this error. – Pavithra Kannan Dec 08 '21 at 07:18
  • That error would suggest that pyarrow is not properly installed on the lambda image. It's difficult to say what could be the cause without knowing a lot more about how the image is created. – Pace Dec 08 '21 at 07:45
  • Is there any other approach for reading date-partitioned parquet files from S3 into a dataframe? – Pavithra Kannan Dec 08 '21 at 08:15
  • I need to access the most recent date folder in S3 and read its parquet files into a dataframe. Is there a better approach for this? – Pavithra Kannan Dec 08 '21 at 08:32
  • There are a number of tools out there that can read partitioned parquet data, but I'm not familiar with all of them and couldn't begin to say which is the best. – Pace Dec 08 '21 at 19:25