
I need to run some tests against different environments. In the tests I have to look through some directories in S3 for Parquet files and load them into a dictionary, like this:

import pyarrow.parquet as pq
import s3fs
from boto3 import Session

env = 'dev'
aws_profile = {'dev': 'dev_aws_profile', 'qa': 'qa_aws_profile'}

def get_dictionary_from_parquet(file_name):
    fs = s3fs.S3FileSystem()
    pq_session = Session(profile_name=aws_profile[env])
    s3 = pq_session.resource('s3')
    parquet_bucket = s3.Bucket(f'valid-bucket-name-{env}')
    paths = []
    # collect every key under the prefix that ends with file_name
    for pq_file in parquet_bucket.objects.filter(Prefix=f'valid-prefix-{env}'):
        if pq_file.key.endswith(file_name):
            paths.append(f's3://{pq_file.bucket_name}/{pq_file.key}')
    data_set = pq.ParquetDataset(paths, filesystem=fs)
    tbl = data_set.read()
    pq_dictionary = tbl.to_pydict()
    return pq_dictionary

It works perfectly when the selected profile is the default profile in the AWS credentials file, but otherwise it fails with:

line 14, in get_dictionary_from_parquet
    data_set = pq.ParquetDataset(paths, filesystem=fs)
  File "/Library/Python/3.7/site-packages/pyarrow/parquet.py", line 1170, in __init__
    open_file_func=partial(_open_dataset_file, self._metadata)
  File "/Library/Python/3.7/site-packages/pyarrow/parquet.py", line 1365, in _make_manifest
    .format(path))    
OSError: Passed non-file path: s3://<valid path to parquet file>

How can I pass the AWS profile credentials to pyarrow to fix this?

Alex Y

1 Answer


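It is weird that you are configuring the profile and doing your file filtering on a boto (boto3?) object, while using the s3fs instance to specify the filesystem when reading. The s3fs instance never sees your profile, so it reads with the default credentials. I recommend using s3fs for both.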

The following will fix it:

fs = s3fs.S3FileSystem(profile=aws_profile[env])

but I would suggest using the same instance to do your file listing too:

paths = fs.glob(f"valid-bucket-name-{env}/valid-prefix-{env}/*/{file_name}")

(or whatever the right glob pattern is - I had trouble parsing your code).
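
Putting the two suggestions together, a minimal sketch might look like the following. It is not a drop-in replacement: the helper name `list_parquet_paths`, the `env` parameter, and the assumption that `valid-prefix-{env}` behaves like a directory are mine, not from the question. It uses `fs.find()` plus an `endswith` check instead of a glob, since that mirrors the original boto filter without guessing the directory depth.

```python
AWS_PROFILES = {"dev": "dev_aws_profile", "qa": "qa_aws_profile"}


def list_parquet_paths(fs, env, file_name):
    """Return every path under the environment's prefix that ends with file_name."""
    # Assumption: the prefix can be treated as a directory root.
    root = f"valid-bucket-name-{env}/valid-prefix-{env}"
    # find() lists files recursively and returns "bucket/key" strings.
    return [path for path in fs.find(root) if path.endswith(file_name)]


def get_dictionary_from_parquet(file_name, env="dev"):
    # Imported here so the listing helper above can be tried without S3 access.
    import pyarrow.parquet as pq
    import s3fs

    # One filesystem object, bound to the named profile, lists and reads.
    fs = s3fs.S3FileSystem(profile=AWS_PROFILES[env])
    paths = list_parquet_paths(fs, env, file_name)
    return pq.ParquetDataset(paths, filesystem=fs).read().to_pydict()
```

Because the listing helper only needs an object with a `find` method, it can be checked in a test with a small stand-in filesystem before pointing it at a real bucket.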

joris
mdurant
  • I have some if/else logic to gather all the paths for the data set; I just copy-pasted part of the code and simplified it to make it shorter and more readable. Filtering is a separate function. But getting rid of the boto part is a good idea, thanks for your help – Alex Y Oct 28 '20 at 20:26