I am trying to load a pd.DataFrame by reading it from a Parquet file stored in AWS S3.

I am using partition_filter to load only the rows that match a condition. In this particular case, the value in the 'source' column must be equal to the crawler_name string variable.

I call wr.s3.read_parquet in the following way:

import logging
from typing import Optional

import awswrangler as wr
import pandas as pd

logger = logging.getLogger(__name__)

def read_aws_parquet(aws_uri: str, crawler_name: str, year: str, month: str, day: Optional[str], columns: list = ['url', 'crawl_timestamp', 'source']) -> pd.DataFrame:

    # Filter that is supposed to keep only the data whose 'source' equals crawler_name
    my_filter = lambda x: True if x["source"] == crawler_name else False
    # my_filter = lambda x: x["source"] == crawler_name
    logger.warning(f'{aws_uri} is a valid url')
    return wr.s3.read_parquet(
        path=aws_uri,
        path_suffix=".snappy.parquet",
        use_threads=True,
        columns=columns,  # 'deactivated_timestamp' must be added later
        partition_filter=my_filter,
        dataset=True
        # pyarrow_additional_kwargs={"filters": [('source', '=', crawler_name)]}
    )

If I don't apply the partition_filter, I get the complete DataFrame, and its 'source' column contains crawler_name several times. That is how I verified that the value exists.

But if I apply the filter, I get the following error:

File "/Users/administrator/Library/Caches/pypoetry/virtualenvs/pf-11294-manual-review-of-the-tracking-url-MC2_y5HC-py3.10/lib/python3.10/site-packages/awswrangler/s3/_read_parquet.py", line 749, in read_parquet
    paths = _apply_partition_filter(path_root=path_root, paths=paths, filter_func=partition_filter)
  File "/Users/administrator/Library/Caches/pypoetry/virtualenvs/pf-11294-manual-review-of-the-tracking-url-MC2_y5HC-py3.10/lib/python3.10/site-packages/awswrangler/s3/_read.py", line 88, in _apply_partition_filter
    return [p for p in paths if filter_func(_extract_partitions_from_path(path_root=path_root, path=p)) is True]
  File "/Users/administrator/Library/Caches/pypoetry/virtualenvs/pf-11294-manual-review-of-the-tracking-url-MC2_y5HC-py3.10/lib/python3.10/site-packages/awswrangler/s3/_read.py", line 88, in <listcomp>
    return [p for p in paths if filter_func(_extract_partitions_from_path(path_root=path_root, path=p)) is True]
  File "<input>", line 19, in <lambda>
KeyError: 'source'
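
The traceback shows that the filter is not called with a row of the DataFrame but with the result of _extract_partitions_from_path, i.e. a dict built from the file path. Below is a minimal sketch of that behaviour, not the library's actual code. The helper extract_partitions, the bucket name, and the year=/month=/day= layout are all assumptions made up for illustration; the real layout of my dataset may differ. It only demonstrates that a key which is a regular column, and not part of the path, would be missing from the dict.

```python
from typing import Callable, Dict, List


def extract_partitions(path_root: str, path: str) -> Dict[str, str]:
    """Hypothetical stand-in: collect key=value directory segments below path_root."""
    relative = path[len(path_root):]
    partitions: Dict[str, str] = {}
    for segment in relative.split("/")[:-1]:  # skip the file name itself
        if "=" in segment:
            key, value = segment.split("=", 1)
            partitions[key] = value
    return partitions


def apply_partition_filter(
    path_root: str,
    paths: List[str],
    filter_func: Callable[[Dict[str, str]], bool],
) -> List[str]:
    """Keep only the paths whose partition dict passes filter_func."""
    return [p for p in paths if filter_func(extract_partitions(path_root, p)) is True]


# Assumed layout: the dataset is partitioned by year/month/day, not by 'source'.
root = "s3://example-bucket/crawls/"
paths = [
    root + "year=2023/month=01/day=05/part-0.snappy.parquet",
    root + "year=2023/month=02/day=11/part-0.snappy.parquet",
]

print(extract_partitions(root, paths[0]))
# {'year': '2023', 'month': '01', 'day': '05'}

# A filter on a key that is not in the path fails the same way as in the traceback.
strict_filter = lambda x: x["source"] == "my_crawler"
try:
    apply_partition_filter(root, paths, strict_filter)
    raised = False
except KeyError:
    raised = True

# A filter on a key that is in the path works.
february = apply_partition_filter(root, paths, lambda x: x["month"] == "02")
```

In this sketch, raised ends up True and february contains only the second path, which suggests the filter can only see values that are encoded in the path.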

What is wrong with my partition filter?

The Dan