I am trying to load a pd.DataFrame by reading it from a Parquet file on AWS S3.
I am applying partition_filter to get only the data that matches the condition I want. In this particular case, the df column df['source'] must be equal to the crawler_name string variable.
I call wr.s3.read_parquet in the following way:
import logging
from typing import Optional

import awswrangler as wr
import pandas as pd

logger = logging.getLogger(__name__)


def read_aws_parquet(
    aws_uri: str,
    crawler_name: str,
    year: str,
    month: str,
    day: Optional[str],
    columns: list = ['url', 'crawl_timestamp', 'source'],
) -> pd.DataFrame:
    # Keep only the entries whose "source" equals crawler_name
    my_filter = lambda x: True if x["source"] == crawler_name else False
    # my_filter = lambda x: x["source"] == crawler_name
    logger.warning(f'{aws_uri} is a valid url')
    return wr.s3.read_parquet(
        path=aws_uri,
        path_suffix=".snappy.parquet",
        use_threads=True,
        columns=columns,  # 'deactivated_timestamp' must be added later
        partition_filter=my_filter,
        dataset=True
        # pyarrow_additional_kwargs={"filters": [('source', '=', crawler_name)]}
    )
If I don't apply the partition_filter, I receive a complete DataFrame that contains the crawler_name several times in the 'source' column. That is how I verified that this value exists. But if I apply the filter, I get the following error:
  File "/Users/administrator/Library/Caches/pypoetry/virtualenvs/pf-11294-manual-review-of-the-tracking-url-MC2_y5HC-py3.10/lib/python3.10/site-packages/awswrangler/s3/_read_parquet.py", line 749, in read_parquet
    paths = _apply_partition_filter(path_root=path_root, paths=paths, filter_func=partition_filter)
  File "/Users/administrator/Library/Caches/pypoetry/virtualenvs/pf-11294-manual-review-of-the-tracking-url-MC2_y5HC-py3.10/lib/python3.10/site-packages/awswrangler/s3/_read.py", line 88, in _apply_partition_filter
    return [p for p in paths if filter_func(_extract_partitions_from_path(path_root=path_root, path=p)) is True]
  File "/Users/administrator/Library/Caches/pypoetry/virtualenvs/pf-11294-manual-review-of-the-tracking-url-MC2_y5HC-py3.10/lib/python3.10/site-packages/awswrangler/s3/_read.py", line 88, in <listcomp>
    return [p for p in paths if filter_func(_extract_partitions_from_path(path_root=path_root, path=p)) is True]
  File "<input>", line 19, in <lambda>
KeyError: 'source'
What is wrong with my partition filter?
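For context, the awswrangler documentation describes partition_filter as a callback that receives a dict of partition names and values taken from the S3 path, not the values of the DataFrame columns. The sketch below is a simplified stand-in for that behaviour, not the library's actual code: the helper names extract_partitions and apply_partition_filter, the bucket, and the year=/month=/day= layout are all assumptions made for illustration. Under those assumptions it reproduces the same KeyError whenever the path has no source=... segment.

```python
from typing import Callable, Dict, List


def extract_partitions(path_root: str, path: str) -> Dict[str, str]:
    """Hypothetical helper: collect Hive-style key=value directory segments below path_root."""
    relative = path[len(path_root):] if path.startswith(path_root) else path
    partitions: Dict[str, str] = {}
    # The last segment is the file name, so only directory segments are inspected.
    for segment in relative.strip("/").split("/")[:-1]:
        if "=" in segment:
            key, value = segment.split("=", 1)
            partitions[key] = value
    return partitions


def apply_partition_filter(
    path_root: str, paths: List[str], filter_func: Callable[[Dict[str, str]], bool]
) -> List[str]:
    """Hypothetical helper: keep the paths whose partition dict passes filter_func."""
    return [p for p in paths if filter_func(extract_partitions(path_root, p)) is True]


# Assumed layout: the dataset is partitioned by year/month/day only.
ROOT = "s3://example-bucket/crawls/"
PATHS = [
    ROOT + "year=2023/month=01/day=05/part-0.snappy.parquet",
    ROOT + "year=2023/month=02/day=11/part-0.snappy.parquet",
]

# Filtering on a key that appears in the path works.
january = apply_partition_filter(ROOT, PATHS, lambda x: x["month"] == "01")

# Filtering on a regular column that is not part of the path fails,
# because the dict passed to the filter has no "source" key.
try:
    apply_partition_filter(ROOT, PATHS, lambda x: x["source"] == "my_crawler")
    error = None
except KeyError as exc:
    error = exc
```

If 'source' is an ordinary column rather than a partition key, one option would be to read without partition_filter and filter the resulting DataFrame on df['source'] afterwards.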