I am using the deltalake 0.4.5 Python library to read .parquet files into a DeltaTable and then convert it into a pandas DataFrame, following the instructions here: https://pypi.org/project/deltalake/.
Here's the Python code to do this:
from deltalake import DeltaTable
table_path = "s3://example_bucket/data/poc"
dt = DeltaTable(table_path)
files = dt.files() # OK, returns the list of parquet files with full s3 path
# ['s3://example_bucket/data/poc/y=2021/m=4/d=13/h=16/part-00001-8765abc67.parquet',
# 's3://example_bucket/data/poc/y=2021/m=4/d=13/h=16/part-00002-7643adc87.parquet',
# ........]
total_file_count = len(files) # OK, returns 115530
pt = dt.to_pyarrow_table() # hangs
df = dt.to_pyarrow_table().to_pandas() # hangs
I believe it hangs because the number of files is so high (115K+). So for my PoC, I wanted to read files for only a single day or hour. I tried setting the table_path variable to a path down to the hour partition, but that raises a "Not a Delta table" error, as shown below:
table_path = "s3://example_bucket/data/poc/y=2021/m=4/d=13/h=16"
dt = DeltaTable(table_path)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib64/python3.7/site-packages/deltalake/table.py", line 19, in __init__
self._table = RawDeltaTable(table_path, version=version)
deltalake.PyDeltaTableError: Not a Delta table
How can I achieve this? If the deltalake Python library can't be used for this, what other tools/libraries should I try?
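For context, here is a minimal sketch of the workaround I have in mind: filter the list returned by dt.files() down to a single partition prefix, then read only those files directly. The helper name files_for_partition and the hard-coded paths are illustrative assumptions, not part of the deltalake API.

```python
# Hypothetical workaround sketch: select only the parquet files under one
# partition prefix, instead of materializing the whole table at once.

def files_for_partition(files, prefix):
    """Keep only the files whose S3 path starts with the given partition prefix."""
    return [f for f in files if f.startswith(prefix)]

# In practice this list would come from dt.files(); hard-coded here for illustration.
files = [
    "s3://example_bucket/data/poc/y=2021/m=4/d=13/h=16/part-00001-8765abc67.parquet",
    "s3://example_bucket/data/poc/y=2021/m=4/d=13/h=17/part-00002-7643adc87.parquet",
]

hour_files = files_for_partition(
    files, "s3://example_bucket/data/poc/y=2021/m=4/d=13/h=16/"
)
print(hour_files)  # only the h=16 file remains
```

Each selected file could then presumably be read individually (e.g. with pandas.read_parquet, assuming s3fs is installed) and the results concatenated, but I'm not sure this is the idiomatic approach.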