I am using the deltalake 0.4.5 Python library to read .parquet files into a DeltaTable and then convert it into a pandas DataFrame, following the instructions here: https://pypi.org/project/deltalake/.

Here's the Python code to do this:

from deltalake import DeltaTable

table_path = "s3://example_bucket/data/poc"
dt = DeltaTable(table_path)
files = dt.files()             # OK, returns the list of parquet files with full s3 path 
   # ['s3://example_bucket/data/poc/y=2021/m=4/d=13/h=16/part-00001-8765abc67.parquet', 
   #  's3://example_bucket/data/poc/y=2021/m=4/d=13/h=16/part-00002-7643adc87.parquet',
   #  ........]
total_file_count = len(files)  # OK, returns 115530

pt = dt.to_pyarrow_table()             # hangs
df = dt.to_pyarrow_table().to_pandas() # hangs

I believe it hangs because the number of files is so high (115K+).

So for my PoC, I wanted to read the files for only a single day or hour. I tried setting the table_path variable to the hour-level path, but that gives a "Not a Delta table" error, as shown below:

table_path = "s3://example_bucket/data/poc/y=2021/m=4/d=13/h=16"
dt = DeltaTable(table_path)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib64/python3.7/site-packages/deltalake/table.py", line 19, in __init__
    self._table = RawDeltaTable(table_path, version=version)
deltalake.PyDeltaTableError: Not a Delta table
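
To make the goal concrete, here is a minimal sketch of the selection I am after, applied to the list that dt.files() already returns. It assumes the partition values appear as hive-style key=value path segments, as in the listing above. The helper names partition_values and select_files are hypothetical (not part of deltalake), and the third sample path is made up to show the filter excluding another hour.

```python
from typing import Dict, List


def partition_values(path: str) -> Dict[str, str]:
    """Extract hive-style key=value segments from a file path."""
    values = {}
    for segment in path.split("/")[:-1]:  # skip the file name itself
        key, sep, value = segment.partition("=")
        if sep and key:
            values[key] = value
    return values


def select_files(files: List[str], **wanted: str) -> List[str]:
    """Keep only the files whose partition values match every key in `wanted`."""
    return [
        f for f in files
        if all(partition_values(f).get(k) == v for k, v in wanted.items())
    ]


# The first two paths come from the dt.files() listing above; the third is a
# hypothetical path for a different hour, included only for illustration.
files = [
    "s3://example_bucket/data/poc/y=2021/m=4/d=13/h=16/part-00001-8765abc67.parquet",
    "s3://example_bucket/data/poc/y=2021/m=4/d=13/h=16/part-00002-7643adc87.parquet",
    "s3://example_bucket/data/poc/y=2021/m=4/d=13/h=17/part-00003-hypothetical.parquet",
]

# Partition values are compared as strings, exactly as they appear in the path.
one_hour = select_files(files, y="2021", m="4", d="13", h="16")
print(len(one_hour))
```

Because the list comes from dt.files(), it only contains files that belong to the current table version, so the selected paths could in principle be handed to a Parquet reader instead of loading the whole table. Newer deltalake releases also document a partitions argument on to_pyarrow_table; I have not verified whether 0.4.5 supports it.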

How can I achieve this?

If the deltalake Python library can't be used to achieve this, what other tools or libraries should I try?

Rafiq
