I’m trying to stream data from Parquet files stored in Dropbox (though they could live somewhere else: S3, Google Drive, etc.), read them into pandas, and cache them along the way. For that I’m trying to use fsspec for Python.
Following these instructions, this is what I’m trying right now:
from fsspec.implementations.arrow import ArrowFSWrapper
from fsspec.implementations.cached import CachingFileSystem
import pandas as pd
cfs = CachingFileSystem(target_protocol="http", cache_storage="cache_fs")
cfs_arrow = ArrowFSWrapper(cfs)
url = "https://www.dropbox.com/s/…./myfile.parquet?dl=0"
f = cfs_arrow.open(url, "rb")
df = pd.read_parquet(f)
but this raises the following error at cfs_arrow.open(url, "rb"):
AttributeError: type object 'HTTPFileSystem' has no attribute 'open_input_stream'
I’ve used fsspec’s CachingFileSystem before to stream HDF5 data from S3, so I presumed it would work out of the box, but I’m probably doing something wrong.
Can someone help me with that? Or does anyone have other suggestions for streaming my tabular data while keeping a cache for fast later access in the same session?