Streaming and caching tabular data with fsspec, Parquet and PyArrow

Question

I’m trying to stream data from Parquet files stored in Dropbox (though it could be somewhere else: S3, Google Drive, etc.) and read it into pandas while caching it. For that, I’m trying to use fsspec for Python.

Following these instructions, this is what I’m trying right now:

from fsspec.implementations.arrow import ArrowFSWrapper
from fsspec.implementations.cached import CachingFileSystem
import pandas as pd

cfs = CachingFileSystem(target_protocol="http", cache_storage="cache_fs")
cfs_arrow = ArrowFSWrapper(cfs)

url = "https://www.dropbox.com/s/…./myfile.parquet?dl=0"
f = cfs_arrow.open(url, "rb")
df = pd.read_parquet(f)

but this raises the following error at cfs_arrow.open(url, "rb"):

AttributeError: type object 'HTTPFileSystem' has no attribute 'open_input_stream'

I’ve used fsspec CachingFileSystem before to stream hdf5 data from S3, so I presumed it would work out-of-the-box, but I’m probably doing something wrong.

Can someone help me with that, or suggest another way to stream my tabular data while keeping a cache for fast later access in the same session?

This should help: https://arrow.apache.org/docs/python/filesystems.html#using-fsspec-compatible-filesystems-with-arrow — 0x26res, Oct 27 '22 at 11:52

score 3 · Accepted Answer · answered Oct 27 '22 at 13:27

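The convenient way to open and pass a file-like object using fsspec alone would be:

import fsspec
import pandas as pd

# The "blockcache::" prefix chains a block-wise local cache in front of the HTTP file system.
with fsspec.open(
    "blockcache::https://www.dropbox.com/s/…./myfile.parquet?dl=0",
    blockcache={"cache_storage": "cache_fs"}
) as f:
    df = pd.read_parquet(f)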

Of course, instantiating your own filesystem instance is fine too. You may also be interested to know that there is a Dropbox backend for fsspec, which is useful for finding and manipulating files. There is also an fsspec.parquet module for optimising Parquet access when you need only some of the row groups or columns of the target.

score 1 · Answer 2 · answered Oct 27 '22 at 12:11

My understanding is that there are 2 types of file systems:

fsspec
arrow

You need an arrow file system if you are going to call pyarrow functions directly. If you have an fsspec file system (e.g. CachingFileSystem) and want to use it with pyarrow, you need to wrap your fsspec file system like this:

from pyarrow.fs import PyFileSystem, FSSpecHandler
# fs is an existing fsspec file system instance, e.g. a CachingFileSystem
pa_fs = PyFileSystem(FSSpecHandler(fs))

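ArrowFSWrapper goes the other way around: it wraps a pyarrow file system so that it can be used as an fsspec file system. Wrapping an fsspec file system with it, as in the question, fails because the wrapper calls pyarrow methods such as open_input_stream, which fsspec file systems do not have.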

But in your case, since you ultimately pass a file object (not a file system) to pd.read_parquet, you can use your fsspec file system (i.e. CachingFileSystem) directly.
