I am using pyarrow's HdfsFilesystem interface. When I call a read on n bytes, I often get 0%-300% more data sent over the network. My suspicion is that pyarrow is reading ahead.
The pyarrow parquet reader doesn't have this behavior, and I am looking for a way to turn off read ahead for the general HDFS interface.
I am running on ubuntu 14.04. This issue is present in pyarrow 0.10 - 0.13 (newest released version). I am on python 2.7
I have been using wireshark to track the packets passed on the network.
I suspect it is read ahead since the time for the 1st read is much greater than the time for 2nd read.
The regular pyarrow reader
import pyarrow as pa
fs = pa.hdfs.connect(hostname)
file_path = 'dataset/train/piece0000'
f = fs.open(file_path)
f.seek(0)
n_bytes = 3000000
f.read(n_bytes)
Parquet code without the same issue
parquet_file = 'dataset/train/parquet/part-22e3'
pf = fs.open(parquet_path)
pqf = pa.parquet.ParquetFile(pf)
data = pqf.read_row_group(0, columns=['col_name'])