I am using pyarrow's HDFS filesystem interface (pa.hdfs.connect). When I read n bytes, I often see 0%-300% more data sent over the network than I asked for. My suspicion is that pyarrow is reading ahead.

The pyarrow parquet reader doesn't have this behavior, and I am looking for a way to turn off read ahead for the general HDFS interface.

I am running on Ubuntu 14.04 with Python 2.7. The issue is present in pyarrow 0.10 through 0.13 (the newest released version).

I have been using wireshark to track the packets passed on the network.

I suspect read-ahead because the first read takes much longer than the second read.
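The timing comparison can be sketched roughly as follows. This is only an illustration: time_reads is a hypothetical helper, not part of pyarrow, and an in-memory io.BytesIO stands in for the HDFS file handle so the snippet runs without a cluster. With a real handle from fs.open(...), a first read that is much slower than the following ones would be consistent with read-ahead, although caching elsewhere in the stack could produce the same pattern.

```python
import io
import time


def time_reads(f, n_bytes, repeats=2):
    """Read n_bytes from f `repeats` times in a row.

    Returns a list of (elapsed_seconds, bytes_returned) tuples,
    one per read, in the order the reads were issued.
    """
    results = []
    for _ in range(repeats):
        start = time.time()
        data = f.read(n_bytes)
        results.append((time.time() - start, len(data)))
    return results


# Stand-in for an HDFS file handle (assumption: any object with .read works).
f = io.BytesIO(b"x" * 10000)
timings = time_reads(f, 3000)
for i, (elapsed, size) in enumerate(timings):
    print("read %d: %d bytes in %.6f s" % (i + 1, size, elapsed))
```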

The regular pyarrow reader

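import pyarrow as pa

# hostname is the HDFS namenode host
fs = pa.hdfs.connect(hostname)

file_path = 'dataset/train/piece0000'
f = fs.open(file_path)
f.seek(0)
n_bytes = 3000000
# Request exactly n_bytes; more than this shows up on the network
data = f.read(n_bytes)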

Parquet code without the same issue

import pyarrow.parquet as pq

parquet_path = 'dataset/train/parquet/part-22e3'
pf = fs.open(parquet_path)
pqf = pq.ParquetFile(pf)
# Read a single column from the first row group
data = pqf.read_row_group(0, columns=['col_name'])
Iva
  • It'd be better for you to open an issue on the Apache Arrow JIRA project so we can debug there. Since we are using libhdfs there is some behavior that isn't totally under our control – Wes McKinney May 16 '19 at 21:03
  • Hey, I have a ticket open (https://issues.apache.org/jira/browse/ARROW-5318) – Iva May 16 '19 at 21:24

1 Answer

Discussed in the JIRA ticket: https://issues.apache.org/jira/browse/ARROW-5432

A read_at function is being added to the pyarrow API. It will let you read a given number of bytes from a file at a given offset, with no read-ahead.
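A minimal sketch of how calling code might use this once it is available. Everything here is an assumption rather than confirmed by the ticket text above: read_range is a hypothetical helper, and the read_at(nbytes, offset) argument order is what I understand the file method to look like in later pyarrow releases, so check the documentation of the version you install. The helper falls back to seek plus read when the file object has no read_at, and that fallback does not avoid read-ahead. An io.BytesIO buffer stands in for the HDFS file so the snippet runs on its own.

```python
import io


def read_range(f, offset, length):
    """Return `length` bytes of f starting at `offset`.

    Uses f.read_at(length, offset) when the file object provides it
    (assumed signature), otherwise falls back to seek + read.
    """
    read_at = getattr(f, "read_at", None)
    if read_at is not None:
        return read_at(length, offset)
    # Fallback path: same bytes returned, but no control over read-ahead.
    f.seek(offset)
    return f.read(length)


# Stand-in for a file opened with fs.open(file_path).
buf = io.BytesIO(b"0123456789")
chunk = read_range(buf, 2, 4)
print(chunk)
```

With a real HDFS handle, the call would look like read_range(f, 0, 3000000) in place of the f.seek(0) and f.read(n_bytes) pair from the question.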

Iva