I have TIFF images stored in tar files in HDFS. I can download a tar file and stream the images from it like this:
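import tarfile
import numpy as np
import cv2

# 'r|' opens the tar as a sequential, non-seekable stream
tar = tarfile.open("filename.tar", 'r|')
for tiff in tar:
    if tiff.isfile():
        a = tar.extractfile(tiff).read()
        na = np.frombuffer(a, dtype=np.uint8)
        im = cv2.imdecode(na, cv2.IMREAD_COLOR)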
This gives me a NumPy array for each image. Is there a way to stream the TIFF files directly from the tar files in HDFS, without downloading the tar file first?
Here is what I have:
import pyarrow as pa

fs = pa.hdfs.connect()
with fs.open(hdfs_path_to_tar_file, 'rb') as f:
    print(type(f))

Output:
<class 'pyarrow.lib.HdfsFile'>
I don't know how to read it with tarfile. I need to convert it to a bytes-type object that I can read with tarfile.open, but I don't want to read the whole file first. The tar files are huge, so I don't want to load them into memory: f.read() returns bytes but puts the whole thing in memory. Besides, tarfile.open couldn't read that either.
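
For reference, here is a minimal sketch of the behaviour I'm after. tarfile.open takes a fileobj argument, and in 'r|' mode it only reads from that object sequentially, so in principle any readable file-like object should do. The helper names iter_tar_members and make_demo_tar are made up for illustration, and an in-memory io.BytesIO stands in for the HDFS handle; I'm assuming, without having confirmed it, that pyarrow.lib.HdfsFile behaves like an ordinary file object here.

```python
import io
import tarfile


def iter_tar_members(fileobj):
    """Yield (name, data) for each regular file in a tar stream.

    Mode 'r|' reads the archive sequentially from fileobj without seeking,
    so only one member's bytes are read into memory at a time.
    """
    with tarfile.open(fileobj=fileobj, mode="r|") as tar:
        for member in tar:
            if member.isfile():
                # In stream mode a member must be read before moving on.
                yield member.name, tar.extractfile(member).read()


def make_demo_tar():
    """Build a small in-memory tar archive (a stand-in for the HDFS file)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, payload in [("a.tiff", b"first"), ("b.tiff", b"second")]:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    buf.seek(0)
    return buf


if __name__ == "__main__":
    for name, data in iter_tar_members(make_demo_tar()):
        print(name, len(data))
```

If the assumption holds, the handle f from fs.open(hdfs_path_to_tar_file, 'rb') would be passed in place of the BytesIO object, and each data value would go through np.frombuffer and cv2.imdecode as in the first snippet.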