
I have a large number of compressed HDF files that I need to read:

file1.HDF.gz
file2.HDF.gz
file3.HDF.gz
...

I can read uncompressed HDF files with the following method:

from pyhdf.SD import SD, SDC
import os

# Decompress to an uncompressed copy on disk, then open that copy
os.system('gunzip < file1.HDF.gz > file1.HDF')
HDF = SD('file1.HDF')

and repeat this for each file. However, this is more time consuming than I want.
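
The full loop I use is roughly the following (a sketch, assuming all the files sit in the working directory):

import glob
import os

from pyhdf.SD import SD

for gz_path in sorted(glob.glob('*.HDF.gz')):
    hdf_path = gz_path[:-3]                              # strip the .gz suffix
    os.system('gunzip < %s > %s' % (gz_path, hdf_path))  # write an uncompressed copy
    hdf = SD(hdf_path)
    # ... read datasets from hdf ...
    hdf.end()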

I'm thinking it's possible that most of the time overhead comes from writing the compressed file out to a new uncompressed version, and that I could speed things up if I could read a decompressed version of the file into the SD function in one step.

Am I correct in this thinking? And if so, is there a way to do what I want?

hm8
  • That's awkward. The correct usage would be transparent compression within HDF (so you don't have to care about it during writing and reading)! The setup you describe is usable for archival only (as the compression is an extra layer HDF does not know about). You did not specify your use case, but in some cases (e.g. you want to read these files many times): transform each to a new HDF with compression turned on, or just decompress, if memory is not a problem (see the conversion sketch after these comments)! **Remark:** Python also supports many decompression tools without needing a file-based pipeline. – sascha Aug 28 '17 at 20:36
  • One would really have to look at the details of `pyhdf` to have a good answer here -- one can get a file-like object corresponding to a gzipped stream in Python, but would need to know if a file-like object is good enough or if the pyhdf library requires a real file (or, worse, a filename so it can open the file itself). – Charles Duffy Aug 28 '17 at 20:41
  • (even if it really does want a filename, one could play tricks with FIFOs *if* pyhdf doesn't need its input files to be seekable, but again, that's a bit of investigation one would have to do into the details of that library's implementation). – Charles Duffy Aug 28 '17 at 20:42
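
A sketch of the one-time conversion sascha suggests, using pyhdf's built-in DEFLATE compression (hedged: dataset attributes and dimension scales are not copied, and the compression level of 6 is an arbitrary choice):

from pyhdf.SD import SD, SDC

def convert_to_compressed_hdf(src_path, dst_path):
    """Copy every dataset into a new HDF file with internal DEFLATE compression."""
    src = SD(src_path)
    dst = SD(dst_path, SDC.WRITE | SDC.CREATE)
    for name in src.datasets():
        sds_in = src.select(name)
        _, _, dims, dtype, _ = sds_in.info()
        if not isinstance(dims, list):
            dims = [dims]                         # defensive: may be a scalar for rank-1 datasets
        sds_out = dst.create(name, dtype, dims)
        sds_out.setcompress(SDC.COMP_DEFLATE, 6)  # gzip-style compression inside HDF
        sds_out[:] = sds_in[:]
        sds_in.endaccess()
        sds_out.endaccess()
    src.end()
    dst.end()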

2 Answers


According to the pyhdf package documentation, this is not possible.

__init__(self, path, mode=1)
  SD constructor. Initialize an SD interface on an HDF file,
  creating the file if necessary.

There is no way to instantiate an SD object from a file-like object; the constructor only accepts a path. This is likely because pyhdf conforms to an external interface (NCSA HDF). The HDF format is also normally used for massive files that are impractical to hold in memory all at once.

Unzipping to a file on disk is likely your most performant option.

If you would like to stay in Python, use the gzip module (docs):

import gzip
import shutil

# Stream-decompress to disk without loading the whole file into memory
with gzip.open('file1.HDF.gz', 'rb') as f_in, open('file1.HDF', 'wb') as f_out:
    shutil.copyfileobj(f_in, f_out)
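
For completeness, a sketch of the full round trip, removing the temporary copy afterwards (hdf.end() closes the SD interface; error handling is omitted):

import gzip
import os
import shutil

from pyhdf.SD import SD

with gzip.open('file1.HDF.gz', 'rb') as f_in, open('file1.HDF', 'wb') as f_out:
    shutil.copyfileobj(f_in, f_out)

hdf = SD('file1.HDF')    # open the decompressed copy
# ... read the datasets you need ...
hdf.end()                # close the HDF interface
os.remove('file1.HDF')   # delete the temporary file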
Bruce Wilis

sascha is correct that HDF's transparent compression is more appropriate than gzipping. Nonetheless, if you can't control how the HDF files are stored, you're looking for Python's gzip module (docs); it can read the data from these files.
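
For example, a minimal sketch of getting at the decompressed bytes without writing a file (though, per the comments below, pyhdf cannot consume the resulting file-like object directly):

import gzip

# gzip.open returns a file-like object over the decompressed stream
with gzip.open('file1.HDF.gz', 'rb') as f:
    data = f.read()   # the raw decompressed HDF bytes, held in memory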

chicocvenancio
  • Can you give me an example of how to use the gzip module in this case? – hm8 Aug 28 '17 at 20:42
  • An answer is expected to *answer the question*, not point someone to where they can find an answer. Links should be supplemental, not core to the answer itself. – Charles Duffy Aug 28 '17 at 20:43
  • More to the point -- if the `gzip` module returns a file-like object, then the answer is only an acceptable one if the pyhdf library can actually use that object. This is a fact-intensive investigation, and an answer that hasn't had code written presumably hasn't had such investigation performed. – Charles Duffy Aug 28 '17 at 20:44
  • As @kevin-mcdonough demonstrated above, the HDF C API does not make it simple to pass Python file-like objects to it, and both `pyhdf` and `pytables` do not allow it at this time. Sorry for not noticing this before posting. – chicocvenancio Aug 28 '17 at 21:18