I have a fits file with ~50GB, containing multiple HDUs, which all have the same format: an (1E5 x 1E6) array with 1E5 objects and 1E6 time stamps. The HDUs describe different physical properties such as Flux, RA, DEC etc. I want to read just 5 objects from each HDU (i.e. an (5 x 1E6) array).
python 2.7, astropy 1.0.3, linux x86_64
I so far tried a lot of suggestions I found, but nothing worked. My best approach is still:
#the five objects I want to read out
obj_list = ['Star1','Star15','Star700','Star2000','Star5000']
dic = {}
with fits.open(fname, memmap=True, do_not_scale_image_data=True) as hdulist:
# There is a special HDU 'OBJECTS' which is an (1E5 x 1) array and contains the info which index in the fits file corresponds to which object.
# First, get the indices of the rows that describe the objects in the fits file (not necessarily in order!)
ind_objs = np.in1d(hdulist['OBJECTS'].data, obj_list, assume_unique=True).nonzero()[0] #indices of the candidates
# Second, read out the 5 object's time series
dic['FLUX'] = hdulist['FLUX'].data[ind_objs] # (5 x 1E6) array
dic['RA'] = hdulist['RA'].data[ind_objs] # (5 x 1E6) array
dic['DEC'] = hdulist['DEC'].data[ind_objs] # (5 x 1E6) array
This code works well and fast for fits files up to ~20 GB, but runs out of memory for bigger ones (larger files only contain more objects, not more time stamps). I do not understand why though - astropy.io.fits intrinsically uses mmap and should only load the (5x1E6) arrays into memory as far as I understand? So independent on the file size, what I want to read out always has the same size.
Edit - This is the error message:
dic['RA'] = hdulist['RA'].data[ind_objs] # (5 x 1E6) array
File "/usr/local/python/lib/python2.7/site-packages/astropy-1.0.3-py2.7-linux-x86_64.egg/astropy/utils/decorators.py", line 341, in __get__
val = self._fget(obj)
File "/usr/local/python/lib/python2.7/site-packages/astropy-1.0.3-py2.7-linux-x86_64.egg/astropy/io/fits/hdu/image.py", line 239, in data
data = self._get_scaled_image_data(self._data_offset, self.shape)
File "/usr/local/python/lib/python2.7/site-packages/astropy-1.0.3-py2.7-linux-x86_64.egg/astropy/io/fits/hdu/image.py", line 585, in _get_scaled_image_data
raw_data = self._get_raw_data(shape, code, offset)
File "/usr/local/python/lib/python2.7/site-packages/astropy-1.0.3-py2.7-linux-x86_64.egg/astropy/io/fits/hdu/base.py", line 523, in _get_raw_data
return self._file.readarray(offset=offset, dtype=code, shape=shape)
File "/usr/local/python/lib/python2.7/site-packages/astropy-1.0.3-py2.7-linux-x86_64.egg/astropy/io/fits/file.py", line 248, in readarray
shape=shape).view(np.ndarray)
File "/usr/local/python/lib/python2.7/site-packages/numpy/core/memmap.py", line 254, in __new__
mm = mmap.mmap(fid.fileno(), bytes, access=acc, offset=start)
mmap.error: [Errno 12] Cannot allocate memory
Edit 2: Thanks, I now included the suggestions, it enables me to handle a fits file up to 50GB. New code:
#the five objects I want to read out
obj_list = ['Star1','Star15','Star700','Star2000','Star5000']
dic = {}
with fits.open(fname, mode='denywrite', memmap=True, do_not_scale_image_data=True) as hdulist:
# There is a special HDU 'OBJECTS' which is an (1E5 x 1) array and contains the info which index in the fits file corresponds to which object.
# First, get the indices of the rows that describe the objects in the fits file (not necessarily in order!)
ind_objs = np.in1d(hdulist['OBJECTS'].data, obj_list, assume_unique=True).nonzero()[0] #indices of the candidates
# Second, read out the 5 object's time series
dic['FLUX'] = hdulist['FLUX'].data[ind_objs] # (5 x 1E6) array
del hdulist['FLUX'].data
dic['RA'] = hdulist['RA'].data[ind_objs] # (5 x 1E6) array
del hdulist['RA'].data
dic['DEC'] = hdulist['DEC'].data[ind_objs] # (5 x 1E6) array
del hdulist['DEC'].data
The
mode='denywrite'
did not cause any change.
memmap=True
is indeed not the default and needs to be set manually.
del hdulist['FLUX'].data
etc now allows me to read 50GB instead of 20GB files
New issue: Anything larger than 50GB still causes the same memory error - now, however, directly in the first line.
dic['FLUX'] = hdulist['FLUX'].data[ind_objs] # (5 x 1E6) array