
I have a ~50 GB FITS file containing multiple HDUs, which all have the same format: a (1E5 x 1E6) array with 1E5 objects and 1E6 time stamps. The HDUs describe different physical properties such as flux, RA, DEC, etc. I want to read just 5 objects from each HDU (i.e. a (5 x 1E6) array).

Python 2.7, astropy 1.0.3, Linux x86_64

So far I have tried a lot of the suggestions I found, but nothing worked. My best approach is still:

import numpy as np
from astropy.io import fits

# the five objects I want to read out
obj_list = ['Star1', 'Star15', 'Star700', 'Star2000', 'Star5000']
dic = {}

with fits.open(fname, memmap=True, do_not_scale_image_data=True) as hdulist:

    # A special HDU 'OBJECTS', a (1E5 x 1) array, records which index in the
    # fits file corresponds to which object.

    # First, get the indices of the rows that describe the objects in the fits file (not necessarily in order!)
    ind_objs = np.in1d(hdulist['OBJECTS'].data, obj_list, assume_unique=True).nonzero()[0]  # indices of the candidates

    # Second, read out the 5 objects' time series
    dic['FLUX'] = hdulist['FLUX'].data[ind_objs]  # (5 x 1E6) array
    dic['RA'] = hdulist['RA'].data[ind_objs]      # (5 x 1E6) array
    dic['DEC'] = hdulist['DEC'].data[ind_objs]    # (5 x 1E6) array

This code works well and fast for FITS files up to ~20 GB, but runs out of memory for bigger ones (larger files only contain more objects, not more time stamps). I do not understand why: astropy.io.fits intrinsically uses mmap and, as far as I understand, should only load the (5 x 1E6) arrays into memory. So independently of the file size, what I want to read out always has the same size.
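For illustration, this is how that expectation can be checked inside the same `with` block (a minimal sketch; `flux` and `subset` are names used only in this snippet, not in the code above):

    # Minimal check: the memmap itself holds no data in RAM, and fancy indexing
    # should return an ordinary in-memory copy of only the selected rows.
    flux = hdulist['FLUX'].data      # numpy memmap backed by the file on disk
    subset = flux[ind_objs]          # plain numpy array of shape (5, 1E6)
    print(type(flux))                # numpy.memmap
    print(type(subset))              # numpy.ndarray, a small in-memory copy
    print(subset.nbytes)             # a few tens of MB, independent of the file size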

Edit - This is the error message:

  dic['RA'] = hdulist['RA'].data[ind_objs] # (5 x 1E6) array
File "/usr/local/python/lib/python2.7/site-packages/astropy-1.0.3-py2.7-linux-x86_64.egg/astropy/utils/decorators.py", line 341, in __get__
  val = self._fget(obj)
File "/usr/local/python/lib/python2.7/site-packages/astropy-1.0.3-py2.7-linux-x86_64.egg/astropy/io/fits/hdu/image.py", line 239, in data
  data = self._get_scaled_image_data(self._data_offset, self.shape)
File "/usr/local/python/lib/python2.7/site-packages/astropy-1.0.3-py2.7-linux-x86_64.egg/astropy/io/fits/hdu/image.py", line 585, in _get_scaled_image_data
  raw_data = self._get_raw_data(shape, code, offset)
File "/usr/local/python/lib/python2.7/site-packages/astropy-1.0.3-py2.7-linux-x86_64.egg/astropy/io/fits/hdu/base.py", line 523, in _get_raw_data
  return self._file.readarray(offset=offset, dtype=code, shape=shape)
File "/usr/local/python/lib/python2.7/site-packages/astropy-1.0.3-py2.7-linux-x86_64.egg/astropy/io/fits/file.py", line 248, in readarray
  shape=shape).view(np.ndarray)
File "/usr/local/python/lib/python2.7/site-packages/numpy/core/memmap.py", line 254, in __new__
  mm = mmap.mmap(fid.fileno(), bytes, access=acc, offset=start)
mmap.error: [Errno 12] Cannot allocate memory

Edit 2: Thanks, I have now included the suggestions; this enables me to handle FITS files of up to ~50 GB. New code:

# the five objects I want to read out
obj_list = ['Star1', 'Star15', 'Star700', 'Star2000', 'Star5000']
dic = {}

with fits.open(fname, mode='denywrite', memmap=True, do_not_scale_image_data=True) as hdulist:

    # A special HDU 'OBJECTS', a (1E5 x 1) array, records which index in the
    # fits file corresponds to which object.

    # First, get the indices of the rows that describe the objects in the fits file (not necessarily in order!)
    ind_objs = np.in1d(hdulist['OBJECTS'].data, obj_list, assume_unique=True).nonzero()[0]  # indices of the candidates

    # Second, read out the 5 objects' time series
    dic['FLUX'] = hdulist['FLUX'].data[ind_objs]  # (5 x 1E6) array
    del hdulist['FLUX'].data
    dic['RA'] = hdulist['RA'].data[ind_objs]      # (5 x 1E6) array
    del hdulist['RA'].data
    dic['DEC'] = hdulist['DEC'].data[ind_objs]    # (5 x 1E6) array
    del hdulist['DEC'].data

`mode='denywrite'` did not cause any change.

`memmap=True` is indeed not the default and needs to be set manually.

`del hdulist['FLUX'].data` etc. now allows me to read 50 GB files instead of 20 GB files (see the consolidated loop below).
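The same pattern can be written more compactly as a loop over the HDU names (a sketch; it assumes the same `with fits.open(...) as hdulist:` block and the same `ind_objs` as in the code above):

    # inside the same 'with fits.open(...) as hdulist:' block as above
    for name in ['FLUX', 'RA', 'DEC']:
        dic[name] = hdulist[name].data[ind_objs]  # copy only the 5 selected rows
        del hdulist[name].data                    # drop the memmap before the next HDU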

New issue: anything larger than ~50 GB still causes the same memory error; now, however, it occurs directly at the first line:

dic['FLUX'] = hdulist['FLUX'].data[ind_objs] # (5 x 1E6) array
MaxG
  • At what point/line does it run out of memory? –  Mar 02 '16 at 23:23
  • Note that the [documentation says](http://astropy.readthedocs.org/en/stable/io/fits/index.html#working-with-large-files): "The open() function supports a memmap=True argument that allows the array data of each HDU to be accessed with mmap, rather than being read into memory all at once". So it doesn't appear to support it automatically (the `memmap` default value is `None`); you'll need to supply it in `fits.open(filename, memmap=True)`. –  Mar 02 '16 at 23:28
  • @Evert Thanks a lot, I edited my post to include the error message. It breaks after successfully reading the first HDU(s) (for smaller files it reads more HDUs and breaks later). Thanks for pointing out that memmap's default is None; other [astropy docs](http://docs.astropy.org/en/stable/io/fits/appendix/faq.html) actually say it is True by default, but I could confirm your point (unfortunately it did not solve the issue though). – MaxG Mar 03 '16 at 00:55
  • Just a note on the memmap default, because that could be a bug in the docs: the FAQ you point to has the following about [opening large files](http://docs.astropy.org/en/stable/io/fits/appendix/faq.html#how-do-i-open-a-very-large-image-that-won-t-fit-in-memory): "To ensure use of memory mapping, just add the memmap=True argument to fits.open". Could you point me to the line or quote it where you find that the default is to use memmapping? Or have you set `USE_MEMMAP` somewhere? –  Mar 03 '16 at 01:27
  • Alternatively on that memmap default, perhaps you read the sentence "In PyFITS, prior to version 3.1, when the data portion of an HDU is accessed, the data is read into memory in its entirety." as meaning that since 3.1, data is not read fully into memory (but memmapped). I interpret it, however, as saying that the default behaviour hasn't changed, but (as per the next paragraph) that a `memmap` argument has been added since 3.1 (with a default of `None` to enable the old behaviour). Perhaps that's your confusion about the `memmap` default? –  Mar 03 '16 at 01:29
  • At a guess to the cause: numpy does require the whole array (`hdulist['RA'].data[ind_objs]`) to be in memory before it can index it. Is the size of a single HDU something that could fit into memory on your machine? (It seems so, as it appears to work for the FLUX HDU.) Because then it feels that the previous data is not released/flushed (which I'd think should happen when memmapped, but I'm definitely no expert there). –  Mar 03 '16 at 01:42
  • The second half of this [io.fits FAQ](http://docs.astropy.org/en/stable/io/fits/appendix/faq.html#i-m-opening-many-fits-files-in-a-loop-and-getting-oserror-too-many-open-files) may have something that could work in your case (even though the FAQ is related to another problem): make sure the `.data` attribute is gone before moving to the next HDU. –  Mar 03 '16 at 01:42
  • @Evert Thanks a lot, I tried the 'del hdulist['FLUX'].data' etc. before, but never in combination with the other adjustments I now have. This allows me to now read a line from a fits file that is up to 50GB, but still struggling to read a line from a 100GB fits file... (see edit 2) – MaxG Mar 03 '16 at 11:02
  • It'll crash in the first line now, because that particular HDU is now simply too big to fit into memory at once; which, as stated earlier, is the requirement for numpy to index it. Since astropy.io.fits uses numpy under the hood, you'd have to wait until numpy allows indexing on a file directly. Which may never happen. As per Iguananaut's remark: HDF5 does allow indexing into a file on disk, and is probably capable of reading your file properly. (Unfortunately, I'm guessing a simple conversion from FITS to HDF5 is impossible, as the FITS HDUs will have to be, again, read at once.) –  Mar 03 '16 at 12:36

1 Answer


It looks like you've run across this issue: https://github.com/astropy/astropy/issues/1380

The problem here is that even though it's using mmap, it's using mmap in copy-on-write mode, which means your system needs to be able to allocate a virtual memory area large enough to hold, in principle, as much data as the size of the mmap, in case you write data back to it.

If you pass `mode='denywrite'` to `fits.open()` it should work. Any attempt to modify the array will result in an error, but that's fine if all you want to do is read the data.
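In terms of the code from the question, that looks like the following (a sketch only; `fname`, `ind_objs` and the HDU names are taken from the question, and I have not tested this against the actual file):

    with fits.open(fname, mode='denywrite', memmap=True,
                   do_not_scale_image_data=True) as hdulist:
        flux = hdulist['FLUX'].data     # read-only memmap, no copy-on-write reservation
        subset = flux[ind_objs]         # copying rows out still works fine
        # flux[0, 0] = 0.0              # writing back would now raise an error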

If you still can't get that to work, you might also try the `fitsio` module, which has better support for reading files in smaller chunks.
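A rough sketch of the same extraction with `fitsio`, which reads only the requested slice from disk instead of mmapping the whole HDU (the HDU and object names are the ones from the question; this is untested against the real file and assumes the HDUs are image extensions, as they appear to be):

    import numpy as np
    import fitsio

    f = fitsio.FITS(fname)

    # OBJECTS maps row indices to object names, as in the question
    objects = f['OBJECTS'].read()
    ind_objs = np.in1d(objects, obj_list, assume_unique=True).nonzero()[0]

    dic = {}
    for name in ['FLUX', 'RA', 'DEC']:
        hdu = f[name]
        # read one (1 x 1E6) row at a time and stack the five of them
        dic[name] = np.vstack([hdu[i:i + 1, :] for i in ind_objs.tolist()])

    f.close()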

Iguananaut
  • All that said, who's making these FITS files? FITS is not well suited for huge files, and every time someone tries to use it for such they seem to run into problems, even with CFITSIO. Part of the problem is that FITS is not well suited for manipulating data in large files. Please consider using something like HDF5 instead (a rough conversion sketch follows these comments). – Iguananaut Mar 03 '16 at 08:30
  • You should also update to Astropy 1.1.x, which I think has slightly better handling in cases like this. – Iguananaut Mar 03 '16 at 08:32
  • Thanks for your comments. I tried that before, but mode='denywrite' did not solve the memory problem; I now include it in my edited question, just to be complete. I indeed only want to read the lines from the fits file, not manipulate them. Also, thanks, I'll look into the astropy update. – MaxG Mar 03 '16 at 10:59
  • If `mode='denywrite'` doesn't work, what about `mode='update'`, out of curiosity? If that still doesn't work I don't know why you would be getting a memory error from mmap. In any case all the other details are a distraction. For whatever reason you can't mmap more than some limit on your system. Could be a variety of reasons for that. You might consider instead reading your data in smaller chunks at a time, which `fitsio` is better at, or, as I suggested earlier, not using FITS for such large data. – Iguananaut Mar 03 '16 at 13:12
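On the HDF5 suggestion from the comments: one way a conversion could look is to copy each large HDU into an HDF5 dataset in row chunks, so that only a small slab is ever held in memory. This is only a sketch under assumptions not stated in the thread (h5py is installed, the HDUs are image extensions, and astropy's `section` attribute is used to read slabs without loading the full array); the output file name and chunk size are arbitrary choices:

    import h5py
    import numpy as np
    from astropy.io import fits

    chunk = 1000  # rows copied per read, an arbitrary choice

    with fits.open(fname, do_not_scale_image_data=True) as hdulist, \
         h5py.File('lightcurves.h5', 'w') as h5:   # hypothetical output file
        for name in ['FLUX', 'RA', 'DEC']:
            hdu = hdulist[name]
            nrows, ncols = hdu.shape
            first = hdu.section[0:chunk, :]        # first slab, also provides the dtype
            dset = h5.create_dataset(name, shape=(nrows, ncols), dtype=first.dtype)
            dset[0:first.shape[0], :] = first
            for i0 in range(chunk, nrows, chunk):
                i1 = min(i0 + chunk, nrows)
                dset[i0:i1, :] = hdu.section[i0:i1, :]

After that, the five rows could be pulled from each HDF5 dataset without loading it fully, since h5py supports indexing a dataset with a sorted list of row indices.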