
I need to extract information from a few columns in ~20k different .fits files with Python (.fits is a tabular file format often used in astrophysics). Each file is relatively small, ~0.2MB. I have been doing this so far with a loop and astropy like this:

import numpy as np
from astropy.io import fits

data = []
for file_name in fits_files_list:
    with fits.open(file_name, memmap=False) as hdulist:
        # HDU 1 holds the spectrum (loglam, flux), HDU 2 the redshift
        lam = np.around(10**hdulist[1].data['loglam'], 4)
        flux = np.around(hdulist[1].data['flux'], 4)
        z = np.around(hdulist[2].data['z'], 4)
    data.append([lam, flux, z])

This takes ~2.5 hours for the 20k fits files, and from time to time I need to loop through the files again for other reasons. I am running this loop in a Google Colab notebook with my files stored in my Google Drive.

So my question is: can I minimize the looping time? Do you know of other packages besides astropy that would help with that? Or can I change my algorithm to make it run faster, e.g. somehow vectorize the loop? Or is there software that can quickly stack 20k fits files into a single fits file (TOPCAT has no function that does this for more than 2 files)? Thanks!
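
For reference, this is roughly what I imagine the stacking would look like with `astropy.table` (just an untested sketch; the output file name is made up and I don't know how fast `vstack` would be on 20k tables):

from astropy.table import Table, vstack

# Read HDU 1 of every file, stack the rows into one big table,
# then write everything out once as a single FITS file
tables = [Table.read(f, hdu=1) for f in fits_files_list]
combined = vstack(tables)
combined.write('all_spectra.fits', overwrite=True)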

NeStack
  • You will still need to open and read each file, but you can do it in a parallel manner; this way you'd probably be bound by your I/O bandwidth. You can do it by hand, using [python threads](https://docs.python.org/3/library/threading.html), or you can use higher-level frameworks for distributed and parallel computation, like [PySpark](http://spark.apache.org/docs/latest/api/python/) (see the thread sketch after these comments). – Aivean Nov 09 '21 at 23:12
  • With such an execution time, it is pretty clear that `fits.open(file_name, memmap=False)` is the slow operation, or you may run out of memory (swapping), or the generated output is huge although the input is small. There is not much to do apart from optimizing the library. The disk bandwidth is ridiculously small (<1MB/s) and so clearly not saturated. You can [open a performance issue](https://github.com/astropy/astropy/issues) on the library github page and/or possibly help them to improve it. Alternatively you can use another library if any... – Jérôme Richard Nov 10 '21 at 00:27
  • @JérômeRichard I'm the former maintainer of this library, and `fits.open(file_name, memmap=False)` is most certainly *not* the slow operation here in and of itself, since it does next to nothing. Most access to the file is very lazy. The slow part would have to be actually reading the data, which could still be I/O-bottlenecked, especially if the files are on some distributed filesystem. But it's hard to say. – Iguananaut Nov 10 '21 at 09:56
  • 20k files in ~2.5 hours is about half a second per file, which does seem rather slow for such small files. Is there anything else to know about the files? What does a typical header look like, and is it using compression? In some cases the [fitsio](https://github.com/esheldon/fitsio) package has better performance though. – Iguananaut Nov 10 '21 at 09:57
  • Ok. The fact that the files could be on a slow NFS or distributed FS is a good point. What are the timings with `with open(file_name, 'rb') as f: f.read()`? – Jérôme Richard Nov 10 '21 at 11:39
  • @Iguananaut Thanks for answering! The fits files contain spectral data of QSOs from the SDSS database; here is one of the files, and all of them have virtually the same format: https://drive.google.com/file/d/1qMF-HV3_zDmKo-Zt-n8gVSbmlhQKemIe/view?usp=sharing I don't know if it matters, but I am coding in a Google Colab environment, not utilizing the provided GPU or TPU, because with them the code runs even slower. For that, the files are stored in my Google Drive. I tried using fitsio, but the time performance was no better. Any new ideas? – NeStack Nov 10 '21 at 12:33
  • Try copying the files from Google Drive to local storage on the machine where they are processed. You can possibly store them in a TMPFS (in `/tmp`) or a RAMFS to avoid very slow fetches over the network. – Jérôme Richard Nov 10 '21 at 17:25
  • @JérômeRichard Thank you! Is the loop expected to be slower in Google Colab than locally on my computer? Is there a reason for that? – NeStack Nov 10 '21 at 19:14
  • @NeStack, you should edit the post to mention the Colab environment. Yes, Colab is slower than your local machine, because they give you [a slow virtual CPU](https://stackoverflow.com/questions/47805170/whats-the-hardware-spec-for-google-colaboratory), and, if you're accessing files on a mounted Google Drive, the files are transferred over the network first, which is slower than reading them from the local drive. – Aivean Nov 10 '21 at 20:26
  • If possible, store your files on Google Drive in one or several compressed archives and, before processing, download and decompress them to the local drive (see the archive sketch after these comments). – Aivean Nov 10 '21 at 20:29
  • @Aivean Thanks for the explanation! I also included the Colab info in my question. Too bad that it is Colab that slows down the loop; I am using an institutional Gmail account with unlimited storage that is otherwise very convenient. – NeStack Nov 10 '21 at 22:49
  • @NeStack, there is a simple test you can do. Try copying the files from the mounted Google Drive to a local folder (in Colab) using `rsync` (`!rsync --progress /content/drive/MyDrive/source_dir /content/target_dir/`) and see how long it takes. – Aivean Nov 10 '21 at 22:55

0 Answers