I need to process over 10 million spectroscopic data sets. The data are stored in around 1000 .fits files (.fits is an astronomical data storage format); each file contains around 600-1000 spectra, and each spectrum has around 4500 elements (so each file yields a ~1000*4500 matrix). That means each spectrum will be read around 10 times (or each file will be opened around 10,000 times) if I loop over the 10 million entries one by one. Although the same spectrum is read around 10 times, the work is not duplicated, because each time I extract a different segment of that spectrum.
I have a catalog file which contains all the information I need, such as the coordinates x, y, the radius r, the strength s, etc. The catalog also tells me which file to read (identified by n1, n2) and which spectrum within that file to use (identified by n3).
The code I have now is:
import numpy as np
from itertools import izip
import fitsio
x = []
y = []
r = []
s = []
n1 = []
n2 = []
n3 = []
with open('spectra_ID.dat') as file_ID, open('catalog.txt') as file_c:
    for line1, line2 in izip(file_ID, file_c):
        parts1 = line1.split()
        parts2 = line2.split()
        n1.append(parts1[0])
        n2.append(parts1[1])
        n3.append(int(parts1[2]))      # index of the spectrum within the file (must be an integer to slice with)
        x.append(float(parts2[0]))
        y.append(float(parts2[1]))
        r.append(float(parts2[2]))
        s.append(float(parts2[3]))
def data_analysis(idx_start, idx_end):   #### loop over 10 million entries
    data_stru = np.zeros((idx_end - idx_start), dtype=[('spec', 'f4', (200,)), ('x', 'f8'), ('y', 'f8'), ('r', 'f8'), ('s', 'f8')])
    for i in xrange(idx_start, idx_end):
        filename = "../../../data/" + str(n1[i]) + "/spPlate-" + str(n1[i]) + "-" + str(n2[i]) + ".fits"
        fits_spectra = fitsio.FITS(filename)
        fluxx = fits_spectra[0][n3[i]-1:n3[i], 0:4000]   #### returns a 1 x 4000 2-D array
        flux = fluxx[0]
        hdu = fits_spectra[0].read_header()
        wave_start = hdu['CRVAL1']
        logwave = wave_start + 0.0001 * np.arange(4000)
        wavegrid = np.power(10, logwave)
        ##### After I have the flux and the wavegrid, I do my analysis
        ##### and save the results to data_stru.
        ##### Reading is the most time-consuming part of this code; the later analysis is not.
The problem is that the files are too big to load all at once (there is not enough memory), and my catalog is not ordered so that all entries which open the same file are grouped together. I would like to split the single large loop into two loops: 1) an outer loop over the files, so that each file is opened and read only once, and 2) an inner loop over the catalog entries that use the current file. Can anyone offer some thoughts on how to restructure the code this way?
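Here is a rough sketch of the structure I have in mind, assuming the grouping dictionary itself (about 10 million integer indices) fits in memory; entries_by_file and data_analysis_grouped are just placeholder names, and I have dropped the data_stru bookkeeping for brevity:

from collections import defaultdict

# Grouping step: for every (n1, n2) file, collect the catalog indices that point into it.
entries_by_file = defaultdict(list)
for i in xrange(len(n1)):
    entries_by_file[(n1[i], n2[i])].append(i)

def data_analysis_grouped():
    # Loop 1: over files, so each .fits file is opened and its header read only once.
    for (f1, f2), indices in entries_by_file.iteritems():
        filename = "../../../data/" + str(f1) + "/spPlate-" + str(f1) + "-" + str(f2) + ".fits"
        fits_spectra = fitsio.FITS(filename)
        hdu = fits_spectra[0].read_header()
        wave_start = hdu['CRVAL1']
        logwave = wave_start + 0.0001 * np.arange(4000)
        wavegrid = np.power(10, logwave)
        # Loop 2: over the catalog entries that use this file.
        for i in indices:
            flux = fits_spectra[0][n3[i]-1:n3[i], 0:4000][0]
            # ... analysis using flux, wavegrid, x[i], y[i], r[i], s[i] goes here ...
        fits_spectra.close()

Is this a sensible way to organize it, or is there a better approach?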