I need to process over 10 million spectroscopic data sets. The data is structured like this: there are around 1000 .fits files (.fits is a data storage format), each containing around 600-1000 spectra, with around 4500 elements per spectrum (so each file yields a roughly 1000*4500 matrix). If I loop over the 10 million catalog entries, each spectrum will be read around 10 times (equivalently, each file will be opened around 10,000 times). Although the same spectrum is read around 10 times, the work is not duplicated, because each time I extract a different segment of the spectrum.

I have a catalog file which contains all the information I need, such as the coordinates x, y, the radius r, the strength s, etc. The catalog also tells me which file to read (identified by n1, n2) and which spectrum within that file to use (identified by n3).

The code I have now is:

import numpy as np
from itertools import izip
import fitsio

x = []
y = []
r = []
s = []
n1 = []
n2 = []
n3 = []
with open('spectra_ID.dat') as file_ID, open('catalog.txt') as file_c:
    for line1, line2 in izip(file_ID,file_c):
        parts1 = line1.split()
        parts2 = line2.split()
        n1.append(parts1[0])
        n2.append(parts1[1])
        n3.append(int(parts1[2]))   # spectrum index, must be an integer
        x.append(float(parts2[0]))         
        y.append(float(parts2[1]))        
        r.append(float(parts2[2]))
        s.append(float(parts2[3]))  

def data_analysis(idx_start,idx_end):  #### loop over 10 million entries
    data_stru = np.zeros((idx_end-idx_start), dtype=[('spec','f4',(200)),('x','f8'),('y','f8'),('r','f8'),('s','f8')])

    for i in xrange(idx_start, idx_end):
        filename = "../../../data/" + str(n1[i]) + "/spPlate-" + str(n1[i]) + "-" + str(n2[i]) + ".fits"
        fits_spectra = fitsio.FITS(filename)
        fluxx = fits_spectra[0][n3[i]-1:n3[i],0:4000]  #### returns a 1-row 2-D array
        flux = fluxx[0]                                #### take the single row
        hdu = fits_spectra[0].read_header()
        wave_start = hdu['CRVAL1']
        logwave = wave_start + 0.0001 * np.arange(4000)
        wavegrid = np.power(10,logwave)

    ##### After I have read the flux and the wavegrid, I can do my subsequent analysis.

    ##### save data to data_stru

    ##### Reading is the most time-consuming part of this code; my later analysis is not time-consuming.

The problem is that the files are too big for there to be enough memory to load them all at once, and my catalog is not ordered so that all entries which open the same file are grouped together. I wonder whether anyone can offer some thoughts on how to split the large loop into two loops: 1) an outer loop over the files, so that we avoid repeatedly opening/reading the same files again and again, and 2) an inner loop over the entries which use the same file.

Huanian Zhang

1 Answer


If I understand your code correctly, n1 and n2 determine which file to open, so why not just lexsort on them? You can then use itertools.groupby to group records with the same (n1, n2). Here is a down-scaled proof of concept:

import itertools
import numpy as np

n1 = np.random.randint(0, 3, (10,))
n2 = np.random.randint(0, 3, (10,))
mockdata = np.arange(10)+100

s = np.lexsort((n2, n1))

for k, g in itertools.groupby(zip(s, n1[s], n2[s]), lambda x: x[1:]):
    # groupby groups the items i of its first argument
    # (zip(...) in this case) by the result of applying the
    # optional second argument (here a lambda) to i.
    # Here the lambda strips si from the tuple (si, n1si, n2si)
    # that zip produces, because otherwise equal (n1si, n2si)
    # pairs would still be treated as different on account of
    # their distinct si's, and no grouping would occur.
    # Putting si in there in the first place is necessary so we
    # can reference the other records of the corresponding row
    # in the inner loop.
    print(k)
    for si, n1s, n2s in g:
        # si can be used to access the corresponding other records
        print(si, mockdata[si])

Prints something like:

(0, 1)
4 104
(0, 2)
0 100
2 102
6 106
(1, 0)
1 101
(2, 0)
8 108
9 109
(2, 1)
3 103
5 105
7 107

You may want to include n3 in the lexsort, but not in the grouping, so that you can process each file's contents in order.
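
For illustration, here is a minimal sketch of how this grouping might be wired into the reading loop from the question. It assumes n1, n2 and n3 have been loaded into NumPy arrays (n3 as the 1-based spectrum index), reuses the file-name pattern from the question, and leaves the per-entry analysis as a placeholder:

import itertools
import numpy as np
import fitsio

# n1, n2 identify the file, n3 the spectrum within it (1-based), as in the question.
n1 = np.asarray(n1)
n2 = np.asarray(n2)
n3 = np.asarray(n3, dtype=int)

order = np.lexsort((n3, n2, n1))   # entries using the same file become adjacent

for (f1, f2), group in itertools.groupby(zip(order, n1[order], n2[order]),
                                         lambda t: t[1:]):
    filename = "../../../data/{0}/spPlate-{0}-{1}.fits".format(f1, f2)
    fits_spectra = fitsio.FITS(filename)        # open each file exactly once
    hdu = fits_spectra[0].read_header()
    wave_start = hdu['CRVAL1']
    wavegrid = np.power(10, wave_start + 0.0001 * np.arange(4000))

    for idx, _, _ in group:                     # all catalog rows that use this file
        flux = fits_spectra[0][n3[idx]-1:n3[idx], 0:4000][0]
        # ... analysis using flux, wavegrid, x[idx], y[idx], r[idx], s[idx] ...

    fits_spectra.close()

Results can be written into a structured array like data_stru indexed by idx, so the output stays aligned with the original catalog order.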

Paul Panzer
  • I kind of understand your code. The first loop is used to read the files, and the second loop goes over all the entries which belong to the same file, right? I shall use k[0] and k[1] to open the file, and then process the spectra in the second for loop. I am giving it a try. Thanks. – Huanian Zhang Apr 08 '17 at 04:32
  • You are welcome. Let me know whether you get it to work. – Paul Panzer Apr 08 '17 at 05:02
  • I am giving it a try this weekend. It might take a while for me to implement it. Your help is highly appreciated. I will let you know when it works. Have a good weekend. – Huanian Zhang Apr 08 '17 at 05:57
  • I implemented it and it works well. I have a little question: in the `groupby` statement, what is the meaning of the `lambda` function? One suggestion: if you could explain your code a little more, other people who have the same problem could understand it quickly and correctly. Thanks a lot. – Huanian Zhang Apr 08 '17 at 18:50
  • @HuanianZhang Congratulations :-) I added a comment to the code hopefully explaining the `groupby` statement. – Paul Panzer Apr 08 '17 at 19:59
  • Hi Paul, I am trying to implement the code on a high-performance computing system, so I need to modify the code to use `multiprocessing`. I wonder, do you have any thoughts on how to modify it? Thanks a lot. – Huanian Zhang Apr 10 '17 at 04:51
  • @HuanianZhang I'm not good with `multiprocessing` but here are a few thoughts. 1) the first step (sorting the catalog) unless it takes a very long time can probably not reasonably be sped up. You could try splitting it up into chunks and sort them in different processes and then sort the presorted chunks. Since `lexsort` uses the mergesort algorithm in theory this could gain a bit of speed but the parallelisation overheads will possibly eat up any gains. 2) I'm not sure how compatible the multiprocessing looping structures (`Pool`?) are with `groupby`. You'd certainly want to assign entire – Paul Panzer Apr 10 '17 at 12:06
  • @HuanianZhang groups to a given worker. Perhaps it'd be faster to first compute the groups (using `np.diff` and `np.where` on the sorted `n1, n2`) and then split the work roughly evenly between the processes. 3) You'll probably have to serialise at least the writing of files using locks or something similar. 4) Anyway, you may want to post a new question and hopefully receive an answer from some one who knows more about `multiprocessing` than I do. – Paul Panzer Apr 10 '17 at 12:11
  • Thank you very much. I may post another question about the multiprocessing on HPC. Have a nice week. – Huanian Zhang Apr 10 '17 at 15:19
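
As a rough, untested sketch of the multiprocessing idea outlined in the last comments (compute the group boundaries with np.diff/np.where on the sorted keys, then hand whole groups to worker processes), something along these lines could work; process_file_group is a hypothetical worker, and n1, n2 are assumed to be integer NumPy arrays (mocked here, as in the proof of concept above):

import multiprocessing as mp
import numpy as np

def process_file_group(args):
    # Hypothetical worker: handles all catalog rows that share one (n1, n2) file.
    f1, f2, row_indices = args
    # open the (f1, f2) file once, loop over row_indices, return or save results
    return len(row_indices)

if __name__ == '__main__':
    # mock catalog keys; in practice these come from the catalog as in the question
    n1 = np.random.randint(0, 3, (10,))
    n2 = np.random.randint(0, 3, (10,))

    order = np.lexsort((n2, n1))
    keys = np.stack((n1[order], n2[order]), axis=1)

    # indices where the (n1, n2) key changes, as suggested in the comments
    change = np.where(np.any(np.diff(keys, axis=0) != 0, axis=1))[0] + 1
    starts = np.concatenate(([0], change))
    ends = np.concatenate((change, [len(order)]))

    # one task per file: its key and all catalog rows that use that file
    tasks = [(keys[a, 0], keys[a, 1], order[a:b]) for a, b in zip(starts, ends)]

    pool = mp.Pool()
    results = pool.map(process_file_group, tasks)
    pool.close()
    pool.join()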