How could I speed up this function

Question

I am downloading a star catalog from Vizier (using astroquery). The catalog concerned does not include star names so I am getting these from SIMBAD (also using astroquery) by querying all SIMBAD stars within 1 arcsec of each of my Vizier catalog stars.

I need to then perform a match by ra/dec coordinates. However both the Vizier and SIMBAD coordinates may be slightly inaccurate so I can't do an exact match.

My current solution is to specify a tolerance and, for each Vizier star, call the function below to loop through the SIMBAD stars, testing whether the coordinates agree within the specified tolerance. As a double-check, because stars can be very close together, I also check whether the star magnitudes match to within 0.1 mag.

This all works but for a Vizier catalog of c.2,000 stars and a SIMBAD dataset of similar size it takes over 2 minutes to run. I'm looking for ideas to speed this up.

    def get_simbad_name(self, vizier_star, simbad_stars, tolerance):
        """
        Searches simbad_stars to find the SIMBAD name of the star 
        referenced in vizier_star.

        A match is deemed to exist if a star in simbad_stars has both 
        ra and dec +/- tolerance of the target vizier_star and if their V 
        magnitudes, rounded to one decimal place, also match.

        Parameters
        ==========
        vizier_star : astropy.table.Row
            Row of results from Vizier query, corresponding to a star in a 
            Vizier catalog. Columns of interest to this function are:

            '_RAJ2000' : float [Right ascension in decimal degrees]
            '_DEJ2000' : float [Declination in decimal degrees]
            'Vmag' : float [V magnitude (to 3 decimal places)]

        simbad_stars : list of dict
            List of star data derived from a Vizier query. Keys of interest 
            to this function are:

            'ra' : float [Right ascension in decimal degrees (ICRS/J2000)]
            'dec' : float [Declination in decimal degrees (ICRS/J2000)]
            'Vmag' : float [V magnitude (to 3 decimal places)]
            'name' : str [SIMBAD primary id of star]

        tolerance : float
            The tolerance, in degrees, to be used in determining whether 
            the ra/dec coordinates match.

        Returns
        =======
        name : str
            If match then returns the SIMBAD name. If no match returns 
            an empty string.

        Notes
        =====
        simbad_stars are not all guaranteed to have Vmag. Any that don't are 
        ignored.
        """
        for item in simbad_stars:
            try:
                approx_Vmag = round(item['Vmag'],1)
            except KeyError:
                continue
            if ((vizier_star['_RAJ2000'] > item['ra'] - tolerance) and
                (vizier_star['_RAJ2000'] < item['ra'] + tolerance) and
                (vizier_star['_DEJ2000'] > item['dec'] - tolerance) and
                (vizier_star['_DEJ2000'] < item['dec'] + tolerance) and
                (round(vizier_star['Vmag'],1) == approx_Vmag)):
                return item['name']
        return ''

Some more thoughts after the comments:

The match success is very high (c. 99%) so the loop exits early in almost all cases. It doesn’t have to iterate all of simbad_stars.

I could improve things further if I pre-sort simbad_stars by ra and use a binary chop to get the index of where to start the loop.

Honestly, the best way to speed this kind of code up is generally to change the data structures underlying it -- which is something that can't be done just by changing a single function, but requires modifying how your data is stored to let you index into the dataset and retrieve only a possibly-relevant subset. — Charles Duffy, Jul 16 '19 at 15:54
@Error-SyntacticalRemorse, except it's not permissible in CR as currently asked because it isn't complete enough to be runnable with only the function given. (CR also asks that we close off-topic questions on SO *because they're off-topic here*, rather than because they're more on-topic on CR; see discussion in [A Guide To Code Review For Stack Overflow Users](https://codereview.meta.stackexchange.com/questions/5777/a-guide-to-code-review-for-stack-overflow-users)). — Charles Duffy, Jul 16 '19 at 15:55
@nakb, ...so, some *minor* improvements, you can try looking up your specific `vizier_star` entries before you start the loop over `simbad_stars`, since there's no reason to do the same 4 lookups thousands of times. That's not the kind of orders-of-magnitude fix you're looking for, though. — Charles Duffy, Jul 16 '19 at 15:58
@nakb, ...it's also generally not a great idea to use exceptions in performance-sensitive code; better to use `item.get('Vmag')` and `continue` if the result is `None` rather than depending on exception handling. — Charles Duffy, Jul 16 '19 at 16:00
@nakb, ...it's also always a good idea to actually profile your code to find out where it's spending its time rather than guessing. Maybe you'll find out that the `round()` calls are slow and you're better off doing direct integer comparisons; who knows? (Similarly, since `round(vizier_star['Vmag'],1)` only is going to have one answer, you should calculate it only once, not once per time through the loop). — Charles Duffy, Jul 16 '19 at 16:02
Are both catalogs in memory? So what takes two minutes is calling this once for each vizer_star, for ~(2000 ** 2) lookups, right? For a single star this, with 2000 comparisons, this should be quite fast. If you put the code calling this, with the total ~4.000.000 comparisons, it may be possible to speed things up by converting one of the catalogues to a data-structure indexed by star-coordinates , but with just this part, it is hard to get anything done. — jsbueno, Jul 16 '19 at 16:03

score 2 · Answer 1 · answered Jul 17 '19 at 21:47

2

The question looks like it will be closed because of how it is asked, but there are two useful answers:

(1) for positional cross-matching, see https://docs.astropy.org/en/stable/coordinates/matchsep.html

(2) for the general case of what you're doing here, instead of looping over sources, you should use vectorized operations.

answered Jul 17 '19 at 21:47

keflavich

18,278
20
86
118

Thanks. match_to_catalog_sky looks interesting. – nakb Jul 18 '19 at 07:29

score 0 · Accepted Answer · answered Jul 17 '19 at 21:53

I have managed to achieve a 20x increase in speed by pre-sorting simbad_stars and using bisect_left and bisect_right to define start and stop indices within it.

I can post the code if anyone is interested (it is a fair bit longer than the original as it is a more generalised solution using a custom class).

How could I speed up this function

2 Answers2