-3

I am downloading a star catalog from Vizier (using astroquery). The catalog concerned does not include star names so I am getting these from SIMBAD (also using astroquery) by querying all SIMBAD stars within 1 arcsec of each of my Vizier catalog stars.

I need to then perform a match by ra/dec coordinates. However both the Vizier and SIMBAD coordinates may be slightly inaccurate so I can't do an exact match.

My current solution is to specify a tolerance and, for each Vizier star, call the function below to loop through the SIMBAD stars, testing whether the coordinates agree within the specified tolerance. As a double-check, because stars can be very close together, I also check whether the star magnitudes match to within 0.1 mag.

This all works but for a Vizier catalog of c.2,000 stars and a SIMBAD dataset of similar size it takes over 2 minutes to run. I'm looking for ideas to speed this up.

    def get_simbad_name(self, vizier_star, simbad_stars, tolerance):
        """
        Searches simbad_stars to find the SIMBAD name of the star 
        referenced in vizier_star.

        A match is deemed to exist if a star in simbad_stars has both 
        ra and dec +/- tolerance of the target vizier_star and if their V 
        magnitudes, rounded to one decimal place, also match.

        Parameters
        ==========
        vizier_star : astropy.table.Row
            Row of results from Vizier query, corresponding to a star in a 
            Vizier catalog. Columns of interest to this function are:

            '_RAJ2000' : float [Right ascension in decimal degrees]
            '_DEJ2000' : float [Declination in decimal degrees]
            'Vmag' : float [V magnitude (to 3 decimal places)]

        simbad_stars : list of dict
            List of star data derived from a Vizier query. Keys of interest 
            to this function are:

            'ra' : float [Right ascension in decimal degrees (ICRS/J2000)]
            'dec' : float [Declination in decimal degrees (ICRS/J2000)]
            'Vmag' : float [V magnitude (to 3 decimal places)]
            'name' : str [SIMBAD primary id of star]

        tolerance : float
            The tolerance, in degrees, to be used in determining whether 
            the ra/dec coordinates match.

        Returns
        =======
        name : str
            If match then returns the SIMBAD name. If no match returns 
            an empty string.

        Notes
        =====
        simbad_stars are not all guaranteed to have Vmag. Any that don't are 
        ignored.
        """
        for item in simbad_stars:
            try:
                approx_Vmag = round(item['Vmag'],1)
            except KeyError:
                continue
            if ((vizier_star['_RAJ2000'] > item['ra'] - tolerance) and
                (vizier_star['_RAJ2000'] < item['ra'] + tolerance) and
                (vizier_star['_DEJ2000'] > item['dec'] - tolerance) and
                (vizier_star['_DEJ2000'] < item['dec'] + tolerance) and
                (round(vizier_star['Vmag'],1) == approx_Vmag)):
                return item['name']
        return ''

Some more thoughts after the comments:

The match success is very high (c. 99%) so the loop exits early in almost all cases. It doesn’t have to iterate all of simbad_stars.

I could improve things further if I pre-sort simbad_stars by ra and use a binary chop to get the index of where to start the loop.

nakb
  • 321
  • 1
  • 3
  • 16
  • Honestly, the best way to speed this kind of code up is generally to change the data structures underlying it -- which is something that can't be done just by changing a single function, but requires modifying how your data is stored to let you index into the dataset and retrieve only a possibly-relevant subset. – Charles Duffy Jul 16 '19 at 15:54
  • The should be a code review since it works. – Error - Syntactical Remorse Jul 16 '19 at 15:54
  • 2
    @Error-SyntacticalRemorse, except it's not permissible in CR as currently asked because it isn't complete enough to be runnable with only the function given. (CR also asks that we close off-topic questions on SO *because they're off-topic here*, rather than because they're more on-topic on CR; see discussion in [A Guide To Code Review For Stack Overflow Users](https://codereview.meta.stackexchange.com/questions/5777/a-guide-to-code-review-for-stack-overflow-users)). – Charles Duffy Jul 16 '19 at 15:55
  • @nakb, ...so, some *minor* improvements, you can try looking up your specific `vizier_star` entries before you start the loop over `simbad_stars`, since there's no reason to do the same 4 lookups thousands of times. That's not the kind of orders-of-magnitude fix you're looking for, though. – Charles Duffy Jul 16 '19 at 15:58
  • @nakb, ...it's also generally not a great idea to use exceptions in performance-sensitive code; better to use `item.get('Vmag')` and `continue` if the result is `None` rather than depending on exception handling. – Charles Duffy Jul 16 '19 at 16:00
  • @nakb, ...it's also always a good idea to actually profile your code to find out where it's spending its time rather than guessing. Maybe you'll find out that the `round()` calls are slow and you're better off doing direct integer comparisons; who knows? (Similarly, since `round(vizier_star['Vmag'],1)` only is going to have one answer, you should calculate it only once, not once per time through the loop). – Charles Duffy Jul 16 '19 at 16:02
  • 1
    Are both catalogs in memory? So what takes two minutes is calling this once for each vizer_star, for ~(2000 ** 2) lookups, right? For a single star this, with 2000 comparisons, this should be quite fast. If you put the code calling this, with the total ~4.000.000 comparisons, it may be possible to speed things up by converting one of the catalogues to a data-structure indexed by star-coordinates , but with just this part, it is hard to get anything done. – jsbueno Jul 16 '19 at 16:03

2 Answers2

2

The question looks like it will be closed because of how it is asked, but there are two useful answers:

(1) for positional cross-matching, see https://docs.astropy.org/en/stable/coordinates/matchsep.html

(2) for the general case of what you're doing here, instead of looping over sources, you should use vectorized operations.

keflavich
  • 18,278
  • 20
  • 86
  • 118
0

I have managed to achieve a 20x increase in speed by pre-sorting simbad_stars and using bisect_left and bisect_right to define start and stop indices within it.

I can post the code if anyone is interested (it is a fair bit longer than the original as it is a more generalised solution using a custom class).

nakb
  • 321
  • 1
  • 3
  • 16