3

I'm making my way around GroupBy, but I still need some help. Let's say that I've a DataFrame with columns Group, giving objects group number, some parameter R and spherical coordinates RA and Dec. Here is a mock DataFrame:

df = pd.DataFrame({ 
    'R'    : (-21.0,-21.5,-22.1,-23.7,-23.8,-20.4,-21.8,-19.3,-22.5,-24.7,-19.9),
    'RA': (154.362789,154.409301,154.419191,154.474165,154.424842,162.568516,8.355454,8.346812,8.728223,8.759622,8.799796),
    'Dec': (-0.495605,-0.453085,-0.481657,-0.614827,-0.584243,8.214719,8.355454,8.346812,8.728223,8.759622,8.799796),
    'Group': (1,1,1,1,1,2,2,2,2,2,2) 
})

I want to built a selection containing for each group the "brightest" object, i.e. the one with the smallest R (or the greatest absolute value, since Ris negative) and the 3 closest objects of the group (so I keep 4 objects in each group - we can assume that there is no group smaller than 4 objects if needed).

We assume here that we have defined the following functions:

#deg to rad
def d2r(x):
    return x * np.pi / 180.0

#rad to deg
def r2d(x):
    return x * 180.0 / np.pi

#Computes separation on a sphere
def calc_sep(phi1,theta1,phi2,theta2):
    return np.arccos(np.sin(theta1)*np.sin(theta2) + 
                     np.cos(theta1)*np.cos(theta2)*np.cos(phi2 - phi1) )

and that separation between two objects is given by r2d(calc_sep(RA1,Dec1,RA2,Dec2)), with RA1 as RA for the first object, and so on.

I can't figure out how to use GroupBy to achieve this...

Brad Solomon
  • 38,521
  • 31
  • 149
  • 235
Matt
  • 763
  • 1
  • 7
  • 25
  • Can you provide an example of what you expect to get? Do you want to sort objects inside a group? – igrinis Sep 21 '17 at 15:28

2 Answers2

2

What you can do here is build a more specific helper function that gets applied to each "sub-frame" (each group).

GroupBy is really just a facility that creates something like an iterator of (group id, DataFrame) pairs, and a function is applied to each of these when you call .groupby().apply. (That glazes over a lot of details, see here for some details on internals if you're interested.)

So after defining your three NumPy-based functions, also define:

def sep_df(df, keep=3):
    min_r = df.loc[df.R.argmin()]
    RA1, Dec1 = min_r.RA, min_r.Dec
    sep = r2d(calc_sep(RA1,Dec1,df['RA'], df['Dec']))
    idx = sep.nsmallest(keep+1).index
    return df.loc[idx]

Then just apply and you get a MultiIndex DataFrame where the first index level is the group.

print(df.groupby('Group').apply(sep_df))
              Dec  Group     R         RA
Group                                    
1     3  -0.61483      1 -23.7  154.47416
      2  -0.48166      1 -22.1  154.41919
      0  -0.49561      1 -21.0  154.36279
      4  -0.58424      1 -23.8  154.42484
2     8   8.72822      2 -22.5    8.72822
      10  8.79980      2 -19.9    8.79980
      6   8.35545      2 -21.8    8.35545
      9   8.75962      2 -24.7    8.75962

With some comments interspersed:

def sep_df(df, keep=3):
    # Applied to each sub-Dataframe (this is what GroupBy does under the hood)

    # Get RA and Dec values at minimum R
    min_r = df.loc[df.R.argmin()]  # Series - row at which R is minimum
    RA1, Dec1 = min_r.RA, min_r.Dec  # Relevant 2 scalars within this row

    # Calculate separation for each pair including minimum R row
    # The result is a series of separations, same length as `df`
    sep = r2d(calc_sep(RA1,Dec1,df['RA'], df['Dec']))

    # Get index values of `keep` (default 3) smallest results
    # Retain `keep+1` values because one will be the minimum R
    # row where separation=0
    idx = sep.nsmallest(keep+1).index

    # Restrict the result to those 3 index labels + your minimum R
    return df.loc[idx]

For speed, consider passing sort=False to GroupBy if the result still works for you.

Brad Solomon
  • 38,521
  • 31
  • 149
  • 235
  • Marvelous! It is what I was looking for. Thanks mate. I begin to understand how it works. I'll award you the bounty when available (it says in 22 hours). Just a question: when I do that, I have a `RuntimeWarning: invalid value encountered in arccos`, how do I know which lines cause the problem? – Matt Sep 21 '17 at 16:31
  • 1
    There are a couple of reasons I could think of: one is just that you actually have data being represented as strings when it shouldn't be. Try `df.dtypes` or see [this](https://stackoverflow.com/questions/26660313/pandas-location-of-a-row-with-error)--do any say "object"? --> Then convert to numeric with `pd.to_numeric`. The other suspect is that arccos(1.041516417) is undefined, but I doubt that would be the issue. Hard to say without seeing your full dataset. – Brad Solomon Sep 21 '17 at 16:35
  • Thank you for your answer. No, `RA` and `Dec` are float64. Yes, a cosine is always between -1 and 1, so arccos(1.041516417) is undefined, but why would this value occur? – Matt Sep 24 '17 at 07:40
  • (I've rewarded you with the bounty, by the way. :) ) – Matt Sep 24 '17 at 07:40
2

I want to built a selection containing for each group the "brightest" object...and the 3 closest objects of the group

step 1:

create a dataframe for the brightest object in each group

maxR = df.sort_values('R').groupby('Group')['Group', 'Dec', 'RA'].head(1)

step 2:

merge the two frames on Group & calculate the separation

merged = df.merge(maxR, on = 'Group', suffixes=['', '_max'])
merged['sep'] = merged.apply(
    lambda x: r2d(calc_sep(x.RA, x.Dec, x.RA_max, x.Dec_max)), 
    axis=1
)

step 3:

order the data frame, group by 'Group', (optional) discard intermediate fields & take the first 4 rows from each group

finaldf = merged.sort_values(['Group', 'sep'], ascending=[1,1]
).groupby('Group')[df.columns].head(4)

Produces the following data frame with your sample data:

         Dec  Group     R          RA
4  -0.584243      1 -23.8  154.424842
3  -0.614827      1 -23.7  154.474165
2  -0.481657      1 -22.1  154.419191
0  -0.495605      1 -21.0  154.362789
9   8.759622      2 -24.7    8.759622
8   8.728223      2 -22.5    8.728223
10  8.799796      2 -19.9    8.799796
6   8.355454      2 -21.8    8.355454
Haleemur Ali
  • 26,718
  • 5
  • 61
  • 85