Find closest column value in each row - pandas

Question

Here is a sample of a larger data set:

df_old = pd.DataFrame({'code': ['fea-1','fea-132','fea-223','fea-394','fea-595','fea-130','fea-495'],
                   'forecastWind_low':[20,15,0,45,45,25,45],
                   'forecastWind_high':['NaN' ,30,'NaN',55,65,35,'NaN'],
                   'obs_windSpeed':[20,11,3,65,55,'NaN',55]})

I have forecast windspeeds which I need to compare to observations... Ultimately I need to find the closest forecast speed (low or high) to the observation wind speed value to get an output as below:

df_new = pd.DataFrame({'code': ['fea-1','fea-132','fea-223','fea-394','fea-595','fea-130','fea-495'],
                   'forecastWind_low':[20,15,0,45,45,25,45],
                   'forecastWind_high':['NaN' ,30,'NaN',55,65,35,'NaN'],
                   'obs_windSpeed':[20,11,3,65,55,'NaN',55],
                   'nearest_forecast_windSpeed':[20,15,0,55,45,'NaN',45]})

Tried following this example (https://stackoverflow.com/questions/53969800/find-nearest-value-from-multiple-columns-and-add-to-a-new-column-in-python) but i think the addtional non-related columns and NaNs/error strings in some rows are the issue... — Jordan Ford, Sep 04 '20 at 00:54
Important you tell us what you have tried doing. Piece of code usually useful. `df=df_old.fillna(0)` and then `df['nearest_forecast_windSpeed']=np.where(df.obs_windSpeed.sub(df.forecastWind_low) — wwnde, Sep 04 '20 at 01:00

score 1 · Accepted Answer · answered Sep 04 '20 at 01:22

1

Make a custom comparison function and apply it across the rows

def check_speed_diff(high,low,obs):
    if np.isnan(obs):
        return np.nan
    elif np.isnan(high):
        return low
    elif np.isnan(low):
        return high
    
    if abs(high-obs)<abs(low-obs):
        return high
    else:
        return low

df_old.apply(lambda x: 
    check_speed_diff(
        x.forecastWind_high,
        x.forecastWind_low,
        x.obs_windSpeed
    ),
    axis=1
)

answered Sep 04 '20 at 01:22

Jeff

610
4
12

if your `Nan`s are actually strings and not `np.nan` check whether the 3 values are strings or not instead of whether they're `np.nan` in the beginning of the comparison function – Jeff Sep 04 '20 at 01:24
Thanks. Your response really helped me figure this out. – Jordan Ford Sep 04 '20 at 15:26

rhug123 · Answer 2 · 2023-05-27T23:41:02.307

Try this:

df1 = df[['forecastWind_low','forecastWind_high']]
s1 = df['obs_windSpeed']

idx,cols = pd.factorize(df1.sub(s1,axis=0).abs().idxmin(axis=1))

df.assign(nearest_forecast_windSpeed = np.where(s1.notna(),df1.reindex(cols,axis=1).to_numpy()[range(len(df)),idx],np.NaN))

Old answer:

Here is another way to achieve what you are looking for. It allows for more than just two columns to be compared.

col = ['forecastWind_low','forecastWind_high']
comparecol = ['obs_windSpeed']
df[col + comparecol] = df[col + comparecol].astype(float)
dfmerge =pd.merge(df[col].stack().reset_index(-1),df[comparecol],left_index=True,right_index=True,how='left')
dfmerge = dfmerge.rename(columns = {'level_1':'windforecast',0:'Amount'})
dfmerge['difference'] = abs(dfmerge['obs_windSpeed'] - dfmerge['Amount'])
dfmerge = dfmerge.sort_values(by='difference',ascending=True)
dfmerge = dfmerge.groupby(level=0).head(1)
df = pd.merge(df,dfmerge['Amount'],left_index=True,right_index=True,how='left')
df.loc[df['obs_windSpeed'].isna(),'Amount'] = np.nan

score 1 · Answer 3 · answered Sep 04 '20 at 15:25

Modifying Jeff's solution I managed to come up with this:

def check_speed_diff(high,low,obs):
    if obs == 'NaN':
        return np.nan
    if low != 'NaN' and high == 'NaN':
        return low
    if low == 'NaN' and high != 'NaN':
        return high
    if low != 'NaN' and high != 'NaN':
        if abs(high-obs)<abs(low-obs):
            return high
        else:
            return low

Another problem I was encountering was the strings in some columns/rows that were not 'NaN', so I used pandas and coerced the errors:

df.forecast_WindSpeed_high = pd.to_numeric(df.forecast_WindSpeed_high,errors='coerce')
df.forecast_WindSpeed_low = pd.to_numeric(df.forecast_WindSpeed_low ,errors='coerce')

Applied function using Jeff's suggestion:

df['nearest_forecastWindSpeed'] = df.apply(lambda x: check_speed_diff(
        x.forecast_WindSpeed_high, 
        x.forecast_WindSpeed_low,
        x.windSpeed),axis=1)

Might not be the most efficient but I got the job done... Thanks everyone for the help.

Find closest column value in each row - pandas

3 Answers3