0

I have the following dataframe:

import pandas as pd
import numpy as np

base_df = pd.DataFrame({
    'id': [1,2,3],
    'base_value': [100, None, 123.1]
})

id  base_value
1   100.0
2   NaN
3   123.1

There's also another one:

extended_df = pd.DataFrame({
    'id': [1, 1, 2, 2, 3],
    'ext_id': [100, 500, 90, 1, 1000],
    'role_1': [True, False, False, True, True],
    'role_2': [False, True, True, False, False],
    'ext_value': [10, 5, 21, 200, 500]
})

id  ext_id  role_1  role_2  ext_value
1   100     True    False   10
1   500     False   True    5
2   90      False   True    21
2   1       True    False   200
3   1000    True    False   10

I would like to get the following dataframe as a result:

result_df = pd.DataFrame({
    'id': [1,2,3],
    'base_value': [100, None, 123.1],
    'ext_value_role_1': [10, 200, 500],
    'ext_value_role_2' :[5, 21, None]
})

id  base_value  ext_value_role_1    ext_value_role_2
1   100.0       10                  5.0
2   NaN         200                 21.0
3   123.1       500                 NaN

The logic behind the transformation is as follows. For each row in base_df find matches in extended_df. The resulting dataframe will contain as many additional columns (excluding id, base_value) as there are role_* columns times number of ext_value columns in extended_df. If id's role is set to False the resulting value in column is None.

I've come up with the code posted below. It works for the small example but I'm hoping to get something more pandasesque.

def multiply_by_boolean(boolean_value, variable):
    if boolean_value is False:
        return None
    elif boolean_value is None:
        raise ValueError('boolean_value cannot be None')
    return variable

extended_df['ext_value_role_1'] = extended_df.apply(lambda x: multiply_by_boolean(x['role_1'], x['ext_value']), axis=1)
extended_df['ext_value_role_2'] = extended_df.apply(lambda x: multiply_by_boolean(x['role_2'], x['ext_value']), axis=1)

res_df = extended_df[['id', 'ext_value_role_1', 'ext_value_role_2']]

res_df = res_df.groupby('id').agg('max')

res_df = res_df.merge(right=base_df, left_index=True, right_on='id')
balkon16
  • 1,338
  • 4
  • 20
  • 40

1 Answers1

1

You could use merge to build a temporary dataframe and then just add the relevant columns:

tmp =base_df.merge(extended_df, on='id')

resul_df = base_df.set_index('id')

for role in ['role_1', 'role_2']:
    resul_df['ext_value_' + role] = tmp.loc[
        tmp[role] == True, ['id', 'ext_value']].set_index('id')

resul_df.reset_index(inplace=True)

It gives as expected:

   id  base_value  ext_value_role_1  ext_value_role_2
0   1       100.0                10               5.0
1   2         NaN               200              21.0
2   3       123.1               500               NaN
Serge Ballesta
  • 143,923
  • 11
  • 122
  • 252