
I have two pandas DataFrames:

DataFrame A:

date        ticker  return
2017-01-03  CRM     0.018040121229614625
2017-01-03  MSFT    -0.0033444816053511683
2017-01-04  CRM     0.024198086662915008
2017-01-04  MSFT    -0.0028809218950064386
2017-01-05  CRM     -0.0002746875429199269
2017-01-05  MSFT    0.0017687731146487362

DataFrame B:

date        ticker  return
2017-01-03  CRM     0.018040120991250852
2017-01-03  MSFT    -0.003344466975803595
2017-01-04  CRM     0.024198103213211475
2017-01-04  MSFT    -0.0028809268004892363
2017-01-05  CRM     -0.00027464144673694513
2017-01-05  MSFT    0.0017687829680113065

Now I need a third 'consolidated' dataframe:

  • Same column names
  • For each row, I have to choose the 'return' value from DataFrame A or DataFrame B, whichever has the smaller absolute value

Any suggestions?
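
For reference, here is a minimal sketch that rebuilds the two frames above as plain pandas DataFrames (the names A and B are just placeholders, used by the answers below):

import pandas as pd

dates = ["2017-01-03", "2017-01-03", "2017-01-04", "2017-01-04", "2017-01-05", "2017-01-05"]
tickers = ["CRM", "MSFT", "CRM", "MSFT", "CRM", "MSFT"]

# DataFrame A, values copied from the first table
A = pd.DataFrame({
    "date": dates,
    "ticker": tickers,
    "return": [0.018040121229614625, -0.0033444816053511683,
               0.024198086662915008, -0.0028809218950064386,
               -0.0002746875429199269, 0.0017687731146487362],
})

# DataFrame B, values copied from the second table
B = pd.DataFrame({
    "date": dates,
    "ticker": tickers,
    "return": [0.018040120991250852, -0.003344466975803595,
               0.024198103213211475, -0.0028809268004892363,
               -0.00027464144673694513, 0.0017687829680113065],
})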

Alfonso_MA
    [please don't put images of code or data in the question](https://meta.stackoverflow.com/questions/285551/why-not-upload-images-of-code-on-so-when-asking-a-question), instead post them as text so people can reproduce your question – anky Aug 31 '19 at 14:56
  • I have just corrected that – Alfonso_MA Aug 31 '19 at 17:31
  • thanks, `(pd.concat((A,B),ignore_index=True).groupby(['date','ticker'])['return'] .apply(lambda x: x.abs().min()).reset_index())`? – anky Aug 31 '19 at 17:37
  • Almost. You are filling the new column with the absolute value. That's not what I need. I need to fill it with the 'return' value from DataFrame A or DataFrame B, whichever has the smaller absolute value. – Alfonso_MA Aug 31 '19 at 18:00

3 Answers


Edited: new answer with runnable code

The following code works even if the number of rows is not equal. It first takes the rows common to both dataframes and then picks the correct value for the required column.

import numpy as np
import pandas as pd

## create dummy data so the code is runnable
## ------------------------------------------
n_rows = 20
sub_categories = np.random.choice(4, size=n_rows)
dic1 = {
    "a": list(range(n_rows)),
    "b": sub_categories,
    "c": np.random.randn(n_rows)
}

dic2 = {
    "a": range(n_rows),
    "b": sub_categories,
    "c": np.random.randn(n_rows)
}

df1 = pd.DataFrame(dic1)
df1.drop(index=list(np.random.choice(n_rows, 5, replace=False)), inplace=True)

df2 = pd.DataFrame(dic2)
df2.drop(index=list(np.random.choice(n_rows, 3, replace=False)), inplace=True)


## Main answer
## ------------------------------------------

## merge df1 and df2, then create a new column c that keeps whichever of
## c_1 and c_2 has the smaller absolute value
result = df1.merge(df2, how="inner", on=["a", "b"], suffixes=["_1", "_2"])
result["c"] = result["c_1"].where(np.abs(result["c_1"]) < np.abs(result["c_2"]),
                                  result["c_2"])
print(result)

## finally reindex to drop the extra columns
result = result.reindex(columns=["a", "b", "c"])
print(result)
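
Applied to the question's data (a sketch, assuming the two frames are named A and B as in the question), the same merge + where idea would look like:

merged = A.merge(B, how="inner", on=["date", "ticker"], suffixes=["_A", "_B"])

# keep whichever return has the smaller absolute value
merged["return"] = merged["return_A"].where(
    merged["return_A"].abs() < merged["return_B"].abs(), merged["return_B"])

consolidated = merged[["date", "ticker", "return"]]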


Old Answer

You can do something like the following:

series = df1["return"].where(np.abs(df1["return"])<np.abs(df2["return"]), df2["return"], axis=0)
series

It returns a series whose value in each row is taken from df1 if the absolute value of 'return' is smaller than in the same row of df2; otherwise the value is taken from df2.

Then you can replace the column of either df1 or df2 (or a copy) to get the dataframe you want:

df1["return"] = series
Dev Khadka
  • You can do `df1["return"]=np.where(np.abs(df1["return"]) – moys Aug 31 '19 at 16:20
  • ya thats right, I just broke that in two steps for clarity – Dev Khadka Aug 31 '19 at 16:23
  • It works if I have the same number of rows in DataFrame A and DataFrame B. How can I do it if there are rows in DataFrame A that do not exist in DataFrame B? Thanks a lot – Alfonso_MA Aug 31 '19 at 17:04
  • @user2132478 Please check the edited answer above; it should now work even with a different number of rows :) Don't forget to mark it resolved if it works – Dev Khadka Sep 01 '19 at 06:57

You can use concat to join both dataframes, then group by ticker using groupby and get the minimum of each group:

df3 = pd.concat([df1, df2]).groupby('ticker').min().reset_index()
ansev

Try concat + groupby on 'return' and take the minimum with key=abs:

(pd.concat((A, B), ignore_index=True)
   .groupby(['date', 'ticker'])['return'].agg(lambda x: min(x, key=abs)).reset_index())

         date ticker    return
0  2017-01-03    CRM  0.018040
1  2017-01-03   MSFT -0.003344
2  2017-01-04    CRM  0.024198
3  2017-01-04   MSFT -0.002881
4  2017-01-05    CRM -0.000275
5  2017-01-05   MSFT  0.001769
anky
  • If I had a second column, would I have to repeat the operation for each column, or can it be done in the same line? – Alfonso_MA Sep 01 '19 at 22:22