
I have two pandas DataFrames:

DataFrame A:

date        ticker  return
2017-01-03  CRM     0.018040121229614625
2017-01-03  MSFT    -0.0033444816053511683
2017-01-04  CRM     0.024198086662915008
2017-01-04  MSFT    -0.0028809218950064386
2017-01-05  CRM     -0.0002746875429199269
2017-01-05  MSFT    0.0017687731146487362

DataFrame B:

date        ticker  return
2017-01-03  CRM     0.018040120991250852
2017-01-03  MSFT    -0.003344466975803595
2017-01-04  CRM     0.024198103213211475
2017-01-04  MSFT    -0.0028809268004892363
2017-01-05  CRM     -0.00027464144673694513
2017-01-05  MSFT    0.0017687829680113065

Now I need a third 'consolidated' dataframe:

  • Same column names
  • For each row, I have to choose the 'return' value from DataFrame A or DataFrame B, whichever has the smaller absolute value

Any suggestions?
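
For reference, here is a minimal sketch that rebuilds the two frames above as plain pandas DataFrames (the names A and B are just placeholders, used by the answers below):

import pandas as pd

dates = ["2017-01-03", "2017-01-03", "2017-01-04", "2017-01-04", "2017-01-05", "2017-01-05"]
tickers = ["CRM", "MSFT", "CRM", "MSFT", "CRM", "MSFT"]

# DataFrame A, values copied from the first table
A = pd.DataFrame({
    "date": dates,
    "ticker": tickers,
    "return": [0.018040121229614625, -0.0033444816053511683,
               0.024198086662915008, -0.0028809218950064386,
               -0.0002746875429199269, 0.0017687731146487362],
})

# DataFrame B, values copied from the second table
B = pd.DataFrame({
    "date": dates,
    "ticker": tickers,
    "return": [0.018040120991250852, -0.003344466975803595,
               0.024198103213211475, -0.0028809268004892363,
               -0.00027464144673694513, 0.0017687829680113065],
})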

Alfonso_MA
    [please don't put images of code or data in the question](https://meta.stackoverflow.com/questions/285551/why-not-upload-images-of-code-on-so-when-asking-a-question), instead post them as text so people can reproduce your question – anky Aug 31 '19 at 14:56
  • I have just corrected that – Alfonso_MA Aug 31 '19 at 17:31
  • thanks, `(pd.concat((A,B),ignore_index=True).groupby(['date','ticker'])['return'] .apply(lambda x: x.abs().min()).reset_index())`? – anky Aug 31 '19 at 17:37
  • Almost. You are filling the new column with the absolute value. That's not what I need. I need to fill it with the 'return' value from DataFrame A or DataFrame B, whichever has the smaller absolute value. – Alfonso_MA Aug 31 '19 at 18:00

3 Answers


Edited: new answer with runnable code

The following code works even if the number of rows is not equal. It first takes the rows common to both dataframes and then picks the correct value for the required column.

import numpy as np
import pandas as pd

## create dummy data so the code is runnable
## ------------------------------------------
n_rows = 20
sub_categories = np.random.choice(4, size=n_rows)
dic1 = {
    "a": list(range(n_rows)),
    "b": sub_categories,
    "c": np.random.randn(n_rows)
}

dic2 = {
    "a": range(n_rows),
    "b": sub_categories,
    "c": np.random.randn(n_rows)
}

df1 = pd.DataFrame(dic1)
df1.drop(index=list(np.random.choice(n_rows, 5, replace=False)), inplace=True)

df2 = pd.DataFrame(dic2)
df2.drop(index=list(np.random.choice(n_rows, 3, replace=False)), inplace=True)


## Main answer
## ------------------------------------------

## merge df1 and df2, then create a new column c that keeps whichever of
## c_1 and c_2 has the smaller absolute value
result = df1.merge(df2, how="inner", on=["a", "b"], suffixes=["_1", "_2"])
result["c"] = result["c_1"].where(np.abs(result["c_1"]) < np.abs(result["c_2"]),
                                  result["c_2"])
print(result)

## finally reindex to drop the extra columns
result = result.reindex(columns=["a", "b", "c"])
print(result)
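
Applied to the question's data (a sketch, assuming the two frames are named A and B as in the question), the same merge + where idea would look like:

merged = A.merge(B, how="inner", on=["date", "ticker"], suffixes=["_A", "_B"])

# keep whichever return has the smaller absolute value
merged["return"] = merged["return_A"].where(
    merged["return_A"].abs() < merged["return_B"].abs(), merged["return_B"])

consolidated = merged[["date", "ticker", "return"]]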


Old Answer

You can do something like the following:

series = df1["return"].where(np.abs(df1["return"])<np.abs(df2["return"]), df2["return"], axis=0)
series

It returns a series whose value in each row is taken from df1 if the absolute value of 'return' is smaller than in the same row of df2; otherwise the value is taken from df2.

Then you can replace the column of either df1 or df2 (or a copy) to get the dataframe you want:

df1["return"] = series
Dev Khadka
  • You can do `df1["return"]=np.where(np.abs(df1["return"]) – moys Aug 31 '19 at 16:20
  • ya thats right, I just broke that in two steps for clarity – Dev Khadka Aug 31 '19 at 16:23
  • It works if I have the same number of rows in DataFrame A and DataFrame B. How can I do it if there are rows in DataFrame A that do not exist in DataFrame B? Thanks a lot – Alfonso_MA Aug 31 '19 at 17:04
  • @user2132478 Please check the edited answer above; it should now work even with a different number of rows :) Don't forget to mark it resolved if it works – Dev Khadka Sep 01 '19 at 06:57

You can use concat to join both dataframes, then group by ticker using groupby and get the minimum of each group:

df3 = pd.concat([df1, df2]).groupby('ticker').min().reset_index()
ansev

Try concat + groupby on 'return' and take the minimum with key=abs:

(pd.concat((A, B), ignore_index=True)
   .groupby(['date', 'ticker'])['return'].agg(lambda x: min(x, key=abs)).reset_index())

         date ticker    return
0  2017-01-03    CRM  0.018040
1  2017-01-03   MSFT -0.003344
2  2017-01-04    CRM  0.024198
3  2017-01-04   MSFT -0.002881
4  2017-01-05    CRM -0.000275
5  2017-01-05   MSFT  0.001769
anky
  • If I had a second column, would I have to repeat the operation for each column, or can it be done in the same line? – Alfonso_MA Sep 01 '19 at 22:22