0

Amendment:

If I have a pandas DataFrame that includes 5 columns Col1 & Col2 & Col3 & Col4 & Col5 and I need to get max Pearson's correlation coefficient between(Col2,Col3) & (Col2,Col4) & (Col2,Col5) by considering the values in Col1

The modified values for Col2 which obtained by the next formula:

df['Col1']=np.power((df['Col1']),B)
df['Col2']=df['Col2']*df['Col1']

where B is the changing variable (a single value) to get max Pearson's correlation coefficient between (new values of Col2,Col3) & (new values of Col2,Col4) & (new values of Col2,Col5).

Update:

enter image description here

The above table containing 5 columns as I mentioned above, the correlation between coefficient between (Col2,Col3) & (Col2,Col4) & (Col2,Col5) is illustrated below the table.

I need to change the values of Col2 based on two the mentioned equations where the changing value is B.

So the question is how to get the best value of B that gives a new correlation coefficient greater than or equal its counterpart(old)?

enter image description here

Update 2 :

Col1,Col2,Col3,Col4,Col5

2,0.051361397,2618,1453,1099

4,0.053507779,306,153,150

2,0.041236151,39,54,34

6,0.094526419,2755,2209,1947

4,0.079773397,2313,1261,1022

4,0.083891415,3528,2502,2029

6,0.090737243,3594,2781,2508

2,0.069552772,370,234,246

2,0.052401789,690,402,280

2,0.039930675,1218,846,631

4,0.065952096,1706,523,453

2,0.053064126,314,197,123

6,0.076847486,4019,1675,1452

2,0.044881545,604,402,356

2,0.073102611,2214,1263,1050

0,0.046998526,938,648,572

Cleb
  • 25,102
  • 20
  • 116
  • 151
Sidhom
  • 935
  • 1
  • 8
  • 15

1 Answers1

2

Not extremely elegant, but works; feel free to make this more generic:

import pandas as pd
from scipy.optimize import minimize


def minimize_me(b, df):

    # we want to maximize, so we have to multiply by -1
    return -1 * df['Col3'].corr(df['Col2'] * df['Col1'] ** b )

# read your dataframe from somehwere, e.g. csv
df = pd.read_clipboard(sep=',')

# B is greater than 0 for now
bnds = [(0, None)]

res = minimize(minimize_me, (1), args=(df,), bounds=bnds)

if res.success:
    # that's the optimal B
    print(res.x[0])

    # that's the highest correlation you can get
    print(-1 * res.fun)
else:
    print("Sorry, the optimization was not successful. Try with another initial"
          " guess or optimization method")

This will print:

0.9020784246026575 # your B
0.7614993786787415 # highest correlation for corr(col2, col3)

I now read from clipboard, replace that by your .csv file. You should then also avoid the hardcoding of the columns; the code above is just for demonstration purposes, so that you see how to set up the optimization problem itself.

If you are interested in the sum, you can use (rest of code unmodified):

def minimize_me(b, df):

    col_mod = df['Col2'] * df['Col1'] ** b

    # we want to maximize, so we have to multiply by -1
    return -1 * (df['Col3'].corr(col_mod) +
                 df['Col4'].corr(col_mod) +
                 df['Col5'].corr(col_mod))

This will print:

1.0452394748131613
2.3428368479642137
Cleb
  • 25,102
  • 20
  • 116
  • 151
  • Thanks for your answer. but I need single value for B can give the max correlation tho output of your code is the same in the accepted answer in this question '[link](https://stackoverflow.com/questions/55868595/can-python-optimize-a-variable-to-get-max-pearsons-correlation-coefficient )' – Sidhom Apr 28 '19 at 15:24
  • @Sidhom: What do you mean by single value? Should `B` maximize the sum of correlation coefficients of `col2, col3`, `col2, col4` and `col2, col5`? – Cleb Apr 28 '19 at 15:27
  • I mean the output of B should be one value can give the best correlation between the mentioned pairs together. If not there a python method can do that I think your idea about maximizing the sum of correlation coefficients of `col2, col3`, `col2, col4` and `col2, col5` will be enough – Sidhom Apr 28 '19 at 15:32
  • 1
    Thanks a lot. this is the better way to achieve the required. – Sidhom Apr 29 '19 at 02:55