I am trying to plot the year-over-year (YoY) correlation of a DataFrame in Python. How does one get the 3 pairwise correlation coefficients, one for each pair of the variables "AAPL", "IBM" and "MSFT", for each year, and then plot them with matplotlib?

How does one calculate a correlation by row? .corrwith seems to be what is suggested, but it is not working here.

https://www.geeksforgeeks.org/python-pandas-dataframe-corrwith/

I managed to get to a pandas DataFrame where each row represents the year and each element represents the cumulative price over the year. I would like to take the correlations of the cumulative YoY prices then plot them as a function of time.

The data looks like:

             AAPL           IBM         MSFT
Year                                        
2003   333.392142  21429.009979  6585.475002
2004   637.586428  22862.419960  6837.309986
2005  1678.695713  21121.199997  6519.779993
2006  2545.412858  20827.630028  6592.800003
2007  4603.665710  26528.350021  7638.409990
2008  5143.625731  27841.030014  6755.059990
2009  5278.287136  27444.059998  5779.759998
2010  9312.338573  33034.919891  6795.050001

The final plot is meant to look like this:

[Image: example of the desired plot]

To summarize the question: How does one take the following data, calculate the 3 pairwise correlations for each year and then use matplotlib in order to plot the results?

The code used so far to import and manipulate the data is provided below. Note that yfinance was used to load the data:

#!pip install yfinance
import yfinance as yf
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

ticker_Symbol = ["AAPL", "MSFT", "IBM"]
start_date = '2003-01-01'
end_date = '2010-12-31'

# Download daily prices for all three tickers
df5 = yf.download(ticker_Symbol, start=start_date, end=end_date)

# Keep only the daily opening prices (one column per ticker)
df = df5["Open"]

print(df.head(3))

# Sum the daily opening prices within each calendar year.
# Grouping on the index avoids adding a "Year" column to a slice of df5.
dfYearly = df.groupby(df.index.year).sum()
dfYearly.index.name = "Year"
dfYearly
user4933

1 Answer

You cannot calculate a correlation between two single numbers.

The idea behind calculating a correlation coefficient is that there is an underlying "population" correlation coefficient that you estimate by calculating the empirical coefficient for a data sample. But if the size of that sample is 1, you have zero information about any potential correlation.
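
To make this concrete, here is a minimal sketch that rebuilds the yearly table from the question by hand (the values are copied from the question; the variable names are only illustrative). It shows that asking pandas for the correlation within a single year's row yields only NaN, while the full table gives one correlation matrix across all eight years rather than one per year.

```python
import pandas as pd

# Yearly sums copied from the table in the question.
yearly = pd.DataFrame(
    {
        "AAPL": [333.392142, 637.586428, 1678.695713, 2545.412858,
                 4603.665710, 5143.625731, 5278.287136, 9312.338573],
        "IBM": [21429.009979, 22862.419960, 21121.199997, 20827.630028,
                26528.350021, 27841.030014, 27444.059998, 33034.919891],
        "MSFT": [6585.475002, 6837.309986, 6519.779993, 6592.800003,
                 7638.409990, 6755.059990, 5779.759998, 6795.050001],
    },
    index=pd.Index(range(2003, 2011), name="Year"),
)

# One observation per column: there is no spread to correlate, so every
# entry of the resulting matrix is NaN.
single_year = yearly.loc[[2003]].corr()
print(single_year)

# Eight observations per column: a single 3x3 matrix describing how the
# yearly sums move together across years, not a value per year.
across_years = yearly.corr()
print(across_years)
```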

So if you want to calculate separate correlation coefficients for individual years, you will need data that is not already aggregated by year. Then you could in fact use corrwith as the aggregation method per year.
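
As a rough sketch of that idea, the example below groups daily data by year and computes the pairwise correlations within each year. Several things here are assumptions and not part of the original post: the daily prices are synthetic random walks standing in for the yfinance download, the helper name yearly_pairwise_corr is hypothetical, and DataFrame.corr is used per group instead of corrwith because it returns all three pairs at once.

```python
from itertools import combinations

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic daily "prices" (assumption: the real data has a DatetimeIndex
# and one column per ticker, like the "Open" block from yfinance).
rng = np.random.default_rng(0)
dates = pd.bdate_range("2003-01-01", "2010-12-31")
steps = rng.normal(size=(len(dates), 3))
daily = pd.DataFrame(
    100 + steps.cumsum(axis=0), index=dates, columns=["AAPL", "IBM", "MSFT"]
)


def yearly_pairwise_corr(prices):
    """Return one row per year and one column per pair of tickers."""
    rows = {}
    for year, block in prices.groupby(prices.index.year):
        corr = block.corr()  # Pearson correlation matrix for this year
        rows[year] = {
            f"{a}-{b}": corr.loc[a, b]
            for a, b in combinations(prices.columns, 2)
        }
    # The dict of dicts has years as outer keys, so transpose to get
    # years on the index and ticker pairs as columns.
    return pd.DataFrame(rows).T.rename_axis("Year")


yearly_corr = yearly_pairwise_corr(daily)
print(yearly_corr)

# One line per ticker pair, plotted against the year.
fig, ax = plt.subplots()
yearly_corr.plot(ax=ax, marker="o")
ax.set_xlabel("Year")
ax.set_ylabel("Pearson correlation")
ax.set_title("Pairwise correlation per year (synthetic data)")
```

In an interactive session, calling plt.show() afterwards displays the figure. With real data, the daily "Open" DataFrame from the question would take the place of the synthetic one.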

Arne
  • Hi @Arne. I get what you are saying about the correlation, thanks; I never actually thought about that. I am struggling to understand the implementation, though: how would I "use corrwith as the aggregation method per year"? – user4933 May 18 '20 at 11:36
  • Hi @Thamu Mnyulwa. Don't read too much into that. I just meant that when you have several numbers per variable per year, calculating correlation coefficients is one way to summarize the data per year. – Arne May 18 '20 at 20:03