I have seen this question: Sum of Square Differences (SSD) in numpy/scipy.
However, I was wondering about my specific case. My machine has 16 cores and 64 GB of RAM. Iterating over a list of more than a million ticker pairs in Python, first fetching the prices, then calculating the sum of squared differences over one trading year, and saving all the results back into a DataFrame is very time- and energy-consuming. Can I use multiprocessing to speed this process up?
import time

import numpy as np
import pandas as pd

def SSD(list_of_tickers):
    start = time.time()
    ssd_results = pd.DataFrame(columns=['stock1', 'stock2', 'corr', 'ssd'])
    for i in range(len(list_of_tickers)):
        # Slice one trading year of prices for both tickers of the pair
        price1 = price_pipeline.loc['2015-01-01':'2016-01-01', list_of_tickers[i][0]]
        price2 = price_pipeline.loc['2015-01-01':'2016-01-01', list_of_tickers[i][1]]
        ssd = np.sum((price1 - price2)**2)
        # Write the results into the DataFrame one row at a time
        ssd_results.loc[i, 'ssd'] = ssd
        ssd_results.loc[i, 'stock1'] = list_of_tickers[i][0]
        ssd_results.loc[i, 'stock2'] = list_of_tickers[i][1]
        ssd_results.loc[i, 'corr'] = list_of_tickers[i][2]
    print('finished running in: {}'.format(time.time() - start))
    return ssd_results
Thanks to the discussion with Nick in the comment section, the following changes significantly improved the speed (down to 810 seconds):
def SSD(list_of_tickers):
    start = time.time()
    ssd_results = pd.DataFrame(columns=['stock1', 'stock2', 'corr', 'ssd'])
    list_ssd = []
    list_corr = []
    list_stock1 = []
    list_stock2 = []
    # Slice the trading year once, outside the loop
    new_pipe = price_pipeline.loc['2015-01-01':'2016-01-01']
    for i in range(len(list_of_tickers)):
        price1 = new_pipe[list_of_tickers[i][0]]
        price2 = new_pipe[list_of_tickers[i][1]]
        ssd = np.sum((price1 - price2)**2)
        # Accumulate results in plain lists instead of growing the DataFrame
        list_ssd.append(ssd)
        list_corr.append(list_of_tickers[i][2])
        list_stock1.append(list_of_tickers[i][0])
        list_stock2.append(list_of_tickers[i][1])
    print('finished running in: {}'.format(time.time() - start))
    # Assign whole columns in one go at the end
    ssd_results['stock1'] = list_stock1
    ssd_results['stock2'] = list_stock2
    ssd_results['ssd'] = list_ssd
    ssd_results['corr'] = list_corr
    return ssd_results
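For the multiprocessing part, here is roughly what I had in mind. This is only a sketch, not tested at scale: SSD_parallel and ssd_chunk are names I made up, and it assumes price_pipeline is a module-level DataFrame and that the workers are forked (the default on Linux), so they inherit it without pickling.

import multiprocessing as mp

import numpy as np
import pandas as pd

def ssd_chunk(chunk):
    # Compute SSD for one chunk of (stock1, stock2, corr) tuples.
    # price_pipeline is assumed to be visible in the worker after fork.
    new_pipe = price_pipeline.loc['2015-01-01':'2016-01-01']
    rows = []
    for stock1, stock2, corr in chunk:
        ssd = np.sum((new_pipe[stock1] - new_pipe[stock2]) ** 2)
        rows.append((stock1, stock2, corr, ssd))
    return pd.DataFrame(rows, columns=['stock1', 'stock2', 'corr', 'ssd'])

def SSD_parallel(list_of_tickers, workers=16):
    # Split the pair list into contiguous chunks, one per worker,
    # so the concatenated result keeps the original order.
    step = -(-len(list_of_tickers) // workers)  # ceiling division
    chunks = [list_of_tickers[i:i + step]
              for i in range(0, len(list_of_tickers), step)]
    with mp.Pool(workers) as pool:
        parts = pool.map(ssd_chunk, chunks)
    return pd.concat(parts, ignore_index=True)

Would this be the right approach, or does the pickling/fork overhead eat the gains? (On Windows or macOS the spawn start method would not inherit price_pipeline, so each worker would have to reload it.)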
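Alternatively, would dropping the Python loop entirely beat multiprocessing? A sketch of a fully vectorized variant (SSD_vectorized is my own name), assuming every ticker in list_of_tickers is a column of price_pipeline and the slice has no NaNs; the two intermediate arrays are days x pairs, so roughly 2 GB each for a million pairs at float64, which should fit in 64 GB:

import numpy as np
import pandas as pd

def SSD_vectorized(list_of_tickers):
    new_pipe = price_pipeline.loc['2015-01-01':'2016-01-01']
    stock1 = [t[0] for t in list_of_tickers]
    stock2 = [t[1] for t in list_of_tickers]
    corr = [t[2] for t in list_of_tickers]
    # Gather both price series for every pair into two big
    # (days x pairs) arrays, then reduce over the date axis in one shot.
    p1 = new_pipe[stock1].to_numpy()
    p2 = new_pipe[stock2].to_numpy()
    ssd = ((p1 - p2) ** 2).sum(axis=0)
    return pd.DataFrame({'stock1': stock1, 'stock2': stock2,
                         'corr': corr, 'ssd': ssd})

If memory became a problem, the pair list could be processed in slices of, say, 100k pairs and the partial frames concatenated.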