
I have seen this related question: Sum of Square Differences (SSD) in numpy/scipy

However, I was wondering: my machine has 16 cores and 64 GB of RAM. Iterating over a list of more than a million items in Python to first get the prices, then calculate the sum of squared differences for 1 trading year, and save all the results back into a dataframe is very time- and energy-consuming. Can I use multiprocessing to speed this process up?


import time

import numpy as np
import pandas as pd

# price_pipeline: DataFrame of daily prices, indexed by date, one column per ticker (defined elsewhere)
def SSD(list_of_tickers):
    start = time.time()
    ssd_results = pd.DataFrame(columns=['stock1', 'stock2', 'corr', 'ssd'])
    for i in range(len(list_of_tickers)):
        price1 = price_pipeline.loc['2015-01-01':'2016-01-01', list_of_tickers[i][0]]
        price2 = price_pipeline.loc['2015-01-01':'2016-01-01', list_of_tickers[i][1]]
        ssd = np.sum((price1 - price2)**2)
        ssd_results.loc[i, 'ssd'] = ssd
        ssd_results.loc[i, 'stock1'] = list_of_tickers[i][0]
        ssd_results.loc[i, 'stock2'] = list_of_tickers[i][1]
        ssd_results.loc[i, 'corr'] = list_of_tickers[i][2]

    print('finished running in: {}'.format(time.time() - start))
    return ssd_results

These changes, thanks to the discussion with Nick in the comment section, significantly improved the speed (810 seconds):


def SSD(list_of_tickers):
    start = time.time()
    ssd_results = pd.DataFrame(columns=['stock1', 'stock2', 'corr', 'ssd'])
    list_ssd = []
    list_corr = []
    list_stock1 = []
    list_stock2 = []
    # Slice the trading year once, outside the loop
    new_pipe = price_pipeline.loc['2015-01-01':'2016-01-01']
    for i in range(len(list_of_tickers)):
        price1 = new_pipe[list_of_tickers[i][0]]
        price2 = new_pipe[list_of_tickers[i][1]]
        ssd = np.sum((price1 - price2)**2)

        list_ssd.append(ssd)
        list_corr.append(list_of_tickers[i][2])
        list_stock1.append(list_of_tickers[i][0])
        list_stock2.append(list_of_tickers[i][1])

    print('finished running in: {}'.format(time.time() - start))
    ssd_results['stock1'] = list_stock1
    ssd_results['stock2'] = list_stock2
    ssd_results['ssd'] = list_ssd
    ssd_results['corr'] = list_corr
    
    return ssd_results
  • Your computer runs over a billion instructions per second, so doing a sum of square differences for a million entries will take around a millisecond. At that size, your program would spend more time in multiprocessing overhead than actually doing the calculation. Seems not worth bothering with, unless you're doing it for learning purposes. – Nick ODell Jan 08 '22 at 20:57
  • How does that happen? So there is no way of speeding up this process? I would like to do it for educational purposes. – cem Jan 08 '22 at 21:04
  • Iterating the list is fast. Creating the *new* list (or other data structure) of results, which will require quite a bit of memory allocation, is the slow part. – chepner Jan 08 '22 at 21:06
  • @NickODell It does not take a millisecond; were I to run this program using the full list, it would take a day. – cem Jan 08 '22 at 21:07
  • @NickODell please check the latest edit. What's your opinion? – cem Jan 08 '22 at 21:13
  • I suspect part of it is the `.loc`. That's repeatedly doing a slicing operation on the dataframe, and I think you could do it before the loop. – Nick ODell Jan 08 '22 at 21:17
  • @NickODell so good news! I made some changes: I did the slicing before the loop, calling it new_pipeline. Then I used lists to store all the attributes, storing everything in the dataframe at the end. 100,000 runs in 50 seconds rather than what would have been maybe an hour. But the next step is, how can we make this go faster? Multiprocessing? – cem Jan 08 '22 at 21:33
  • Would be good if we knew what your data looks like... – Kelly Bundy Jan 08 '22 at 21:33
  • In your second version you don't seem to use `i` for anything other than `list_of_tickers[i]`, so why not just iterate the list directly? – Kelly Bundy Jan 08 '22 at 21:42
  • list_of_tickers is a list of 1.5 million+ items, each a tuple (ticker 1, ticker 2, correlation). Then in the for loop it takes each item, gets the price for each ticker and computes the sum of squared differences. It appends to the lists and, once the run is finished, stores everything in a dataframe. – cem Jan 08 '22 at 21:49
  • @cem Are you interested in the pairwise SSE of *every* stock, or just a subset of them? If the answer is the former, I would point out that Scipy may be a faster approach. The function [scipy.spatial.distance.pdist()](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html) with metric set to 'sqeuclidean' is equivalent to an SSE, and can calculate all distances at once. – Nick ODell Jan 08 '22 at 22:06
  • It is the former, I'll check it out, thank you for your help - but is there a way to use multiprocessing? – cem Jan 08 '22 at 22:16
  • @cem Look into [Pool.map()](https://docs.python.org/3/library/multiprocessing.html#multiprocessing.pool.Pool.map) in the multiprocessing module. IMO, it's the easiest and best way to write multiprocessing code. – Nick ODell Jan 09 '22 at 04:01
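
The multiprocessing part of the question is only addressed in the comments above. Below is a minimal sketch of what the Pool.map() approach Nick ODell points to could look like, assuming price_pipeline is the question's module-level DataFrame and a fork-based start method (e.g. Linux) so that worker processes inherit it; the helper names and chunking scheme are illustrative, not from the thread:


from multiprocessing import Pool

import numpy as np
import pandas as pd

# price_pipeline is assumed to be the question's module-level DataFrame of prices.

def ssd_chunk(chunk):
    # Slice the trading year once per worker call, then loop over this
    # worker's share of the (stock1, stock2, corr) tuples.
    window = price_pipeline.loc['2015-01-01':'2016-01-01']
    rows = []
    for stock1, stock2, corr in chunk:
        ssd = np.sum((window[stock1] - window[stock2]) ** 2)
        rows.append((stock1, stock2, corr, ssd))
    return rows

def ssd_parallel(list_of_tickers, processes=16):
    # One chunk per process, so each worker gets a single large task
    # instead of millions of tiny ones that would drown in overhead.
    chunks = [list_of_tickers[i::processes] for i in range(processes)]
    with Pool(processes) as pool:
        parts = pool.map(ssd_chunk, chunks)
    rows = [row for part in parts for row in part]
    return pd.DataFrame(rows, columns=['stock1', 'stock2', 'corr', 'ssd'])

Whether this actually beats the 810-second single-process run depends on how much of that time goes into the per-pair pandas indexing versus process startup and pickling the results back.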

1 Answer


You should not try to reinvent the wheel. Instead, use vectorized calculations with numpy. It works very well for calculations on vectors, here the columns of your dataframe.

– Benjamin Rio
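
The answer gives no code. A minimal sketch of what the vectorized calculation could look like for this pairs list follows, assuming (as in the question) that price_pipeline has one column per ticker and a date index; the function name and the use of einsum are illustrative choices, not from the answer:


import numpy as np
import pandas as pd

def ssd_vectorized(list_of_tickers, price_pipeline):
    # One numpy array holding all prices in the window: shape (n_days, n_stocks).
    window = price_pipeline.loc['2015-01-01':'2016-01-01']
    prices = window.to_numpy()
    col = {ticker: i for i, ticker in enumerate(window.columns)}

    stock1 = [t[0] for t in list_of_tickers]
    stock2 = [t[1] for t in list_of_tickers]
    corr = [t[2] for t in list_of_tickers]

    # Column index of each pair member, then one broadcasted subtraction over
    # every pair at once; the squared differences are summed down the time axis.
    i1 = np.fromiter((col[s] for s in stock1), dtype=np.intp)
    i2 = np.fromiter((col[s] for s in stock2), dtype=np.intp)
    diff = prices[:, i1] - prices[:, i2]
    ssd = np.einsum('ij,ij->j', diff, diff)

    return pd.DataFrame({'stock1': stock1, 'stock2': stock2,
                         'corr': corr, 'ssd': ssd})

With roughly 250 trading days and 1.5 million pairs the intermediate diff array is a few gigabytes of float64, which fits comfortably in 64 GB of RAM; if memory were tighter, the pairs could be processed in slices.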