I made a post here, but since I have received no answer so far, I thought I would also try it here, as it seems relevant.
I have the following code:
import pandas as pd
import numpy as np
import itertools
from pprint import pprint
# Import the data; rows of unequal length are padded with NaN by read_csv
df = pd.read_csv('./GPr.csv', sep=',', header=None)
data = df.values
# Drop the NaN padding: NaN != NaN, so `i == i` keeps only real entries
res = np.array([[i for i in row if i == i] for row in data.tolist()], dtype=object)
# This function makes the size-n subsets of each list in m
def subsets(m, n):
    z = []
    for i in m:
        z.append(list(itertools.combinations(i, n)))
    return z
# Make the subsets of size 2
l = subsets(res, 2)
l = [val for sublist in l for val in sublist]
Pairs = list(dict.fromkeys(l))  # deduplicate the pairs, preserving order
# Modify the pairs: ('a', 'b') -> 'a:b'
mod = [':'.join(x) for x in Pairs]
# Define new lists; note the list(...) around map: a bare map object is a
# one-shot iterator and would be exhausted after the first pair below
t0 = list(map(tuple, res.tolist()))
t1 = Pairs
t2 = mod
# Make the substitutions
result = []
for v1, v2 in zip(t1, t2):
    out = []
    for i in t0:
        common = set(v1).intersection(i)
        if set(v1) == common:
            # both elements of the pair are in this row: replace them with v2
            out.append(tuple(list(set(i) - common) + [v2]))
        else:
            out.append(tuple(i))
    result.append(out)
pprint(result, width=200)
# Delete duplicates
d = {tuple(x): x for x in result}
remain = list(d.values())
What it does is as follows: first, we import the CSV file we want to work with. It is a list of lists of elements, and for each list we find the subsets of size two. We then build a modification of each subset, called mod, which takes, say, ('a','b') and converts it to 'a:b'. Then, for each pair, we go through the original data, and wherever we find the pair we substitute it. Finally, we delete all the duplicates among the resulting lists.
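For example, on a toy input (the two rows below are made up purely for illustration, they are not from GPr.csv), the substitution step turns the pair ('a', 'b') into 'a:b' in every row that contains both elements:

rows = [('a', 'b', 'c'), ('a', 'd')]  # hypothetical rows
pair, label = ('a', 'b'), 'a:b'
out = []
for row in rows:
    common = set(pair).intersection(row)
    if set(pair) == common:  # the row contains both 'a' and 'b'
        out.append(tuple(list(set(row) - common) + [label]))
    else:
        out.append(row)
print(out)  # [('c', 'a:b'), ('a', 'd')] -- element order may vary, since sets are unordered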
The code works fine for a small set of data. The problem is that my file produces 30082 pairs, and for each pair the list of ~49000 rows has to be scanned and the pair substituted. I run this in Jupyter, and after some time the kernel dies. I wonder how one can optimise this?
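If my arithmetic is right, I suspect memory rather than raw speed is the killer: result keeps a rewritten copy of all the rows for every pair, so before the deduplication step it would hold

pairs, rows = 30082, 49000
print(pairs * rows)  # 1474018000 -> roughly 1.5 billion tuples held in memory at once

tuples, plus whatever pprint buffers while printing them all.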