I made a post here, but since I have received no answer so far, I thought I would also try it here, as it seems relevant.
I have the following code:
import pandas as pd
import numpy as np
import itertools
from pprint import pprint
# Import the data; rows of unequal length are padded with NaN by read_csv
df = pd.read_csv('./GPr.csv', sep=',', header=None)
data = df.values
# Drop the NaN padding: NaN != NaN, so `i == i` keeps only real entries
res = np.array([[i for i in row if i == i] for row in data.tolist()], dtype=object)
# This function makes the size-n subsets of each list in m
def subsets(m, n):
    z = []
    for i in m:
        z.append(list(itertools.combinations(i, n)))
    return z
# Make the subsets of size 2
l = subsets(res, 2)
l = [val for sublist in l for val in sublist]
Pairs = list(dict.fromkeys(l))  # deduplicate the pairs, preserving order
# Modify the pairs: ('a', 'b') -> 'a:b'
mod = [':'.join(x) for x in Pairs]
# Define new lists; note the list(...) around map: a bare map object is a
# one-shot iterator and would be exhausted after the first pair below
t0 = list(map(tuple, res.tolist()))
t1 = Pairs
t2 = mod
# Make the substitutions
result = []
for v1, v2 in zip(t1, t2):
    out = []
    for i in t0:
        common = set(v1).intersection(i)
        if set(v1) == common:
            # both elements of the pair are in this row: replace them with v2
            out.append(tuple(list(set(i) - common) + [v2]))
        else:
            out.append(tuple(i))
    result.append(out)
pprint(result, width=200)
# Delete duplicates
d = {tuple(x): x for x in result}
remain = list(d.values())
What it does is as follows: first, we import the CSV file we want to work with. It is a list of lists of elements, and for each list we find the subsets of size two. We then build a modification of each subset, called mod, which takes, say, ('a','b') and converts it to 'a:b'. Then, for each pair, we go through the original data, and wherever we find the pair we substitute it. Finally, we delete all the duplicates among the resulting lists.
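For example, on a toy input (the two rows below are made up purely for illustration, they are not from GPr.csv), the substitution step turns the pair ('a', 'b') into 'a:b' in every row that contains both elements:

rows = [('a', 'b', 'c'), ('a', 'd')]  # hypothetical rows
pair, label = ('a', 'b'), 'a:b'
out = []
for row in rows:
    common = set(pair).intersection(row)
    if set(pair) == common:  # the row contains both 'a' and 'b'
        out.append(tuple(list(set(row) - common) + [label]))
    else:
        out.append(row)
print(out)  # [('c', 'a:b'), ('a', 'd')] -- element order may vary, since sets are unordered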
The code works fine for a small set of data. The problem is that my file produces 30082 pairs, and for each pair the list of ~49000 rows has to be scanned and the pair substituted. I run this in Jupyter, and after some time the kernel dies. I wonder how one can optimise this?
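If my arithmetic is right, I suspect memory rather than raw speed is the killer: result keeps a rewritten copy of all the rows for every pair, so before the deduplication step it would hold

pairs, rows = 30082, 49000
print(pairs * rows)  # 1474018000 -> roughly 1.5 billion tuples held in memory at once

tuples, plus whatever pprint buffers while printing them all.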