
I have a DataFrame with a column containing lists of string ids (see below). I want to build a distance matrix of the pairwise "distances" between all rows (e.g. with 10 rows, a 10 x 10 matrix). Since each row is a list of ids, I'm not sure how something like pdist can be used.

The values are string ids, comparable to names.


ids
0   [58545-19, 462423-43, 277581-25]
1   [0]
2   [454950-82, 433701-46, 228790-63, 266250-52, 458759-98, 152986-78, 222217-39, 433515-16, 265589-83, 439403-23, 277892-38, 223497-19, 224072-83, 461887-57, 436147-12, 227479-78, 228893-32, 279415-18, 439426-27, 437742-46, 438156-73, 438458-68, 277898-05, 438675-76, 454658-95, 431222-77, 462579-94, 434939-86, 222211-09, 178215-13, 459566-11, 463200-04, 439278-94, 459505-18, 399139-66, 455735-62, 327382-03, 439040-62, 233779-51, 431387-38, 438589-72, 437892-49, 458178-76]
3   [431380-63]
4   [442539-01, 434388-16, 454950-82, 463197-61, 228893-32, 464322-07, 462579-94, 438781-51, 437273-11, 265395-79, 463560-76, 462525-31, 439426-27, 438458-68, 464300-38, 442676-80]
5   [234729-10, 435926-98, 416670-04, 179514-28]
6   [0]
7   [0]
8   [267726-25, 235217-71, 227314-72, 185293-18, 434447-56, 170271-19, 454661-20]
9   [0]
nerdlyfe
  • Where are the numeric values to calculate a distance? Is "58545-19" just an id (the first element of the first list, at index 0), or a numeric value (58545-19 = 58526) for the distance? If you used the to_dict method, we could build sample data more easily. – sanzo213 Jul 26 '21 at 07:35
  • "58545-19" is just a string id, treatable as a string like a person's name. – nerdlyfe Jul 26 '21 at 08:21
  • Then does your data have a numeric value corresponding to each value (i.e. 58545-19), or a value corresponding to each list (i.e. [58545-19, 462423-43, 277581-25])? – sanzo213 Jul 26 '21 at 08:30
  • They are lists of string ids that just happen to look like numbers; they could be any id strings, like names. They can repeat (e.g. the same name can appear in other sets), hence the goal of finding similar sets. Thank you! – nerdlyfe Jul 26 '21 at 09:19
  • How do you define "distance" in this context? If the values are just ids, what is `d: string x string -> number`? Also, please add a sample of the expected output. – PeterE Jul 26 '21 at 11:33
  • I just reread the question: do you wish to calculate the distance between the lengths of the id lists? – PeterE Jul 26 '21 at 11:48
  • It's a distance between sets; Jaccard distance is typically used. I think I've solved it and will share here soon. – nerdlyfe Jul 26 '21 at 12:54

2 Answers


Here is a solution using the scipy.spatial.distance.pdist function to compute the pairwise distances (see full code at the end).

Step by step

custom jaccard function

scipy.spatial.distance does have a jaccard function, but it is designed for boolean arrays, so we need to define a custom function. It uses this definition of the Jaccard distance: 1 - |intersection| / |union|:

def jaccard(u, v):
    u,v = set(u[0]), set(v[0]) # pdist will pass 2D data [[a,b,c]], so we need to slice
    return 1-len(u.intersection(v))/len(u.union(v))
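
As a quick sanity check of that definition, here is a minimal standalone sketch on plain Python lists, with no pdist involved. The name jaccard_distance and the sample ids are made up for illustration, and returning 0.0 for two empty lists is an assumption, since the formula is undefined in that case.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two lists of ids: 1 - |A & B| / |A | B|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumption: two empty lists are treated as identical
        return 0.0
    return 1 - len(a & b) / len(union)

# 2 shared ids out of 4 distinct ids
print(jaccard_distance(['a-1', 'b-2', 'c-3'], ['b-2', 'c-3', 'd-4']))  # 0.5
# identical lists
print(jaccard_distance(['0'], ['0']))  # 0.0
# nothing in common
print(jaccard_distance(['0'], ['a-1']))  # 1.0
```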

Then we apply it on our dataframe column.

Warning: pdist expects a 2D array as input (a Series won't work), so we need to select the column as a DataFrame (df[['ids']]). Also, passing the function directly as the metric would cause an error, as the function is not vectorized (see the note on that point below), so we need to wrap it in a lambda.

pdist(df[['ids']], metric=lambda u,v: jaccard(u,v))

As mentioned above, it is also possible to pass a vectorized function instead, using numpy.vectorize. Note that the function differs slightly from the previous one: here we do not take the first element of the passed values, as they are already 1D.

import numpy as np

def jaccard(u, v):
    u,v = set(u), set(v) # no [0] indexing needed here
    return 1-len(u.intersection(v))/len(u.union(v))

pdist(df[['ids']], metric=np.vectorize(jaccard))

NB. A quick test on the provided dataset showed that the vectorized approach is actually slower than the lambda.

output as 2D

Finally, we transform the condensed output back into a square matrix using scipy.spatial.distance.squareform and the pandas.DataFrame constructor:

pd.DataFrame(squareform(pdist(df[['ids']], metric=lambda u,v: jaccard(u,v))))
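
To make this step more concrete: pdist returns a condensed vector holding one value per pair of rows, n*(n-1)/2 values in the order (0,1), (0,2), ..., (1,2), ..., and squareform expands it into a symmetric matrix with zeros on the diagonal. The pure-Python sketch below mimics that expansion; to_square is a hypothetical helper written for illustration, not part of scipy.

```python
def to_square(condensed, n):
    """Expand a condensed distance vector into a symmetric n x n matrix."""
    square = [[0.0] * n for _ in range(n)]
    k = 0
    for i in range(n):
        for j in range(i + 1, n):
            # each condensed entry fills both (i, j) and (j, i)
            square[i][j] = square[j][i] = condensed[k]
            k += 1
    return square

# 3 rows -> 3 pairwise distances, ordered (0,1), (0,2), (1,2)
print(to_square([0.5, 1.0, 0.25], 3))
# [[0.0, 0.5, 1.0], [0.5, 0.0, 0.25], [1.0, 0.25, 0.0]]
```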

Example (full code)

Let's start from this input:

import pandas as pd

df = pd.DataFrame([[['58545-19', '462423-43', '277581-25']],
                   [['0']],
                   [['454950-82', '433701-46', '228790-63', '266250-52', '458759-98', '152986-78', '222217-39', '433515-16', '265589-83', '439403-23', '277892-38', '223497-19', '224072-83', '461887-57', '436147-12', '227479-78', '228893-32', '279415-18', '439426-27', '437742-46', '438156-73', '438458-68', '277898-05', '438675-76', '454658-95', '431222-77', '462579-94', '434939-86', '222211-09', '178215-13', '459566-11', '463200-04', '439278-94', '459505-18', '399139-66', '455735-62', '327382-03', '439040-62', '233779-51', '431387-38', '438589-72', '437892-49', '458178-76']],
                   [['431380-63']],
                   [['442539-01', '434388-16', '454950-82', '463197-61', '228893-32', '464322-07', '462579-94', '438781-51', '437273-11', '265395-79', '463560-76', '462525-31', '439426-27', '438458-68', '464300-38', '442676-80']],
                   [['234729-10', '435926-98', '416670-04', '179514-28']],
                   [['0']],
                   [['0']],
                   [['267726-25', '235217-71', '227314-72', '185293-18', '434447-56', '170271-19', '454661-20']],
                   [['0']],
                  ], columns=['ids'])
from scipy.spatial.distance import pdist, squareform

def jaccard(u, v):
    u,v = set(u[0]), set(v[0])
    return 1-len(u.intersection(v))/len(u.union(v))

pd.DataFrame(squareform(pdist(df[['ids']], metric=lambda u,v: jaccard(u,v))))

output:

     0    1         2    3         4    5    6    7    8    9
0  0.0  1.0  1.000000  1.0  1.000000  1.0  1.0  1.0  1.0  1.0
1  1.0  0.0  1.000000  1.0  1.000000  1.0  0.0  0.0  1.0  0.0
2  1.0  1.0  0.000000  1.0  0.907407  1.0  1.0  1.0  1.0  1.0
3  1.0  1.0  1.000000  0.0  1.000000  1.0  1.0  1.0  1.0  1.0
4  1.0  1.0  0.907407  1.0  0.000000  1.0  1.0  1.0  1.0  1.0
5  1.0  1.0  1.000000  1.0  1.000000  0.0  1.0  1.0  1.0  1.0
6  1.0  0.0  1.000000  1.0  1.000000  1.0  0.0  0.0  1.0  0.0
7  1.0  0.0  1.000000  1.0  1.000000  1.0  0.0  0.0  1.0  0.0
8  1.0  1.0  1.000000  1.0  1.000000  1.0  1.0  1.0  0.0  1.0
9  1.0  0.0  1.000000  1.0  1.000000  1.0  0.0  0.0  1.0  0.0
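
The only value that is neither 0 nor 1 can be checked by hand. Counting the ids in the sample data, row 2 has 43 ids, row 4 has 16, and they share 5 (454950-82, 228893-32, 462579-94, 439426-27, 438458-68). The short sketch below just redoes that arithmetic; the variable names are illustrative.

```python
shared = 5    # ids present in both row 2 and row 4
size_2 = 43   # number of ids in row 2
size_4 = 16   # number of ids in row 4

union = size_2 + size_4 - shared  # 54 distinct ids overall
distance = 1 - shared / union

print(round(distance, 6))  # 0.907407
```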

Here is a graphical representation of the distances for the provided dataset (white = further away): [image: heatmap of the distances]

mozway

If you want to compute the Jaccard distance between the lists, i.e. a distance based on the number of items they have in common, you can iterate over the rows, compute each dissimilarity, and then build the distance DataFrame. Since the resulting DataFrame is symmetric, you can speed up the computation by filling only the upper triangle and then copying it into the lower triangle.

Starting from a DataFrame df containing the ids, you can do it as follows:

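import pandas as pd

def jaccard(a, b):
    a, b = set(a), set(b)
    c = a.intersection(b)
    # |union| = |a| + |b| - |intersection|
    return 1 - len(c) / (len(a) + len(b) - len(c))

n = len(df)
# NaN-filled float frame, so the result is numeric rather than object dtype
distances = pd.DataFrame(index=range(n), columns=range(n), dtype=float)

# fill only the upper triangle (including the diagonal)
for i in range(n):
    for j in range(i, n):
        distances.loc[i, j] = jaccard(df['ids'].iloc[i], df['ids'].iloc[j])

# copy the upper triangle into the still-empty lower triangle
distances[distances.isnull()] = distances.transpose()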
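
The same upper-triangle idea can also be sketched without pandas, visiting each pair of rows once with itertools.combinations and mirroring the value as it is computed. This is only an illustrative variant: the sample rows are made up, and it assumes every row contains at least one id (otherwise the union is empty and the division fails).

```python
from itertools import combinations

# made-up sample rows standing in for df['ids']
ids = [['a-1', 'b-2'], ['b-2', 'c-3'], ['0'], ['0']]

n = len(ids)
sets = [set(row) for row in ids]  # convert once instead of once per pair
matrix = [[0.0] * n for _ in range(n)]  # the diagonal stays at 0.0

for i, j in combinations(range(n), 2):  # every pair with i < j
    d = 1 - len(sets[i] & sets[j]) / len(sets[i] | sets[j])
    matrix[i][j] = matrix[j][i] = d  # fill both triangles at once

for row in matrix:
    print(row)
```

The nested list can then be passed to pd.DataFrame(matrix) if a DataFrame is needed.
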
Giulio Mattolin