
For a given pandas DataFrame df, I would like to compare every sample (row) with every other sample.

For larger datasets this leads to too many comparisons (O(n**2)). It is therefore necessary to perform these comparisons only within smaller groups (i.e. among all rows that share the same id), and as efficiently as possible.
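
For a rough sense of the saving (a sketch with made-up group counts, not taken from the measurements below): with about 10 ids spread over 10,000 rows, restricting comparisons to same-id groups cuts the comparison count by roughly a factor of 10:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
ids = rng.integers(0, 10, n)             # ~10 groups of ~1_000 rows each
group_sizes = np.bincount(ids)

all_vs_all = n ** 2                      # comparisons without any grouping
grouped = int((group_sizes ** 2).sum())  # comparisons restricted to same-id rows
print(all_vs_all, grouped)               # grouped is roughly 10x smaller here
```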

I would like to construct a dataframe (df_pairs) that contains one pair per row. Additionally, I would like to get all pair indices (ideally as a Python set).

First, I construct an example dataframe:

import numpy as np
import pandas as pd
from functools import reduce
from itertools import product, combinations

n_samples = 10_000
suffixes = ["_1", "_2"]  # for df_pairs
id_str = "id"

df = pd.DataFrame({id_str: np.random.randint(0, 10, n_samples),
                   "A": np.random.randint(0, 100, n_samples),
                   "B": np.random.randint(0, 100, n_samples),
                   "C": np.random.randint(0, 100, n_samples)}, index=range(0, n_samples))

columns_df_pairs = ([elem + suffixes[0] for elem in df.columns] + 
                    [elem + suffixes[1] for elem in df.columns])

In the following, I compare 4 different options together with their measured run times:

Option 1

groups = df.groupby(id_str).groups  # get the groups
pairs_per_group = [set(product(elem.tolist(), repeat=2)) for _, elem in groups.items()]  # determine pairs per group
set_of_pairs = reduce(set.union, pairs_per_group)  # convert all groups into one set
idcs1, idcs2 = zip(*[(e1, e2) for e1, e2 in set_of_pairs])
df_pairs = pd.DataFrame(np.hstack([df.values[idcs1, :], df.values[idcs2, :]]), # construct the dataframe of pairs
                        columns=columns_df_pairs, 
                        index=pd.MultiIndex.from_tuples(set_of_pairs, names=('index 1', 'index 2')))
df_pairs.drop([id_str + suffixes[0], id_str + suffixes[1]], inplace=True, axis=1)

Option 1 takes 34.2 s ± 1.28 s.

Option 2

groups = df.groupby(id_str).groups  # get the groups
pairs_per_group = [np.array(np.meshgrid(elem.values, elem.values)).T.reshape(-1, 2) for _, elem in groups.items()]
idcs = np.unique(np.vstack(pairs_per_group), axis=0)
df_pairs2 = pd.DataFrame(np.hstack([df.values[idcs[:, 0], :], df.values[idcs[:, 1], :]]), # construct the dataframe of pairs
                         columns=columns_df_pairs,
                         index=pd.MultiIndex.from_arrays([idcs[:, 0], idcs[:, 1]], names=('index 1', 'index 2')))
df_pairs2.drop([id_str + suffixes[0], id_str + suffixes[1]], inplace=True, axis=1)

Option 2 takes 13 s ± 1.34 s.

Option 3

groups = df.groupby(id_str).groups  # get the groups
pairs_per_group = [np.array([np.tile(elem.values, len(elem.values)), np.repeat(elem.values, len(elem.values))]).T.reshape(-1, 2) for _, elem in groups.items()]
idcs = np.unique(np.vstack(pairs_per_group), axis=0)
df_pairs3 = pd.DataFrame(np.hstack([df.values[idcs[:, 0], :], df.values[idcs[:, 1], :]]), # construct the dataframe of pairs
                         columns=columns_df_pairs,
                         index=pd.MultiIndex.from_arrays([idcs[:, 0], idcs[:, 1]], names=('index 1', 'index 2')))
df_pairs3.drop([id_str + suffixes[0], id_str + suffixes[1]], inplace=True, axis=1)

Option 3 takes 12.1 s ± 347 ms.

Option 4

df_pairs4 = pd.merge(left=df, right=df, how="inner", on=id_str, suffixes=suffixes)
# here, I do not know how to get the MultiIndex in
df_pairs4.drop([id_str], inplace=True, axis=1)

Option 4 is computed the quickest with 1.41 s ± 239 ms. However, I do not have the paired indices in this case.

I could improve the performance a little bit by using itertools.combinations instead of itertools.product. I could also build the comparison matrix and use only its upper triangle, and construct my dataframe from there. This, however, does not seem to be more efficient than performing the Cartesian product and removing the self-references as well as the reversed comparisons (a, b) = (b, a).
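
For reference, a sketch of that combinations-based variant (shown here on a small synthetic df in the same shape as the example above); it yields each unordered pair exactly once, with no self-pairs and no reversed duplicates:

```python
import numpy as np
import pandas as pd
from itertools import chain, combinations

n_samples = 10_000
id_str = "id"
rng = np.random.default_rng(0)
df = pd.DataFrame({id_str: rng.integers(0, 10, n_samples),
                   "A": rng.integers(0, 100, n_samples)})

groups = df.groupby(id_str).groups  # id -> Index of row labels
# each unordered pair (a, b) with a < b appears exactly once
pairs = set(chain.from_iterable(
    combinations(idx.tolist(), 2) for idx in groups.values()))
```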

  • Could you tell me a more efficient way to get pairs for comparison (ideally as a set to be able to use set operations)?
  • Could I use merge or another pandas function to construct my desired dataframe with the multi-indices?
xaneon

1 Answer


An inner merge discards the original index in favor of a new Int64Index. If the index is important, bring it along as a column with reset_index, then set those columns back as the index.

df_pairs4 = (pd.merge(left=df.reset_index(), right=df.reset_index(), 
                      how="inner", on=id_str, suffixes=suffixes)
               .set_index(['index_1', 'index_2']))

                 id  A_1  B_1  C_1  A_2  B_2  C_2
index_1 index_2                                  
0       0         4   92   79   10   92   79   10
        13        4   92   79   10   83   68   69
        24        4   92   79   10   67   73   90
        25        4   92   79   10   22   31   35
        36        4   92   79   10   64   44   20
...              ..  ...  ...  ...  ...  ...  ...
9993    9971      7   20   65   92   47   65   21
        9977      7   20   65   92   50   35   27
        9980      7   20   65   92   43   36   62
        9992      7   20   65   92   99    2   17
        9993      7   20   65   92   20   65   92
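
If the set of pair indices is also needed, it can be read straight off the MultiIndex; a sketch (on a synthetic df in the same shape as the question's example). Note that this set still contains the self-pairs (i, i) and both orientations (i, j) and (j, i):

```python
import numpy as np
import pandas as pd

id_str = "id"
suffixes = ["_1", "_2"]
rng = np.random.default_rng(0)
df = pd.DataFrame({id_str: rng.integers(0, 10, 1_000),
                   "A": rng.integers(0, 100, 1_000)})

df_pairs4 = (pd.merge(left=df.reset_index(), right=df.reset_index(),
                      how="inner", on=id_str, suffixes=suffixes)
               .set_index(["index_1", "index_2"]))

# the MultiIndex already holds every (i, j) pair, so a Python set falls out directly
set_of_pairs = set(df_pairs4.index)
```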
ALollz
  • Thank you very much. I now removed identical indices `df_pairs4=df_pairs4[df_pairs4["index_1"]!=df_pairs4["index_2"]]` and the reversed ones `df_pairs4=pd.DataFrame(np.sort(df_pairs4.values,axis=1),columns=df_pairs4.columns).drop_duplicates()` before setting the index `df_pairs4.set_index(["index_1","index_2"])`. I am not so sure whether this is actually the most efficient way to do that. However it seems faster than using `combinations` similar to the example above. – xaneon Feb 05 '20 at 20:42
  • @xaneon the `np.sort` method will be very fast to remove things regardless of order :D – ALollz Feb 05 '20 at 21:42
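
Following the comments above, the de-duplication can also be done on just the two index columns before setting the MultiIndex; a sketch (again on a synthetic df; note that after canonicalizing the index pair, the *_1/*_2 value columns of a surviving row keep their original orientation):

```python
import numpy as np
import pandas as pd

id_str = "id"
suffixes = ["_1", "_2"]
rng = np.random.default_rng(0)
df = pd.DataFrame({id_str: rng.integers(0, 10, 1_000),
                   "A": rng.integers(0, 100, 1_000)})

merged = pd.merge(left=df.reset_index(), right=df.reset_index(),
                  how="inner", on=id_str, suffixes=suffixes)

# drop self-pairs (i, i)
merged = merged[merged["index_1"] != merged["index_2"]].copy()
# canonicalize each pair so (i, j) and (j, i) collapse onto one row, then de-duplicate
merged[["index_1", "index_2"]] = np.sort(merged[["index_1", "index_2"]].values, axis=1)
merged = merged.drop_duplicates(subset=["index_1", "index_2"])
df_pairs4 = merged.set_index(["index_1", "index_2"])
```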