For a given pandas dataframe df
, I would like to compare every sample (row) with each other.
For bigger datasets this would lead to too many comparisons (n**2
). Therefore, it is necessary to perform these comparisons only for smaller groups (i.e. for all of those which share the same id
) and as efficiently as possible.
I would like to construct a dataframe (df_pairs
), which contains in every row one pair. Additionally, I would like to get all pair indices (ideally as a Python set).
First, I construct an example dataframe:
import numpy as np
import pandas as pd
from functools import reduce
from itertools import product, combinations
n_samples = 10_000
suffixes = ["_1", "_2"] # for df_pairs
id_str = "id"
df = pd.DataFrame({id_str: np.random.randint(0, 10, n_samples),
"A": np.random.randint(0, 100, n_samples),
"B": np.random.randint(0, 100, n_samples),
"C": np.random.randint(0, 100, n_samples)}, index=range(0, n_samples))
columns_df_pairs = ([elem + suffixes[0] for elem in df.columns] +
[elem + suffixes[1] for elem in df.columns])
In the following, I am comparing 4 different options with the corresponding performance measures:
Option 1
groups = df.groupby(id_str).groups # get the groups
pairs_per_group = [set(product(elem.tolist(), repeat=2)) for _, elem in groups.items()] # determine pairs per group
set_of_pairs = reduce(set.union, pairs_per_group) # convert all groups into one set
idcs1, idcs2 = zip(*[(e1, e2) for e1, e2 in set_of_pairs])
df_pairs = pd.DataFrame(np.hstack([df.values[idcs1, :], df.values[idcs2, :]]), # construct the dataframe of pairs
columns=columns_df_pairs,
index=pd.MultiIndex.from_tuples(set_of_pairs, names=('index 1', 'index 2')))
df_pairs.drop([id_str + suffixes[0], id_str + suffixes[1]], inplace=True, axis=1)
Option 1 takes 34.2 s ± 1.28 s.
Option 2
groups = df.groupby(id_str).groups # get the groups
pairs_per_group = [np.array(np.meshgrid(elem.values, elem.values)).T.reshape(-1, 2) for _, elem in groups.items()]
idcs = np.unique(np.vstack(pairs_per_group), axis=0)
df_pairs2 = pd.DataFrame(np.hstack([df.values[idcs[:, 0], :], df.values[idcs[:, 1], :]]), # construct the dataframe of pairs
columns=columns_df_pairs,
index=pd.MultiIndex.from_arrays([idcs[:, 0], idcs[:, 1]], names=('index 1', 'index 2')))
df_pairs2.drop([id_str + suffixes[0], id_str + suffixes[1]], inplace=True, axis=1)
Option 2 takes 13 s ± 1.34 s.
Option 3
groups = df.groupby(id_str).groups # get the groups
pairs_per_group = [np.array([np.tile(elem.values, len(elem.values)), np.repeat(elem.values, len(elem.values))]).T.reshape(-1, 2) for _, elem in groups.items()]
idcs = np.unique(np.vstack(pairs_per_group), axis=0)
df_pairs3 = pd.DataFrame(np.hstack([df.values[idcs[:, 0], :], df.values[idcs[:, 1], :]]), # construct the dataframe of pairs
columns=columns_df_pairs,
index=pd.MultiIndex.from_arrays([idcs[:, 0], idcs[:, 1]], names=('index 1', 'index 2')))
df_pairs3.drop([id_str + suffixes[0], id_str + suffixes[1]], inplace=True, axis=1)
Option 3 takes 12.1 s ± 347 ms.
Option 4
df_pairs4 = pd.merge(left=df, right=df, how="inner", on=id_str, suffixes=suffixes)
# here, I do not know how to get the MultiIndex in
df_pairs4.drop([id_str], inplace=True, axis=1)
Option 4 is computed the quickest with 1.41 s ± 239 ms. However, I do not have the paired indices in this case.
I could improve the performance a little bit by using comparisons
instead of product
of itertools. I could also build the comparison matrix and use only the upper triangular one and construct my dataframe from there. This however does not seem to be more efficient than performing the cartesian product and removing the self-references as well as inverse comparisons (a, b) = (b, a)
.
- Could you tell me a more efficient way to get pairs for comparison (ideally as a set to be able to use set operations)?
- Could I use
merge
or anotherpandas
function to construct my desired dataframe with the multi-indices?