Calculating smallest within trio distance

Question

I have a pandas dataframe similar to the one below:

Output  var1      var2      var3
1       0.487981  0.297929  0.214090
1       0.945660  0.031666  0.022674
2       0.119845  0.828661  0.051495
2       0.095186  0.852232  0.052582
3       0.059520  0.053307  0.887173
3       0.091049  0.342226  0.566725
3       0.119295  0.414376  0.466329
...     ...       ...       ...

Basically, I have 3 columns (propensity score values) and one output (treatment). I want to calculate the within-trio distance in order to find the trios of outputs with the smallest within-trio distance. The experiment is taken from the paper "Matching by Propensity Score in Cohort Studies with Three Treatment Groups" by Rassen et al. From their explanation, it looks like calculating the perimeter of a triangle, but I am not sure.

I think the Java code at this GitHub link does more or less this, but I am not sure how to use it: https://github.com/bwh-dope/pharmacoepi_toolbox/blob/master/src/org/drugepi/match/MatchDistanceCalculator.java

I use Python, so I have two options: adapt that Java code or write something else. My idea is that var1, var2 and var3 can be treated as spatial x, y, z coordinates, so that each output is a point in space. I found a function that calculates the distance between 2 points:

# found here: https://stackoverflow.com/questions/68938033/min-distance-between-point-cloud-xyz-points-in-python
import itertools

import numpy as np

# Euclidean distance between two points given as 1-D arrays
distance = lambda p1, p2: np.sqrt(np.sum((p1 - p2) ** 2, axis=0))

def min_distance(cloud):
  pairs = itertools.combinations(cloud, 2)
  # np.min cannot consume a map object in Python 3, so use the built-in min
  return min(distance(*pair) for pair in pairs)

def get_points(filename):
  with open(filename, 'r') as file:
    rows = np.genfromtxt(file, delimiter=',', skip_header=True)
  return rows


filename = 'cloud.csv'
cloud = get_points(filename)
min_dist = min_distance(cloud)

However, I want to calculate the distance among 3 points, so I think I need to iterate over all the possible combinations of 3 points and their pairs, like XY, XZ and YZ, but I am not sure about this procedure.
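A minimal sketch of the triangle-perimeter reading mentioned above, assuming the within-trio distance is the sum of the three pairwise Euclidean distances (this has not been checked against the paper or the Java code). The name `trio_perimeter` is hypothetical, and the three points are rows taken from the sample table, one per treatment group.

```python
import itertools

import numpy as np


def trio_perimeter(p1, p2, p3):
    """Sum of the three pairwise Euclidean distances (triangle perimeter)."""
    points = [np.asarray(p, dtype=float) for p in (p1, p2, p3)]
    # combinations(points, 2) yields exactly the three pairs of a trio
    return sum(np.linalg.norm(a - b) for a, b in itertools.combinations(points, 2))


# One row from each treatment group in the sample table
a = [0.487981, 0.297929, 0.214090]  # Output 1
b = [0.119845, 0.828661, 0.051495]  # Output 2
c = [0.059520, 0.053307, 0.887173]  # Output 3
print(trio_perimeter(a, b, c))
```

The smaller this value, the closer the three units are in propensity-score space.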

Since StackOverflow is not a Code translation service, we can't do that for you or evaluate if it's correct. You should: **1.** Add an explanation of how to calculate the 'within-trio' distance. **2.** Show what you have tried so far (code) in a [**Minimal, Reproducible Example**](https://stackoverflow.com/help/minimal-reproducible-example). **3.** Btw, what format is your dataset? List of lists, list of dicts, pandas dataframe... ? — aneroid, Nov 08 '22 at 11:37
Close vote retracted. Provide a way to set your input data correctly to pass in to the function for `cloud`. — aneroid, Nov 08 '22 at 16:51

CasellaJr · Accepted Answer · 2022-11-11T17:24:32.510

In the end, I tried my own solution, which I think is correct but may be too computationally expensive. I created 3 datasets according to the Output value: dataset1 = dataset[dataset["Output"]==1], and the same for Output=2 and Output=3. This is my distance function:

def Euclidean_Dist(df1, df2):
    return np.linalg.norm(df1 - df2)
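As a rough illustration of the split described above, the three groups can also be built in one pass with `groupby`. This is only a sketch: it assumes the column layout from the question (`Output`, `var1`, `var2`, `var3`), which may differ from the dataset actually used in this answer, and the names `score_cols` and `groups` are hypothetical.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question
dataset = pd.DataFrame({
    "Output": [1, 1, 2, 2, 3, 3, 3],
    "var1": [0.487981, 0.945660, 0.119845, 0.095186, 0.059520, 0.091049, 0.119295],
    "var2": [0.297929, 0.031666, 0.828661, 0.852232, 0.053307, 0.342226, 0.414376],
    "var3": [0.214090, 0.022674, 0.051495, 0.052582, 0.887173, 0.566725, 0.466329],
})

score_cols = ["var1", "var2", "var3"]

# One (n_rows, 3) array of propensity scores per treatment group
groups = {label: grp[score_cols].to_numpy() for label, grp in dataset.groupby("Output")}

# Euclidean distance between the first unit of group 1 and the first unit of group 2
d12 = np.linalg.norm(groups[1][0] - groups[2][0])
print(d12)
```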

My variables:

tripletta_for = []
tripletta_tot_wr = []

p_inf = float('inf')

counter = 1

These are the steps used to compute the within-trio distance. I hope they are correct.

'''
i[0] = index
i[1] = treatment prop1
i[1][0] = treatment
i[1][1] = prop
'''
# I want to compute the distance between i[1][1], j[1][1] and k[1][1]

for i in dataset1.iterrows():
    minimum_distance = p_inf
    tripletta_for = None  # best trio found for this row of dataset1
    print(counter)  # progress indicator
    counter = counter + 1
    for j in dataset2.iterrows():
        dist12 = Euclidean_Dist(i[1][1], j[1][1])
        for k in dataset3.iterrows():
            dist13 = Euclidean_Dist(i[1][1], k[1][1])
            dist23 = Euclidean_Dist(j[1][1], k[1][1])
            # within-trio distance: sum of the three pairwise distances
            somma = dist12 + dist13 + dist23
            if somma < minimum_distance:
                minimum_distance = somma
                tripletta_for = i[0], j[0], k[0]
    if tripletta_for is None:
        break  # dataset2 or dataset3 has no unmatched rows left
    # matched rows cannot be reused in later trios
    dataset2.drop(index=tripletta_for[1], inplace=True)
    dataset3.drop(index=tripletta_for[2], inplace=True)
    tripletta_tot_wr.append(tripletta_for)
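Since the nested `iterrows` loops are the expensive part, one possible speed-up is to compute all pairwise distances at once with NumPy broadcasting and then run the same greedy selection on the resulting array. The sketch below is an alternative under stated assumptions, not the method from the paper: it assumes each group is a 2-D array with one row of propensity scores per unit, it returns positional indices rather than DataFrame index labels, and the names `pairwise` and `greedy_trios` are hypothetical. It also holds an array of size n1 × n2 × n3 in memory, so it only suits moderately sized groups.

```python
import numpy as np


def pairwise(a, b):
    """Euclidean distance between every row of a and every row of b."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)


def greedy_trios(g1, g2, g3):
    """Greedily match each row of g1 with the closest unused rows of g2 and g3."""
    g1, g2, g3 = (np.asarray(g, dtype=float) for g in (g1, g2, g3))
    d12, d13, d23 = pairwise(g1, g2), pairwise(g1, g3), pairwise(g2, g3)
    # perimeter[i, j, k] = d12[i, j] + d13[i, k] + d23[j, k]
    perimeter = d12[:, :, None] + d13[:, None, :] + d23[None, :, :]
    trios = []
    for i in range(len(g1)):
        if not np.isfinite(perimeter[i]).any():
            break  # group 2 or group 3 has no unmatched rows left
        j, k = np.unravel_index(np.argmin(perimeter[i]), perimeter[i].shape)
        trios.append((i, int(j), int(k), float(perimeter[i, j, k])))
        perimeter[:, j, :] = np.inf  # row j of group 2 is now used
        perimeter[:, :, k] = np.inf  # row k of group 3 is now used
    return trios


# Tiny usage example with made-up 2-D points
print(greedy_trios([[0, 0], [5, 5]], [[1, 0]], [[0, 1], [5, 6]]))
```

Like the loop above, this processes the rows of the first group in order, so the result depends on that order and is not guaranteed to minimise the total distance over all trios.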
