You could define a match metric between individuals and then aggregate it into a match between families and, in turn, between groups. The concrete implementation depends a lot on what exactly your data looks like and how you want to define a match between individuals/families/groups (mean of max may not be the right similarity metric here).
You could use something like this, replacing the metric in the match methods with one tailored to your use case:
from dataclasses import dataclass
import numpy as np
import statistics
@dataclass
class Individual:
    X: str
    Y: str
    Z: str

    def match(self, other):
        # fraction of equal characteristics, between 0 and 1
        return statistics.mean(
            (self.X == other.X, self.Y == other.Y, self.Z == other.Z)
        ) ** 2  # square to put a higher weight on good matches
@dataclass
class Family:
    individuals: list[Individual]

    def match(self, other):
        # mean of max: best partner in other for each of our individuals
        return statistics.mean(
            max(self_individual.match(other_individual) for other_individual in other.individuals)
            for self_individual in self.individuals
        )
@dataclass
class Group:
    families: list[Family]

    def match(self, other):
        # mean of max: best partner in other for each of our families
        return statistics.mean(
            max(self_family.match(other_family) for other_family in other.families)
            for self_family in self.families
        )
i01 = Individual("blond", "blue", "tall")
i02 = Individual("blond", "green", "huge")
i03 = Individual("brown", "green", "small")
i04 = Individual("blond", "blue", "average")
i05 = Individual("blond", "green", "tall")
i06 = Individual("brown", "brown", "average")
i07 = Individual("red", "green", "small")
i08 = Individual("red", "green", "small")
i09 = Individual("brown", "green", "tall")
i10 = Individual("black", "brown", "average")
i11 = Individual("brown", "green", "small")
i12 = Individual("red", "blue", "average")
i13 = Individual("red", "green", "tall")
i14 = Individual("brown", "brown", "huge")
i15 = Individual("brown", "green", "average")
i16 = Individual("red", "blue", "tall")
i17 = Individual("red", "green", "small")
i18 = Individual("brown", "brown", "tiny")
i19 = Individual("blond", "brown", "average")
g1_t1 = Group([
    Family([i01, i02]),
    Family([i03, i04, i05, i06]),
])
g2_t1 = Group([
    Family([i07, i08, i09]),
    Family([i10, i11]),
    Family([i12, i13]),
])
g3_t1 = Group([
    Family([i14, i15, i16, i17]),
    Family([i18, i19]),
])
# groups in t2 with some migrations / changes
g1_t2 = Group([
    Family([i01, i02, i17]),  # i17 migrated from f1 of g3
    Family([i03, i04, i05, Individual("brown", "brown", "tall")]),  # i06 recorded with different height
])
g2_t2 = Group([
    Family([i07, i08, i09]),
    # Family([i10, i11]), migrated to g3
    Family([i12, i13]),
])
g3_t2 = Group([
    Family([i14, i15, i16]),  # i17 migrated to f1 of g1
    Family([Individual("black", "brown", "tiny"), i19]),  # i18 recorded with wrong hair-color
    Family([i10, i11]),  # migrated from g2
])
t1 = [g1_t1, g2_t1, g3_t1]
t2 = [g1_t2, g2_t2, g3_t2]
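# rows index the t2 groups, columns index the t1 groups
matches = np.zeros((len(t2), len(t1)))
for X, gX_t1 in enumerate(t1):
    for Y, gY_t2 in enumerate(t2):
        # match() is not symmetric, so average both directions
        matches[Y, X] = (gX_t1.match(gY_t2) + gY_t2.match(gX_t1)) / 2
print(matches)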
This gives the following matrix for the matches:
[[0.85648148 0.4691358  0.41435185]
 [0.31944444 0.87037037 0.43287037]
 [0.39583333 0.59259259 0.70833333]]
The best matches are on the diagonal, as they should be: each diagonal entry compares a group at t1 with the same group at t2.
The similarity between different groups (the rest of the matrix) is still relatively high. The reason is that there are only 3 characteristics, and none of them takes many distinct values. The mean-of-max metric exaggerates this further, because every individual and family is scored against its best available partner. This "noise" should go down if you have more characteristics to match against, especially if those characteristics are more diverse.
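To get a rough feeling for that noise level, here is a small sketch that computes the expected score of the squared-mean metric for two unrelated individuals. It rests on assumptions that are not in your data: the k characteristics are independent and each one takes one of a fixed number of equally likely values. The function name expected_chance_match is just made up for this illustration, and the sketch only covers the individual level (the max at family and group level pushes the numbers up again).

```python
from math import comb


def expected_chance_match(k: int, p: float) -> float:
    """Expected value of (matching fraction) ** 2 for two unrelated individuals.

    Assumption (not from the original data): k independent characteristics,
    each agreeing by chance with probability p.
    """
    # number of agreeing characteristics s follows a binomial distribution
    return sum(
        comb(k, s) * p**s * (1 - p) ** (k - s) * (s / k) ** 2
        for s in range(k + 1)
    )


# compare few vs. many characteristics and few vs. many distinct values
for k, n_values in [(3, 3), (3, 10), (10, 3), (10, 10)]:
    print(k, n_values, round(expected_chance_match(k, 1 / n_values), 3))
```

Under these assumptions the expected chance score works out to p(1-p)/k + p², so it shrinks both with more characteristics (larger k) and with more diverse ones (smaller p).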
To calculate the best overall assignment between groups, and not just the best match for one pair from the matrix, you can take a look at this question.
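For a matrix as small as this one you could also just brute-force the overall assignment. The following is a minimal sketch, not necessarily the approach from the linked question: it tries every one-to-one pairing of rows and columns and keeps the one with the highest total score. The helper name best_assignment is hypothetical, the matrix is copied from the output above, and a square matrix is assumed.

```python
from itertools import permutations

# match matrix from above: rows are t2 groups, columns are t1 groups
matches = [
    [0.85648148, 0.4691358, 0.41435185],
    [0.31944444, 0.87037037, 0.43287037],
    [0.39583333, 0.59259259, 0.70833333],
]


def best_assignment(scores):
    """Return, for each row, the column it is paired with so that the
    summed score over all pairs is maximal (square matrix assumed)."""
    n = len(scores)
    return max(
        permutations(range(n)),
        key=lambda cols: sum(scores[row][col] for row, col in enumerate(cols)),
    )


assignment = best_assignment(matches)
for row, col in enumerate(assignment):
    print(f"g{row + 1}_t2 <-> g{col + 1}_t1 (score {matches[row][col]:.3f})")
```

Brute force grows factorially with the number of groups, so for anything larger you would probably want a proper assignment solver such as scipy.optimize.linear_sum_assignment with maximize=True.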