
I am trying to match groups across two datasets, D1 and D2, which represent the groupings at times t=1 and t=2, respectively.

Assume that the data have 3 nested levels:

  • i: individuals, each with characteristics X, Y, Z
  • f: families, made up of individuals
  • g: groups, made up of families

I am trying to match the groups in D1 and D2 based on how many of their families match. Family matches are defined by individual matches, which are in turn based on the individual characteristics X, Y, Z.

Difficulties: individuals may leave families; families may move to different groups; individuals may leave a family and migrate to another family (existing or new); and the individual characteristics (X, Y, Z) may be recorded with error.

I am looking for some fuzzy matching algorithm/procedure that incorporates the nested structure of this data.

Is there a computer science / data-science term to describe this kind of match? Any implementation of this in R or Python?

My rough idea for this:

  • For every pair of groups (g1, g2), where g1 comes from D1 and g2 from D2:
  • Take a pair of families (f1 of g1, f2 of g2).
  • For each family pair (f1, f2), compute the fuzzy distances between the individual members.
  • From these, create a similarity index for the family pair.
  • Compute the allocation that best matches families between g1 and g2 (sketched in code after this list).
  • Compute the sum of the matched families' similarity indices. Call this group_match(g1, g2).
  • Choose the pairs of groups that maximize group_match.
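
In code, the allocation step would be a linear assignment problem (one-to-one bipartite matching), which scipy can solve. A rough sketch of group_match, assuming a hypothetical family_similarity(f1, f2) function built from the individual-level fuzzy distances:

import numpy as np
from scipy.optimize import linear_sum_assignment


def group_match(g1, g2, family_similarity):
    # similarity matrix over all family pairs; family_similarity(f1, f2) is a
    # placeholder assumed to return a score in [0, 1]
    sim = np.array([[family_similarity(f1, f2) for f2 in g2] for f1 in g1])
    # one-to-one family assignment maximizing total similarity (Hungarian algorithm)
    rows, cols = linear_sum_assignment(sim, maximize=True)
    return sim[rows, cols].sum()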
  • What do you mean by 'nested levels'? – Julien Jul 28 '22 at 06:14
  • @Julien, by nested levels, I mean that individuals belong to families that belong to groups. – LucasMation Jul 30 '22 at 12:28
  • This may not be fuzzy in the sense of fuzzy string matching, for example. From here it seems that all you need is a method for scoring similarities of nested objects with time-dependent properties. I think a primitive example of your data and expected result would help us help you. – gaut Jul 30 '22 at 23:05
  • A reproducible example would go a long way – moodymudskipper Aug 02 '22 at 15:14

1 Answer


You could create a match metric between individuals that is then aggregated into matches between families and then between groups. The concrete implementation depends a lot on exactly what your data look like and on how you want to define a match between individuals/families/groups (mean of max may not be the right similarity metric here).

You could use something like this, substituting your own metric in the match methods, customized to your use case:

from dataclasses import dataclass
import numpy as np
import statistics


@dataclass
class Individual:
    X: str
    Y: str
    Z: str

    def match(self, other):
        return statistics.mean(
            (self.X == other.X, self.Y == other.Y, self.Z == other.Z)
        ) ** 2  # square to put a higher weight on good matches


@dataclass
class Family:
    individuals: list[Individual]

    def match(self, other):
        # for every individual in this family, take its best match in the
        # other family, then average those best-match scores (mean of max)
        return statistics.mean(
            max(self_individual.match(other_individual) for other_individual in other.individuals)
            for self_individual in self.individuals
        )


@dataclass
class Group:
    families: list[Family]

    def match(self, other):
        # for every family in this group, take its best match among the
        # other group's families, then average those best-match scores
        return statistics.mean(
            max(self_family.match(other_family) for other_family in other.families)
            for self_family in self.families
        )


i01 = Individual("blond", "blue", "tall")
i02 = Individual("blond", "green", "huge")
i03 = Individual("brown", "green", "small")
i04 = Individual("blond", "blue", "average")
i05 = Individual("blond", "green", "tall")
i06 = Individual("brown", "brown", "average")
i07 = Individual("red", "green", "small")
i08 = Individual("red", "green", "small")
i09 = Individual("brown", "green", "tall")
i10 = Individual("black", "brown", "average")
i11 = Individual("brown", "green", "small")
i12 = Individual("red", "blue", "average")
i13 = Individual("red", "green", "tall")
i14 = Individual("brown", "brown", "huge")
i15 = Individual("brown", "green", "average")
i16 = Individual("red", "blue", "tall")
i17 = Individual("red", "green", "small")
i18 = Individual("brown", "brown", "tiny")
i19 = Individual("blond", "brown", "average")


g1_t1 = Group([
    Family([i01, i02]),
    Family([i03, i04, i05, i06]),
])

g2_t1 = Group([
    Family([i07, i08, i09]),
    Family([i10, i11]),
    Family([i12, i13]),
])

g3_t1 = Group([
    Family([i14, i15, i16, i17]),
    Family([i18, i19]),
])


# groups in t2 with some migrations / changes
g1_t2 = Group([
    Family([i01, i02, i17]),  # i17 migrated from f1 of g3
    Family([i03, i04, i05, Individual("brown", "brown", "tall")]),  # i06 recorded with different height
])

g2_t2 = Group([
    Family([i07, i08, i09]),
    #  Family([i10, i11]),  migrated to g3
    Family([i12, i13]),
])

g3_t2 = Group([
    Family([i14, i15, i16]),  # i17 migrated to f1 of g1
    Family([Individual("black", "brown", "tiny"), i19]),  # i18 recorded with wrong hair-color
    Family([i10, i11]),  # migrated from g2
])


t1 = [g1_t1, g2_t1, g3_t1]
t2 = [g1_t2, g2_t2, g3_t2]


# rows index the t2 groups, columns the t1 groups
matches = np.zeros((len(t2), len(t1)))
for X, gX_t1 in enumerate(t1):
    for Y, gY_t2 in enumerate(t2):
        # average both directions, since the mean-of-max metric is not symmetric
        matches[Y, X] = (gX_t1.match(gY_t2) + gY_t2.match(gX_t1)) / 2

print(matches)

This gives the following matrix for the matches:

[[0.85648148 0.4691358  0.41435185]
 [0.31944444 0.87037037 0.43287037]
 [0.39583333 0.59259259 0.70833333]]

The best matches are on the diagonal, as they should be: each diagonal entry pairs a group with itself at the other time.

The similarity between different groups (the rest of the matrix) is still relatively high. The reason is that there are only 3 characteristics, and they are not very diverse either; the mean-of-max metric exaggerates this further. This "noise" should go down if you have more characteristics to match against, especially if those characteristics are more diverse.
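
Since the question mentions characteristics recorded with error, you could also swap the exact equality in Individual.match for a fuzzy string similarity. A sketch using the standard library's difflib (libraries like rapidfuzz offer faster equivalents):

import statistics
from difflib import SequenceMatcher


def fuzzy_individual_match(self, other):
    # string-similarity ratio in [0, 1] instead of exact equality, so a
    # small recording error ("average" vs "averag") still scores highly
    return statistics.mean(
        SequenceMatcher(None, a, b).ratio()
        for a, b in ((self.X, other.X), (self.Y, other.Y), (self.Z, other.Z))
    ) ** 2  # keep the squaring to put a higher weight on good matches


Individual.match = fuzzy_individual_match  # drop-in replacement for the method above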

To calculate the best matching between all groups at once, rather than just picking the best pair from the matrix, you can take a look at this question.
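
For instance, scipy's linear_sum_assignment can pick the one-to-one pairing of t1 and t2 groups that maximizes the total score (a sketch reusing the matches matrix from above):

from scipy.optimize import linear_sum_assignment

# rows of `matches` index the t2 groups, columns the t1 groups
t2_idx, t1_idx = linear_sum_assignment(matches, maximize=True)
for x, y in zip(t1_idx, t2_idx):
    print(f"t1 group {x} <-> t2 group {y}: score {matches[y, x]:.3f}")

On the matrix above this recovers the diagonal pairing.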
