Encoding record samples for expectation maximization algorithm

Question

First, I'm a programmer without a data science background, so my working knowledge of statistics is quite limited.

I'm creating an entity matching tool to match records across internal datasets. I want to use the probabilistic matching technique described in these documents*. I have a good understanding of how the technique works and how to apply it, except for the derivation of agreement/disagreement weights using expectation maximization (EM).

Specifically, I'm unclear on how to encode my record pairs into the double[][] format required for the EM implementation I have available, the Apache Commons Math MultivariateNormalMixtureExpectationMaximization.

Here is a concrete example: matching company records.

A company has two fields: name (string) and country (enum), and I want to generate the m and u probabilistic weights using EM. How do I create the double[][] dataset for each field to feed into EM?
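For context, this is how I understand the m and u probabilities turn into weights once EM has estimated them. It is a minimal sketch assuming the usual log-ratio definition (m = probability a field agrees given a true match, u = probability it agrees given a non-match); the linked documents may define the weights slightly differently, and the class name, method names, and numbers below are made up for illustration.

```java
// Hypothetical helper: turns per-field m/u probabilities into match weights.
final class MatchWeights {

    static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    // Weight added to a pair's score when the field agrees.
    static double agreementWeight(double m, double u) {
        return log2(m / u);
    }

    // Weight added to a pair's score when the field disagrees (usually negative).
    static double disagreementWeight(double m, double u) {
        return log2((1.0 - m) / (1.0 - u));
    }

    public static void main(String[] args) {
        // Made-up values; in practice these are what EM is supposed to estimate.
        double m = 0.95;
        double u = 0.05;
        System.out.println("agreement weight:    " + agreementWeight(m, u));
        System.out.println("disagreement weight: " + disagreementWeight(m, u));
    }
}
```

What I can't work out is the step before this: the shape of the input that produces m and u in the first place.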

The name field is a string, so agreement will be approximate, measured with some string similarity method (edit distance, phonetic index, etc.; the details aren't relevant here).

The country field is normalized, so agreement occurs only on an exact match. However, some countries are over-represented and others under-represented in my data, so agreement on an under-represented country should carry a higher weight than agreement on an over-represented one.
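To make the two field comparisons concrete, here is a rough sketch of one possible encoding, assuming one row per candidate record pair and one column per compared field. That layout is an assumption, not something I have confirmed the EM implementation expects. The class and method names are hypothetical, the name similarity is a plain normalized Levenshtein distance standing in for whatever method is actually used, country is a String here rather than an enum for brevity, and the frequency adjustment for rare countries is not modeled at all.

```java
import java.util.Arrays;

// Hypothetical encoder: one double[] per candidate pair, one column per field.
final class PairEncoder {

    // Similarity in [0, 1]: 1 minus the Levenshtein distance divided by the longer length.
    static double nameSimilarity(String a, String b) {
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            int[] curr = new int[b.length() + 1];
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            prev = curr;
        }
        int maxLen = Math.max(a.length(), b.length());
        return maxLen == 0 ? 1.0 : 1.0 - (double) prev[b.length()] / maxLen;
    }

    // Exact-match comparison: 1.0 on agreement, 0.0 on disagreement.
    static double countryAgreement(String a, String b) {
        return a.equals(b) ? 1.0 : 0.0;
    }

    // Column 0 = name similarity, column 1 = country agreement.
    static double[] encode(String nameA, String countryA, String nameB, String countryB) {
        return new double[] {
            nameSimilarity(nameA, nameB),
            countryAgreement(countryA, countryB)
        };
    }

    public static void main(String[] args) {
        // Made-up company records, three candidate pairs.
        double[][] data = {
            encode("Acme Ltd", "US", "Acme Ltd", "US"),
            encode("Acme Ltd", "US", "Acme Ltd.", "GB"),
            encode("Acme Ltd", "US", "Globex", "DE")
        };
        System.out.println(Arrays.deepToString(data));
    }
}
```

The sketch only builds the rows and does not call the library. Whether a layout like this is what the EM input should look like is exactly what I am unsure about.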

What exactly do the values in the inner double[] mean/represent?
How many entries/columns should there be?
How do I encode the records into the double[]?

* the documents describing the probabilistic matching technique using EM
