First, I'm a programmer without a data science background, so my working knowledge of statistics is quite limited.
I'm creating an entity matching tool to match records across internal datasets. I want to use the probabilistic matching technique described in these documents*. I have a good understanding of how the technique works and how to apply it, except for the derivation of agreement/disagreement weights using expectation maximization (EM).
Specifically, I'm unclear on how to encode my record pairs into the double[][] format required by the EM implementation I have available, the Apache Commons Math MultivariateNormalMixtureExpectationMaximization.
Here is a concrete example: matching company records.
A company has two fields: name (string) and country (enum), and I want to generate the m and u probabilistic weights using EM. How do I create the double[][] dataset for each field to feed into EM?
In the case of name, it is a string, so agreement or disagreement will be approximate, based on some string similarity method (edit distance, phonetic index, etc.; the details aren't relevant here).
In the case of country, my data is normalized, so agreement will only occur on an exact match. However, certain countries are over- and under-represented, so a record with an under-represented country should have a higher weight than one with an over-represented country.
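To make the question concrete, here is a minimal sketch of the encoding I would guess at. It is not something I know to be correct: one row per candidate record pair, one column per compared field. The PairEncoder class, the Company type, and the normalized edit-distance similarity are all made up for illustration, and the sketch does not account for the country frequency issue described above.

```java
import java.util.List;

// Hypothetical helper: turns candidate record pairs into rows of a double[][].
final class PairEncoder {

    // Hypothetical record type with the two fields from the example.
    static final class Company {
        final String name;
        final String country;
        Company(String name, String country) {
            this.name = name;
            this.country = country;
        }
    }

    // Classic Levenshtein edit distance using two rolling rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Similarity in [0, 1]: 1.0 for identical strings, 0.0 for completely different ones.
    // Assumption: any other string similarity method could be swapped in here.
    static double nameSimilarity(String a, String b) {
        int maxLen = Math.max(a.length(), b.length());
        if (maxLen == 0) {
            return 1.0;
        }
        return 1.0 - editDistance(a, b) / (double) maxLen;
    }

    // One row per pair: column 0 = name similarity, column 1 = exact country match (1.0 or 0.0).
    static double[] encodePair(Company a, Company b) {
        double countryAgreement = a.country.equals(b.country) ? 1.0 : 0.0;
        return new double[] { nameSimilarity(a.name, b.name), countryAgreement };
    }

    // Encodes every left/right combination; a real tool would likely block or filter pairs first.
    static double[][] encodeAll(List<Company> left, List<Company> right) {
        double[][] rows = new double[left.size() * right.size()][];
        int k = 0;
        for (Company a : left) {
            for (Company b : right) {
                rows[k++] = encodePair(a, b);
            }
        }
        return rows;
    }
}
```

With this layout, the pair ("Acme", US) vs. ("Acne", DE) would become {0.75, 0.0}. Whether rows like these are what a Gaussian mixture EM can sensibly consume is part of what I'm asking.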
- What exactly do the values in the inner double[] mean/represent?
- How many entries/columns should there be?
- How do I encode the records into the double[]?
* the documents describing the probabilistic matching technique using EM