-2

I am trying to automate a process that is normally done by hand. There is a database full of unique keys that gets updated quarterly. This update will have some of the same exact keys and sometimes it won't but they will be similar. They will add something to the front and end or add some zeros. Here is an example of some of the already established keys and their matches:

0509000004 -> 594

0509000005 -> 595

0509000006 -> 596

0510000003 -> 5103

0311100000004000 -> 099031110000000400

0311100000004020 -> 099031110000000402

063050000100 -> 63500100

06308C000400 -> 638C0400

So as you can see there is always variation to the way the key is changed and the next time I get data it can have a few more variations. I wish I could tell the provider of the data to stop changing it but that is not up to me.

I have tried using the Levenshtein Distance algorithm like fuzzy searching in conjunction with stripping of certain numbers but it does not always get it correctly due to the large variation.

Any suggestions would be much appreciated!

Mnazanda
  • 5
  • 3
  • 1
    Your numbers look entirely random to me. If there's no clear rule as to how they change the numbers, you cannot reliably automate this kind of matching with 100% certainty. Every time you'll find a match, you'll have to wonder if you found the right match or something that's similar but should be ignored. – xxbbcc Jul 02 '15 at 15:38
  • Your points are valid. It is pretty random which is why I am struggling. But you can see that there is a pattern for each of these. I don't know if I should do multiple rounds with each pattern that I know of or if there is a better solution out there – Mnazanda Jul 02 '15 at 15:43
  • What about a Regex Replace Dictionary where you just fill in your patterns? Probably stored in a configuration file? – Matt Jul 02 '15 at 15:45
  • All I see is various violations of an assumed pattern. – xxbbcc Jul 02 '15 at 15:46
  • I will look into that Matt. Thank you. – Mnazanda Jul 02 '15 at 15:47
  • Why `0509000006` maps to `596`, not `5906` (like the one following it)? – xxbbcc Jul 02 '15 at 15:47
  • @xxbbcc - I'm not sure I know exactly what you mean. There are three pretty distinct patterns. Can you explain more? – Mnazanda Jul 02 '15 at 15:49
  • If you see the 5103 one. The 2 digits following the 5 are kept and then the last digit is kept as well – Mnazanda Jul 02 '15 at 15:50
  • Actually I see what you are saying. That would make the others 5096 – Mnazanda Jul 02 '15 at 15:51
  • I guess it would be the one directly after the 5 is kept if it is not 0 – Mnazanda Jul 02 '15 at 15:52

1 Answers1

1

Here is a small example, where you just need to maintain the regexes

var list = new[]
    {
        "0509000004", "0509000005", "0509000006", "0510000003", 
        "0311100000004000", "0311100000004020",
        "063050000100", "06308C000400"
    };


var results = new List<string>();

List<Tuple<string, string>> regexes = new List<Tuple<string, string>>{
    new Tuple<string, string>( "^0(5)(?:0(\\d)|(\\d{1,2}))0*(\\d*)", "$1$2$3$4"),
    new Tuple<string, string>( "^(03111)(0+)([4]\\d{2})0$", "099$1$2$3"),
};

foreach (var number in list)
{
    foreach (var regex in regexes)
    {
        if (Regex.IsMatch(number, regex.Item1))
        {
            results.Add(Regex.Replace(number, regex.Item1, regex.Item2));
            break;
        }
    }
}
Matt
  • 4,612
  • 1
  • 24
  • 44