I'm attempting to link records between two datasets that have no common key to identify matches. In both datasets, a record may have zero, one, or several addresses.

How do you best set up the Python dedupe library to handle lists? I've pored over Google and the documentation and haven't found anything specific.

Thanks!

import dedupe

# Not sure what to do here
fields = [
    {
        'field': 'address', 
        'type': 'String'
    },
    {
        'field': 'addresses', 
        'type': 'String'
    },
    {
        'field': 'name', 
        'type': 'String'
    }
]

left_data = {
    'name': 'john doe',
    'addresses': ['11 Washington Ave', '21 Jump St.']
}

right_data = {
    'name': 'jon doee',
    'address': '11 Washington Avneue'
}

linker = dedupe.RecordLink(fields)
linker.prepare_training(left_data, right_data, sample_size=1000)

dedupe.console_label(linker)
linker.train()

linked_records = linker.join(left_data, right_data, 0.0)
Douglas Plumley
  • `if right_data['address'] in left_data['addresses']`? – n1c9 Jun 30 '20 at 18:27
  • @n1c9 I want to do the comparison using the dedupe library. I updated the example to intentionally misspell things to show this isn't a straight comparison. There has to be some form of "fuzzy" matching, which is where the dedupe library comes in. – Douglas Plumley Jun 30 '20 at 18:38
  • I have the same problem. I am trying to deal with it by duplicating the field for each element of the array... But there has to be a more elegant solution – popololvic May 05 '21 at 09:48
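
The workaround popololvic describes, one field per array element, could be sketched roughly as below. This is only a sketch under assumptions: `spread_addresses`, `address_fields`, the `address_N` column names, and the cap of two addresses are hypothetical and not part of dedupe. The `'has missing': True` key follows the dict-style field definitions used in the question. It only reshapes the records and builds the field list; it does not call dedupe itself.

```python
def spread_addresses(record, max_addresses):
    """Return a copy of `record` with its addresses in numbered columns.

    Hypothetical helper, not part of dedupe.
    """
    # Accept a list under 'addresses' and/or a single string under 'address'.
    addresses = list(record.get('addresses') or [])
    if record.get('address'):
        addresses.append(record['address'])

    # Copy every other key unchanged.
    flat = {k: v for k, v in record.items()
            if k not in ('address', 'addresses')}

    for i in range(max_addresses):
        # Pad with None so every record has the same columns.
        # Addresses beyond max_addresses are silently dropped.
        flat[f'address_{i + 1}'] = addresses[i] if i < len(addresses) else None
    return flat


def address_fields(max_addresses):
    """Build one String field definition per numbered address column."""
    return [
        {'field': f'address_{i + 1}', 'type': 'String', 'has missing': True}
        for i in range(max_addresses)
    ]


MAX_ADDRESSES = 2  # assumed cap, chosen only for this example

left = spread_addresses(
    {'name': 'john doe', 'addresses': ['11 Washington Ave', '21 Jump St.']},
    MAX_ADDRESSES,
)
right = spread_addresses(
    {'name': 'jon doee', 'address': '11 Washington Avneue'},
    MAX_ADDRESSES,
)
fields = [{'field': 'name', 'type': 'String'}] + address_fields(MAX_ADDRESSES)
```

Numbered columns are compared position by position, so the same address sitting in a different slot on each side would not line up. That limitation is likely part of why this approach feels inelegant.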

0 Answers