I'm attempting to link records between datasets with no common key to identify matches. For both datasets I may have none, one, or more addresses per record.
What is the best way to set up the Python dedupe library to handle fields that hold lists of values? I've pored over Google and the documentation and haven't found anything specific.
Thanks!
import dedupe
# Not sure what to do here
fields = [
    {
        'field': 'address',
        'type': 'String'
    },
    {
        'field': 'addresses',
        'type': 'String'
    },
    {
        'field': 'name',
        'type': 'String'
    }
]
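# dedupe expects each dataset as a dict mapping record ID -> record;
# the IDs 'left-1' and 'right-1' are placeholders
left_data = {
    'left-1': {
        'name': 'john doe',
        'addresses': ['11 Washington Ave', '21 Jump St.']
    }
}
right_data = {
    'right-1': {
        'name': 'jon doee',
        'address': '11 Washington Avneue'
    }
}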
linker = dedupe.RecordLink(fields)
linker.prepare_training(left_data, right_data, sample_size=1000)
dedupe.console_label(linker)
linker.train()
linked_records = linker.join(left_data, right_data, 0.0)
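
One possible direction, offered as a sketch rather than a confirmed recipe: put every record's addresses into a single `addresses` field (a tuple, or None when there are none) and compare the two tuples with a custom comparator that scores the closest pair. The helper names below (`as_address_tuple`, `normalize_record`, `best_address_distance`) are hypothetical, and the similarity measure is the standard library's `difflib`, not whatever dedupe uses internally. The field definition at the end assumes the dict-style variable definitions of dedupe 2.x with the `Custom` type and the `has missing` flag; check that against the version you have installed.

```python
from difflib import SequenceMatcher


def as_address_tuple(value):
    """Normalize None, a single string, or a list of strings to a tuple."""
    if value is None:
        return ()
    if isinstance(value, str):
        return (value,) if value.strip() else ()
    return tuple(v for v in value if v and v.strip())


def normalize_record(record):
    """Merge 'address' and 'addresses' into one 'addresses' field.

    An empty result becomes None so it can be treated as missing.
    """
    merged = (as_address_tuple(record.get('address'))
              + as_address_tuple(record.get('addresses')))
    return {'name': record.get('name'), 'addresses': merged or None}


def best_address_distance(left, right):
    """Distance in [0, 1] between the closest pair of addresses.

    Falls back to 1.0 if either side is empty; that choice is an
    assumption, since missing values are meant to be filtered out
    before the comparator is called.
    """
    left, right = as_address_tuple(left), as_address_tuple(right)
    if not left or not right:
        return 1.0
    best = max(
        SequenceMatcher(None, a.lower(), b.lower()).ratio()
        for a in left
        for b in right
    )
    return 1.0 - best


# Assumed dedupe 2.x style definition: one list-valued field on both sides.
fields = [
    {'field': 'name', 'type': 'String'},
    {
        'field': 'addresses',
        'type': 'Custom',
        'comparator': best_address_distance,
        'has missing': True,
    },
]
```

Each record on both sides would be passed through `normalize_record` before being handed to the linker, so that the left and right datasets share the same field names. Scoring the closest pair means one good address match is enough, however many other addresses a record carries; whether that is the right rule depends on the data.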