Structuring dedupe results in a database

Asked Jul 15 '17 at 12:54


I am using the Python library dedupe to find duplicate organization names in my data. Most of the examples focus on how to process the data, not on how to store and use the results afterwards. Are there any best practices for taking the results, putting them into a database, and querying to group the records that are duplicates?

My thought so far is to structure the two tables like this (using SQLAlchemy), but I feel like something is off about it:

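from sqlalchemy import Column, ForeignKey, Integer, String
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()


class Organization(Base):
    __tablename__ = 'organization'

    id = Column(Integer, primary_key=True)
    name = Column(String)
    # Points at the cluster this organization was assigned to by dedupe
    cluster_id = Column(Integer, ForeignKey('duplicate_organization.cluster_id'))


class DuplicateOrganization(Base):
    __tablename__ = 'duplicate_organization'

    id = Column(Integer, primary_key=True)
    cluster_id = Column(Integer)
    name = Column(String)
    # All organizations that share this cluster_id
    organizations = relationship("Organization")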
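
To make the "store the results, then query for groups" part concrete, here is a minimal sketch of one possible layout, not an established best practice. It uses the standard-library sqlite3 module instead of SQLAlchemy so that it runs without extra dependencies. It assumes the clustering step yields one (record ids, scores) pair per cluster; the table name entity_map, the helper duplicate_groups, and the sample rows are all hypothetical.

```python
import sqlite3

# Hypothetical clustering output: one (record_ids, scores) pair per cluster.
clustered_dupes = [
    ((1, 2), (0.95, 0.91)),
    ((3,), (1.0,)),
]

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE organization (
        id INTEGER PRIMARY KEY,
        name TEXT
    );
    -- One row per organization; duplicates share the same cluster_id.
    CREATE TABLE entity_map (
        organization_id INTEGER PRIMARY KEY REFERENCES organization(id),
        cluster_id INTEGER NOT NULL,
        score REAL
    );
    CREATE INDEX entity_map_cluster ON entity_map (cluster_id);
""")

# Illustrative sample data (not from the original post).
conn.executemany(
    "INSERT INTO organization (id, name) VALUES (?, ?)",
    [(1, "Acme Inc"), (2, "ACME Incorporated"), (3, "Globex")],
)

# Flatten the clusters into (organization_id, cluster_id, score) rows.
rows = [
    (record_id, cluster_id, score)
    for cluster_id, (record_ids, scores) in enumerate(clustered_dupes)
    for record_id, score in zip(record_ids, scores)
]
conn.executemany("INSERT INTO entity_map VALUES (?, ?, ?)", rows)


def duplicate_groups(conn):
    """Return {cluster_id: [names]} for clusters with more than one member."""
    query = """
        SELECT m.cluster_id, o.name
        FROM entity_map AS m
        JOIN organization AS o ON o.id = m.organization_id
        WHERE m.cluster_id IN (
            SELECT cluster_id FROM entity_map
            GROUP BY cluster_id HAVING COUNT(*) > 1
        )
        ORDER BY m.cluster_id, o.id
    """
    groups = {}
    for cluster_id, name in conn.execute(query):
        groups.setdefault(cluster_id, []).append(name)
    return groups


groups = duplicate_groups(conn)
```

The idea behind this layout is that the mapping table can be dropped and rebuilt each time the clustering is rerun, without touching the organization rows themselves.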

asked Jul 15 '17 at 12:54

Casey

