1

Two questions:

  • How should I interpret the 'confidence score' when a cluster has 3 rows and 3 confidence scores (0.98, 0.45, 0.45)? Where do these confidence scores come from: the logistic regression, or somehow the hierarchical clustering?

  • 10,000 of my 16 million records are labeled as duplicates. Should I use all of them as training data, or will only 10 positive and 10 negative examples be enough? Which amount is better for quality and execution time?

lubom
  • 329
  • 2
  • 13

1 Answer

1

The confidence score is 1 minus the square root of the average squared distance between the record and the other records in the cluster, where the distance is 1 minus the predicted probability that a pair of records is coreferent.

See https://docs.dedupe.io/en/latest/API-documentation.html#dedupe.Dedupe.cluster for more details.
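
To make the formula concrete, here is a minimal sketch in plain Python. It is not dedupe's own code: the function name `confidence_scores`, the `frozenset`-keyed probability dictionary, and the example probabilities are all assumptions made up for illustration. It only implements the calculation described above for a single cluster.

```python
import math
from itertools import combinations


def confidence_scores(record_ids, pair_probability):
    """Per-record confidence for one cluster (illustrative helper, not a dedupe API).

    pair_probability maps frozenset({a, b}) to the predicted probability
    that records a and b are coreferent.
    """
    n = len(record_ids)
    if n < 2:
        raise ValueError("a cluster needs at least two records")

    # Accumulate squared distances, where distance = 1 - probability.
    squared_sum = {rid: 0.0 for rid in record_ids}
    for a, b in combinations(record_ids, 2):
        squared_distance = (1.0 - pair_probability[frozenset((a, b))]) ** 2
        squared_sum[a] += squared_distance
        squared_sum[b] += squared_distance

    # Average over the (n - 1) other records, then 1 - square root.
    return {
        rid: 1.0 - math.sqrt(total / (n - 1))
        for rid, total in squared_sum.items()
    }


# Made-up pairwise probabilities for a three-record cluster.
example_probabilities = {
    frozenset(("a", "b")): 0.9,
    frozenset(("a", "c")): 0.8,
    frozenset(("b", "c")): 0.5,
}
scores = confidence_scores(["a", "b", "c"], example_probabilities)
for record_id, score in scores.items():
    print(record_id, round(score, 3))
```

With these made-up inputs, record "a" scores highest because it is, on average, closest to the other two records. This is why records in the same cluster can have different confidence scores.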

fgregg
  • 3,173
  • 30
  • 37