I have an “entity resolution” use case: I have several (< 100) device features available for a large number of devices (a few million). My goal is to generate IDs for these devices. The challenge is that the same device might have two or more slightly different representations, and I still want to assign the same device ID to all of them.

I would appreciate recommendations on the following:

  1. What kind of feature pre-processing should I apply?
  2. Which algorithms are best suited to this purpose?
  3. Are there standard implementations of such algorithms?

Thanks and regards,

PTDS
  • One of the more important questions: is there a function that can decide whether two devices are the same (despite having different representations)? If not, all you can do is something based on clustering (which might be hard to tune; and in general, clustering is a hard problem). – sascha Aug 12 '16 at 19:29
  • @sascha How do I find out? – PTDS Aug 12 '16 at 19:31
  • @PTDS That depends on your setting. Such a function always exists in principle; the real question is whether you know it. It seems you don't, so you will have to rely on clustering. Depending on the data, tuning is hard (e.g. how many clusters to expect, how far apart representations of the same device can be, and which distance metric to use). Without much more information about the data and its statistics, not much more can be said. – sascha Aug 12 '16 at 19:36
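To make the distinction in the comments concrete, here is a minimal sketch of what assigning IDs could look like if a pairwise "same device" rule were available. Everything in it is an assumption for illustration: the feature names, the `normalize`, `same_device` and `assign_ids` helpers, and the 0.75 agreement threshold are hypothetical and do not come from the question. Records that match, directly or through a chain of matches, are merged with a union-find structure and share one ID.

```python
from itertools import combinations


def normalize(record):
    """Strip and lowercase string features so trivial variants compare equal."""
    return {
        key: value.strip().lower() if isinstance(value, str) else value
        for key, value in record.items()
    }


def same_device(a, b, threshold=0.75):
    """Hypothetical match rule: share of common features with equal values."""
    keys = a.keys() & b.keys()
    if not keys:
        return False
    agree = sum(a[k] == b[k] for k in keys)
    return agree / len(keys) >= threshold


def assign_ids(records, threshold=0.75):
    """Merge matching records with union-find; return one ID per record."""
    records = [normalize(r) for r in records]
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All-pairs comparison: fine for a demo, quadratic in the number of records.
    for i, j in combinations(range(len(records)), 2):
        if same_device(records[i], records[j], threshold):
            ri, rj = find(i), find(j)
            if ri != rj:
                parent[max(ri, rj)] = min(ri, rj)
    return [find(i) for i in range(len(records))]


# Made-up example records (illustrative only).
devices = [
    {"vendor": "Acme", "model": "X1", "os": "Android 6", "screen": "1080x1920", "lang": "en"},
    {"vendor": "acme ", "model": "X1", "os": "Android 6", "screen": "1080x1920", "lang": "en-US"},
    {"vendor": "Other", "model": "Z9", "os": "iOS 9", "screen": "750x1334", "lang": "en"},
]
ids = assign_ids(devices)
print(ids)
```

The all-pairs loop would not scale to a few million devices; in practice one would first group records by some cheap key and only compare within groups, and the choice of that key depends on the data.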

0 Answers