Find similar items in a dataset

Question

I have a dataset of 500 mobile devices, each with 10 attributes:

Date|Company|ModelName|Price|HardDisk|RAM|Colour|Display size|Cam1|Cam2

A sample of the dataset is given below:

24/10/2015   |   walmart   |  Samsung Galaxy Note 4 N910H 32GB Unlocked GSM OctaCore Cell Phone-N910H 32GB GOLD   |   599.99  |   32   |  N/A   |  cell gold             |  N/A   | 10.2 | 16   
25/10/2015  |  walmart  |  Samsung Galaxy Note 5  SM-N920i Gold International Model Unlocked GSM Mobile Phone    |  717.95  |  32   |   N/A   |  gold  |    N/A   |  5.7    |   16  
26/10/2015  |  amazon  |   T-Mobile AllShare Cast Wireless Hub   |   65.15  |   N/A |  N/A  |  streaming    |   N/A  |  N/A   |  N/A

I have to find the most similar or unique devices, or remove duplicate mobile devices from the dataset, taking into account the various attributes of the mobile devices.

I have explored several similarity measures, such as Jaccard similarity, cosine similarity, and Levenshtein distance, but they seem to work only on attributes of the same datatype.

Please suggest an algorithm or approach that could work on this type of mixed-datatype dataset, taking into account almost all attributes.

score 1 · Answer 1 · answered Dec 09 '15 at 22:42

You can compute the hash code of each row.

Then use the difference of the hash codes as a similarity measure.

Obviously, this depends on all the attributes.

It is very good for finding duplicates: identical rows always produce identical hash codes.

It may not suit your application, but you did not specify what counts as good for your application.
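A minimal sketch of the duplicate-finding use, assuming rows are pipe-delimited strings as in the question and that two rows count as duplicates when every field matches after trimming whitespace and lowercasing. The function names `row_hash` and `find_duplicates` are illustrative, and SHA-256 is just one possible choice of hash. Note that with a general-purpose hash like this, equal digests flag identical normalized rows, but the numeric difference between two unequal digests is not expected to reflect how alike the rows are, so the sketch only groups exact matches.

```python
import hashlib
from collections import defaultdict

def row_hash(row, sep="|"):
    """Hash a delimited row after trimming and lowercasing each field."""
    fields = [f.strip().lower() for f in row.split(sep)]
    canonical = sep.join(fields)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def find_duplicates(rows, sep="|"):
    """Return lists of row indices whose normalized content is identical."""
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        groups[row_hash(row, sep)].append(i)
    return [idxs for idxs in groups.values() if len(idxs) > 1]

rows = [
    # Third sample row from the question, with its original spacing.
    "26/10/2015  |  amazon  |   T-Mobile AllShare Cast Wireless Hub   |   65.15  |   N/A |  N/A  |  streaming    |   N/A  |  N/A   |  N/A",
    # Hypothetical re-spaced copy of the same row, added for illustration.
    "26/10/2015|amazon|T-Mobile AllShare Cast Wireless Hub|65.15|N/A|N/A|streaming|N/A|N/A|N/A",
    # Second sample row from the question.
    "25/10/2015  |  walmart  |  Samsung Galaxy Note 5  SM-N920i Gold International Model Unlocked GSM Mobile Phone    |  717.95  |  32   |   N/A   |  gold  |    N/A   |  5.7    |   16  ",
]

print(find_duplicates(rows))  # [[0, 1]]
```

Because the hash covers every field, rows that differ only in `Date` or `Price` would not be grouped; dropping such fields before hashing is one way to loosen the match, if that fits the application.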
