6

I'm working with a large dataset of products(~1 million). These products come from many different sources and thus the way they have data listed in inconsistent. One of the big issues is variances product Brand names (~17,000 unique brands). Some brands have as many as 10 variances that need to be related together.

Issues:


  1. Inconsistant Spacing: Jet Boil VS Jetboil
  2. Punctuation: Granger's VS Grangers
  3. Noise Words: The North Face VS North Face
  4. Taxomonies: Armada VS Armada Skis
  5. Symbols: Phil and Teds VS Phil&Teds
  6. Mis-spelling: Patagonia VS Pategonia
  7. Other Oddities: Bell Sports VS Bell Sports #81037

Example Dataset


Black Diamond
Black Diamond (Uda)
Black Diamond Co
Black Diamond Eq Ltd
Black Diamond Eqp #76800
Black Diamond Equipment
Black Dog Machine Llc
Black Dome Press
Black Dot
Black Dragon
Black Fire
Black Flys
Black Forest Girl
Black Gold
Black Hawk Inc.
Black Hills
Black Knight
Black Label
Black Magic
Black Marine
Black Market Bikes
Black Max
Black Opal
Black Ops
Black Rain Ordance Inc.
Black Rain Ordnance
Black Rapid
Black Ribbon
Black Rifle Disease Engineerin
Black River Bucks
Black Seal
Black Seed
Black Swan
Black Tower
Black Widow
Black's

Consequences (as suggested in a comment)

  • An incorrect association will result in unrelated brands being displayed in product searches and thus weaken the usability of the presentation layer
  • Missing an association will result in the same brand being displayed multiple in a filter list and thus weaken the usability of the presentation layer

I realize that is is a large problem and likely beyond the scope of what can be resolved in a stack overflow article, but I'm looking for inspirations on how to tackle this problem.

Any algorithm, software pattern, or process that may help is welcome.

NSjonas
  • 10,693
  • 9
  • 66
  • 92
  • 1,2,3 and 5 seems easy. I did not understand 4. 6 can be fixed by a dictionary or ancyclopedia resource file(finding the closest match). 7: same occurences? easy. – huseyin tugrul buyukisik Sep 07 '12 at 20:16
  • 4. is the base brand name vs the brand + what they make. EX: LiquidLogic vs LiquidLogic Kayak's. They are the same brand and need to be related together. – NSjonas Sep 07 '12 at 20:20
  • Look at the dataset i just added. Complexity is added because you have brand Matches like Columbia vs Columbia River Sports, which in fact are 2 different brands – NSjonas Sep 07 '12 at 20:21
  • You can parse all different words then put inside a hashtable. Then start over again but this time you use your hashtable/hashmap to compare your parsed lines. Then, for each similar word you add another branch deeper. – huseyin tugrul buyukisik Sep 07 '12 at 20:25
  • 1
    A useful thing for you to know would be: 1, The consequences of missing a valid name match; and 2, The consequences of making a wrong match. I think these may affect any algorithm. – Paddy3118 Sep 08 '12 at 08:02
  • good point, I'll make an update – NSjonas Sep 11 '12 at 18:02

1 Answers1

3

Well, the way I would approach this would be to use some distance metric to quantify similarity between phrases and then cluster the terms by their distances.

You can start with a classical text metric like Levenshtein distance (you will find many implementations easily), which is basically the edit distance, or the number of operations you need to get from one string to another string, where an operation can be a substitution, insertion or deletion.

From the examples you gave, it seems that Levenshtein would be reasonable.

For clustering there are tons of algorithms, again this is easy to google and find tons of implementations. Clustering basically finds groups (clusters) of objects that are close to each other under a certain distance metric. In your case these would be groups of terms that are similar to each other.

Once you see the results, you can try playing with your distance metric a little bit by making manual adjustments using your knowledge about the data (like specifying that "&" is close to "and", etc).

Good luck!

Bitwise
  • 7,577
  • 6
  • 33
  • 50