I'm working with a large dataset of products (~1 million). These products come from many different sources, so the way their data is listed is inconsistent. One of the big issues is variation in product brand names (~17,000 unique brands). Some brands have as many as 10 variants that need to be related together.
Issues:
- Inconsistent Spacing: Jet Boil VS Jetboil
- Punctuation: Granger's VS Grangers
- Noise Words: The North Face VS North Face
- Taxonomies: Armada VS Armada Skis
- Symbols: Phil and Teds VS Phil&Teds
- Misspelling: Patagonia VS Pategonia
- Other Oddities: Bell Sports VS Bell Sports #81037
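Most of the issue types above (all except misspellings) could in principle be handled by reducing each name to a comparison key before matching. The following is only a minimal sketch of that idea, not a tested solution: the function name `brand_key` and the word lists `NOISE_WORDS` and `CATEGORY_WORDS` are hypothetical, and a real dataset would need hand-curated lists.

```python
import re

# Hypothetical word lists; these would have to be curated for the real data.
NOISE_WORDS = {"the"}
CATEGORY_WORDS = {"skis"}

def brand_key(name: str) -> str:
    """Reduce a brand name to a crude comparison key."""
    s = name.lower()
    s = s.replace("&", " and ")         # Symbols: Phil&Teds -> phil and teds
    s = re.sub(r"#\d+", " ", s)         # Other oddities: codes such as #81037
    s = re.sub(r"[^a-z0-9\s]", "", s)   # Punctuation (also drops non-ASCII letters)
    words = [w for w in s.split()
             if w not in NOISE_WORDS and w not in CATEGORY_WORDS]
    return "".join(words)               # Spacing: Jet Boil -> jetboil

# Names that share a key become candidates for the same brand.
print(brand_key("The North Face") == brand_key("North Face"))
print(brand_key("Patagonia") == brand_key("Pategonia"))
```

The first comparison prints `True` and the second prints `False`: a key like this does nothing for misspellings, which need some form of approximate matching on top.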
Example Dataset
Black Diamond
Black Diamond (Uda)
Black Diamond Co
Black Diamond Eq Ltd
Black Diamond Eqp #76800
Black Diamond Equipment
Black Dog Machine Llc
Black Dome Press
Black Dot
Black Dragon
Black Fire
Black Flys
Black Forest Girl
Black Gold
Black Hawk Inc.
Black Hills
Black Knight
Black Label
Black Magic
Black Marine
Black Market Bikes
Black Max
Black Opal
Black Ops
Black Rain Ordance Inc.
Black Rain Ordnance
Black Rapid
Black Ribbon
Black Rifle Disease Engineerin
Black River Bucks
Black Seal
Black Seed
Black Swan
Black Tower
Black Widow
Black's
Consequences (as suggested in a comment)
- An incorrect association will result in unrelated brands being displayed in product searches and thus weaken the usability of the presentation layer
- Missing an association will result in the same brand being displayed multiple times in a filter list and thus weaken the usability of the presentation layer
I realize that this is a large problem and likely beyond the scope of what can be resolved in a Stack Overflow post, but I'm looking for inspiration on how to tackle it.
Any algorithm, software pattern, or process that may help is welcome.