I have a catalog of electronic products stored in a SQL database, in columns such as Title, Mfg Part Nr, UPC, etc. I then crawl external websites that list electronic products, e.g. Amazon. For the most part this yields raw HTML text, though I can usually extract individual fields such as the Title. I need to determine whether this HTML text (the content of a page on an external website) describes a product I have.

I understand that this comparison would not be exact, i.e. I am not expecting it to be correct 100% of the time. Is there any way to do this?

While it would be difficult to provide a complete example, let us limit the comparison to just the Titles of two products.

Title I have: Motorola Talkabout MH230R Portable - two-way radio - FRS/GMRS 22-channel - yellow ( pack of 3 )

Amazon’s Title: Motorola MH230TPR Giant Rechargeable Two Way Radio 3 Pack, FRS/GMRS

These represent the same product. Is there any way to determine whether two such titles are similar or the same? A simple text comparison would not do.

It would be great if there are tools out there to handle this problem. If not I’d appreciate the algorithm or some pointers which I could use to research this area further.

I know C# and Java. I have used a bit of AI/Neural Networks in relation to numerical analysis – particularly Back Propagation and Genetic Algorithm – in comparing images and finding optimal points. I however have no clue how to handle text data.

Please let me know if this question is unclear, and I will try to clarify my description. Thank you all.

O.O.

1 Answer

There are of course many algorithms that deal with text similarity and string distance measures (Wikipedia has a short list of them). Here are some ideas on how to approach this problem more specifically:

  • Set up a dictionary of brand names, and give a high weight in your overall similarity function to two product strings sharing the same brand name.
  • Assign a high similarity value when longer numbers (such as the digits of a model number) match.
  • Normalize the input strings to get rid of hyphens, slashes, brackets and similar punctuation.
  • Use more than one similarity measure and combine the results.
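
The four ideas above can be combined in a small scoring function. What follows is only a minimal sketch in Python: the brand dictionary, the weights and the function names are all made up for illustration and would need tuning against real data. It uses the two titles from the question as input.

```python
import re
from difflib import SequenceMatcher

# Hypothetical brand dictionary (alias -> canonical name); a real one would be much larger.
BRAND_ALIASES = {
    "motorola": "motorola",
    "hp": "hp",
    "hewlett packard": "hp",
}

def normalize(title):
    """Lowercase the title and replace every run of punctuation with one space."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()

def find_brand(norm):
    """Return the canonical brand if a known alias occurs as whole words, else None."""
    padded = f" {norm} "
    for alias, canonical in BRAND_ALIASES.items():
        if f" {alias} " in padded:
            return canonical
    return None

def jaccard(a, b):
    """Share of common elements between two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def long_numbers(norm, min_len=3):
    """Digit runs of at least min_len digits, e.g. the '230' in 'mh230r'."""
    return set(re.findall(r"\d{%d,}" % min_len, norm))

def title_similarity(t1, t2):
    """Weighted mix of four measures; the weights are arbitrary assumptions."""
    n1, n2 = normalize(t1), normalize(t2)
    b1, b2 = find_brand(n1), find_brand(n2)
    brand = 1.0 if b1 is not None and b1 == b2 else 0.0
    numbers = jaccard(long_numbers(n1), long_numbers(n2))
    words = jaccard(set(n1.split()), set(n2.split()))
    chars = SequenceMatcher(None, n1, n2).ratio()
    return 0.3 * brand + 0.3 * numbers + 0.2 * words + 0.2 * chars

catalog = "Motorola Talkabout MH230R Portable - two-way radio - FRS/GMRS 22-channel - yellow ( pack of 3 )"
amazon = "Motorola MH230TPR Giant Rechargeable Two Way Radio 3 Pack, FRS/GMRS"
print(round(title_similarity(catalog, amazon), 3))
```

Any decision threshold on the resulting score (for example "treat everything above 0.6 as a candidate match") is again an assumption that should be checked against pairs you have already matched by hand.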

Generally you get better results if you put your own knowledge about such strings into the code you write, rather than relying on general-purpose approaches. But since you come from an AI/neural net background, you could also use machine learning techniques to find out what makes strings similar, provided you generate useful descriptors (features) of your input strings. For that you need a sufficiently large set of product strings that have already been matched correctly.
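
To make the machine learning route a little more concrete, here is a rough sketch in plain Python. Each pair of titles is turned into a few numeric descriptors, and a tiny logistic regression is trained on pairs that have already been labelled as match or non-match. The three features, the training pairs (apart from the pair taken from the question) and the hyperparameters are invented for illustration; a real system would need far more labelled pairs and better features.

```python
import math
import re

def words(title):
    """Lowercased alphanumeric tokens, in order of appearance."""
    return re.findall(r"[a-z0-9]+", title.lower())

def features(a, b):
    """Three hand-picked descriptors of a title pair (assumptions, not a standard set)."""
    wa, wb = words(a), words(b)
    sa, sb = set(wa), set(wb)
    union = sa | sb
    token_overlap = len(sa & sb) / len(union) if union else 0.0
    shared_number = 1.0 if set(re.findall(r"\d{3,}", a)) & set(re.findall(r"\d{3,}", b)) else 0.0
    same_first_word = 1.0 if wa and wb and wa[0] == wb[0] else 0.0
    return [token_overlap, shared_number, same_first_word]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(pairs, epochs=500, lr=0.5):
    """Logistic regression fitted with plain stochastic gradient descent."""
    weights, bias = [0.0, 0.0, 0.0], 0.0
    for _ in range(epochs):
        for a, b, label in pairs:
            x = features(a, b)
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - label
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias

def match_probability(model, a, b):
    weights, bias = model
    return sigmoid(sum(w * xi for w, xi in zip(weights, features(a, b))) + bias)

# Toy training data: label 1 = same product, 0 = different. Made up for illustration,
# except the first pair, which is the example from the question.
LABELLED = [
    ("Motorola Talkabout MH230R Portable - two-way radio - FRS/GMRS 22-channel - yellow ( pack of 3 )",
     "Motorola MH230TPR Giant Rechargeable Two Way Radio 3 Pack, FRS/GMRS", 1),
    ("HP LaserJet 1020 Printer", "Hewlett-Packard LaserJet 1020 laser printer", 1),
    ("Motorola Talkabout MH230R two-way radio", "Motorola T600 waterproof two-way radio", 0),
    ("Sony MDR-7506 headphones", "Linksys WRT54G wireless router", 0),
]

model = train(LABELLED)
for a, b, label in LABELLED:
    print(label, round(match_probability(model, a, b), 2))
```

With so few examples the model mostly learns that a shared long number signals a match, which is the same heuristic as before. The potential benefit only appears once the training set is large enough for the weights to reflect patterns you did not think of yourself.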

But maybe you need something very simple? Then look into agrep, a grep variant that performs approximate (fuzzy) string matching.
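
To illustrate the kind of matching agrep does (finding a pattern inside a text while tolerating a limited number of errors), here is a small Python sketch. It uses a straightforward dynamic-programming edit distance, not agrep's own algorithm, and the function names are made up for this example.

```python
def best_match_errors(pattern, text):
    """Smallest edit distance between pattern and any substring of text."""
    pattern, text = pattern.lower(), text.lower()
    # Row 0 is all zeros: a match may start at any position in the text for free.
    prev = [0] * (len(text) + 1)
    for i, pc in enumerate(pattern, start=1):
        curr = [i]  # matching i pattern characters against nothing costs i
        for j, tc in enumerate(text, start=1):
            cost = 0 if pc == tc else 1
            curr.append(min(prev[j] + 1,          # pattern character left unmatched
                            curr[j - 1] + 1,      # extra character in the text
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    # The match may also end anywhere, so take the best value of the last row.
    return min(prev)

def fuzzy_contains(pattern, text, max_errors=1):
    """True if pattern occurs in text with at most max_errors edits."""
    return best_match_errors(pattern, text) <= max_errors

amazon = "Motorola MH230TPR Giant Rechargeable Two Way Radio 3 Pack, FRS/GMRS"
print(fuzzy_contains("MH230R", amazon, max_errors=1))  # model number with one error allowed
print(fuzzy_contains("MH230R", amazon, max_errors=0))  # exact match required
```

Searching for your own part number in the crawled text with a small error budget is probably the cheapest check you can run before trying anything more elaborate.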

luksch
  • Thank you very much. I have set up heuristics like the ones you suggest. This works to an extent, but I would like to know whether AI can solve this problem better. While I have some AI experience, I have no clue how to handle text and language. – O.O. May 07 '13 at 21:46
  • Your solution seems to rely on brand names, but that does not work for me because some brands are ambiguous: HP, for example, also appears as Hewlett-Packard, H-P, etc. Another example is that Linksys and Cisco refer to the same thing. Putting this information into rules is too expensive. I have a system for resolving unmatched products manually. I think an AI system could learn from this data, so that I do not have to write the rules myself. – O.O. May 07 '13 at 21:46
  • I just hinted at it. I think machine learning techniques can indeed help in finding relevant descriptors for your problem set, but there will always be some leftover uncertainty that has to be resolved manually. I still think matching brand names is a good idea, once you allow H-P and Hewlett-Packard to refer to the same brand. This is a huge task you are attempting, and there are companies out there that sell unified product catalogues like the one you want to create for good money. I happen to know such a company, and they have been in this business for years. They can't do without manual work. – luksch May 07 '13 at 22:07
  • nltk is a library for natural language processing in Python. You can use it to compare the grammar used, if you want. In my current project I am trying to predict the gender of Twitter posts using machine learning. IMHO machine learning will improve your results, but do not expect totally awesome results. – Coding Jun 06 '17 at 11:06