
Let's say I have a database of books that includes their titles. For a given listing from eBay or Craigslist or some other such site, I want to compare its title string to all of the book titles in my database to try to find a match.

It's unlikely there will ever be exact string equality, as users on those sites like to add things like "perfect condition" and "fast shipping" to their listing titles to attract buyers.

What algorithm(s) should I use to do this type of correlation? I'm aware of n-grams and Levenshtein distance, but I don't know which would do the most accurate job.

For the various applicable algorithms, how does their computational performance compare? Would it make sense to use multiple algorithms and average their results to balance their strengths and weaknesses? Would it be possible to set a minimum level of confidence? I'd rather have no match than a very poor quality match.

1 Answer


For the task at hand, I think you'd get the best results with some pre-processing: remove common "null" phrases (the ones you don't want to match on, such as "fast shipping"), so that you're left with a shorter string in which the actual title is a major part.
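As a rough illustration of that pre-processing step, here is a minimal Python sketch. The phrase list and the function names `normalize` and `strip_noise` are hypothetical; a real list of "null" phrases would have to be built from the listings you actually see.

```python
import re

# Hypothetical noise phrases; a real list would be built from observed listings.
NOISE_PHRASES = ["perfect condition", "fast shipping", "free shipping", "brand new"]


def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return " ".join(text.split())


def strip_noise(listing_title, noise_phrases=NOISE_PHRASES):
    """Return the normalized listing title with known noise phrases removed."""
    cleaned = normalize(listing_title)
    for phrase in noise_phrases:
        # Match the phrase only on word boundaries so partial words survive.
        pattern = r"\b" + re.escape(normalize(phrase)) + r"\b"
        cleaned = re.sub(pattern, " ", cleaned)
    return " ".join(cleaned.split())


print(strip_noise("The Great Gatsby - PERFECT CONDITION, Fast Shipping!"))
```

The same normalization (case folding, collapsed whitespace) should be applied to the DB titles before any comparison, otherwise trivial differences will hide real matches.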

The next step depends on your DB size and request overhead. If both are small, pull the full list of titles from your DB and check which of them appear in the eBay text (a single substring test in many languages). If that works for you, then even the pre-processing is likely unnecessary overhead.
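A minimal sketch of that containment check in Python, assuming the titles fit comfortably in memory. The names `normalize` and `titles_in_listing` are made up for the example, and the sample titles are placeholders for rows from your DB.

```python
import re


def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return " ".join(text.split())


def titles_in_listing(listing_title, db_titles):
    """Return DB titles that occur as whole-word phrases in the listing.

    Longer titles come first, on the assumption that a longer match is
    more specific. An empty list means "no match".
    """
    # Padding with spaces makes the substring test respect word boundaries.
    padded = " " + normalize(listing_title) + " "
    hits = []
    for title in db_titles:
        norm = normalize(title)
        if norm and " " + norm + " " in padded:
            hits.append(title)
    return sorted(hits, key=len, reverse=True)


db_titles = ["Dune", "Dune Messiah", "Emma"]  # placeholder data
print(titles_in_listing("Dune by Frank Herbert, fast shipping!", db_titles))
```

Because the function returns an empty list when nothing is contained in the listing, it naturally gives you "no match" rather than a poor one.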

If pulling the full DB listing is expensive but the DB is well indexed, try extracting likely n-grams (say, 2-3 words each) from the eBay text and searching for them in the DB. You should get relatively few results back, each of which you can then test in toto against the full eBay text for a match.
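The n-gram lookup could be sketched roughly as follows. An in-memory dictionary stands in for the indexed DB here, which is an assumption made purely so the example is self-contained; `word_ngrams`, `build_index`, and `match_listing` are hypothetical names.

```python
import re
from collections import defaultdict


def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return " ".join(text.split())


def word_ngrams(text, n=2):
    """Return the set of n-word sequences in the normalized text."""
    words = normalize(text).split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def build_index(db_titles, n=2):
    """Map each word n-gram to the titles containing it (stand-in for a DB index)."""
    index = defaultdict(set)
    for title in db_titles:
        for gram in word_ngrams(title, n):
            index[gram].add(title)
    return index


def match_listing(listing_title, index, n=2):
    """Look up the listing's n-grams, then verify each candidate in full."""
    candidates = set()
    for gram in word_ngrams(listing_title, n):
        candidates |= index.get(gram, set())
    padded = " " + normalize(listing_title) + " "
    # Keep only candidates whose whole title appears in the listing text.
    return sorted(t for t in candidates if " " + normalize(t) + " " in padded)


index = build_index(["The Great Gatsby", "Great Expectations", "Dune"])
print(match_listing("The Great Gatsby paperback perfect condition", index))
```

One limitation of this sketch: titles shorter than `n` words (such as "Dune" with `n=2`) produce no n-grams and are never returned, so they would need a separate single-word lookup.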

Prune
  • I don't know about that. I doubt _in toto_ would ever actually work without further processing, like changing everything to the same case, stripping leading and trailing spaces, and replacing multiple adjacent spaces with a single space. But that's the problem I'm trying to solve with an algorithm. I should be able to find a reasonably close match in most cases. What algorithm best defines "closeness" for my use case? –  Jan 14 '18 at 00:22
  • How about LCS (Longest Common Substring)? That has certainly been solved enough times on SO and elsewhere online. – Prune Jan 14 '18 at 04:10