I want to organise objects (books) into groups (works). The only data I have to test for membership is the title and the author.

Often the title and author are formatted slightly differently, such as "Firstname Lastname" or "Lastname. Firstname". Sometimes titles also contain the format ("Paperback", "Hardcover", "eBook").
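To illustrate the kind of variation I mean, here is a minimal Ruby sketch of normalising both fields before any comparison. The helper names and the `FORMAT_WORDS` list are my own assumptions, not part of any library, and the regex only keeps ASCII letters and digits, so accented names would need more care.

```ruby
# Hypothetical list of format markers to strip from titles (an assumption).
FORMAT_WORDS = %w[paperback hardcover ebook].freeze

# Lowercase the text, replace punctuation with spaces, and split into words.
def tokens(text)
  text.downcase.gsub(/[^a-z0-9\s]/, " ").split
end

# Drop format markers such as "Paperback" from a title.
def normalize_title(title)
  (tokens(title) - FORMAT_WORDS).join(" ")
end

# Sort the name tokens so that word order and separators do not matter.
def normalize_author(author)
  tokens(author).sort.join(" ")
end
```

With this, "Firstname Lastname" and "Lastname. Firstname" normalise to the same string, and a title with "(Paperback)" appended matches the plain title.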

Sometimes a group contains an object that doesn't belong, and sometimes a group contains many wrong products. I'm not expecting to be able to get this 100% correct.

My first thought was a Bayes classifier with just a single category trained from the group, then used to classify membership from the score of the book. After testing this out, I think it's not such a great idea.

My next thought was to build a vector from the words in the title and author, then calculate the distance between the group's vector and the object's vector to determine group membership. I've had a look at the rb-libsvm gem (I'll be using Ruby), which looks promising?
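As a rough sketch of the vector idea, assuming plain word counts and cosine similarity as the distance measure (this is not rb-libsvm, and the method names are hypothetical), something like the following is what I have in mind. It needs Ruby 2.7+ for `tally`.

```ruby
# Build a term-frequency vector (Hash of word => count) from title and author.
def term_vector(title, author)
  "#{title} #{author}".downcase.scan(/[a-z0-9]+/).tally
end

# Sum the vectors of all [title, author] pairs in a group into one group vector.
def group_vector(books)
  books.each_with_object(Hash.new(0)) do |(title, author), sum|
    term_vector(title, author).each { |word, count| sum[word] += count }
  end
end

# Cosine similarity: close to 1.0 for similar word usage, 0.0 for no shared words.
def cosine_similarity(a, b)
  dot = a.sum { |word, count| count * b.fetch(word, 0) }
  norm = Math.sqrt(a.values.sum { |v| v * v }) * Math.sqrt(b.values.sum { |v| v * v })
  norm.zero? ? 0.0 : dot / norm
end
```

A book would then be assigned to the group with the highest similarity, provided it clears some threshold that would have to be tuned on real data. Because the group vector is a sum over all members, a single wrong book in a group should only shift it slightly.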

Or is there some other way to cluster / classify these books into the groups?

Amit
dkam
  • Interesting ideas! I would like to know why the Bayes classifier didn't work out. I mean, what 'wrong' results did you get? It should have worked, by my thinking. – Tejash Desai Aug 14 '16 at 05:12
  • Do you have a fixed set of classes (groups) that you want to classify the books into? If so, then it sounds like a good candidate for a Bayes classifier. You can do some optimisations to the general scheme given that you have two separate information "contexts" (title, author) that you want to use for classifying. – chris Aug 16 '16 at 12:50
  • Thanks for the comments! My concern with using Bayes was that some of the groups include incorrect data - I had a feeling that Bayes wouldn't handle that well (it would assign the book to a group that contains a matching but incorrect book). I'll go ahead and do some more rigorous testing using Bayes. The groups already exist, but sometimes new groups need to be created. – dkam Aug 17 '16 at 23:52
