I want to organise objects (books) into groups (works). The only data I have to test for membership is title and author.
Often the title and author are formatted slightly differently, such as "Firstname Lastname" or "Lastname. Firstname". Sometimes titles contain the format ("Paperback", "Hardcover", "eBook").
Sometimes a group may contain an object that doesn't belong. Sometimes a group may contain many wrong objects. I'm not expecting to be able to get this 100% correct.
My first thought was a Bayes classifier with just a single category trained from the group, then used to classify membership from the score of the book. After testing this out, I think it's not such a great idea.
My next thought was to build a vector from the words in the title and author, then calculate the distance between the group's vector and the object's vector to determine group membership. I've had a look at the rb-libsvm gem (I'll be using Ruby); does that look promising?
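
To make the vector idea concrete, here is a minimal sketch of what I have in mind. It does not use rb-libsvm. It assumes plain bag-of-words counts and cosine similarity as the distance measure. The helper names, the tokenisation rule, and the sample strings are all made up for illustration.

```ruby
# Hypothetical helpers: names and tokenisation are assumptions, not a library API.

# Lowercase the text, split it into alphanumeric words, and count each word.
def word_vector(text)
  text.downcase.scan(/[a-z0-9]+/).each_with_object(Hash.new(0)) do |word, counts|
    counts[word] += 1
  end
end

# Sum the word counts of every book string in a group into one vector.
def group_vector(books)
  books.each_with_object(Hash.new(0)) do |book, sum|
    word_vector(book).each { |word, count| sum[word] += count }
  end
end

# Cosine similarity between two sparse vectors stored as hashes.
# Values near 1.0 mean very similar word usage; 0.0 means no shared words.
def cosine_similarity(a, b)
  dot = a.sum { |word, count| count * b.fetch(word, 0) }
  norm_a = Math.sqrt(a.values.sum { |c| c * c })
  norm_b = Math.sqrt(b.values.sum { |c| c * c })
  return 0.0 if norm_a.zero? || norm_b.zero?
  dot / (norm_a * norm_b)
end

# Each book is represented here as "title author" in one string (an assumption).
work = group_vector([
  "Moby Dick Herman Melville",
  "Moby Dick (Paperback) Melville. Herman"
])

puts cosine_similarity(work, word_vector("Moby Dick eBook Herman Melville"))
puts cosine_similarity(work, word_vector("Pride and Prejudice Jane Austen"))
```

A book would then count as a member when its similarity to the group's vector is above some threshold, which I would have to tune by hand. Format words like "Paperback" could probably be dropped as stop words before counting.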
Or is there some other way to cluster / classify these books into the groups?