
EDIT: I'm trying to classify new user reviews into a predefined set of tags. Each review can have multiple tags associated with it.

I've mapped my DB user reviews to 15 categories. The following example shows the review text, the mapped categories, and the reasoning behind the mapping:


USER_REVIEWS | CATEGORIES
"Best pizza ever, we really loved this place, our kids ..." | "food,family"
"The ATV tour was extreme and the nature was beautiful ..." | "active,nature"

Reasoning:

pizza: food
our kids: family
The ATV tour was extreme: active
nature was beautiful: nature


EDIT: I tried two approaches to formatting the training data:

The first includes all categories in a single file like so:

"food","Best pizza ever, we really loved this place, our kids..."
"family","Best pizza ever, we really loved this place, our kids..."

The second approach was splitting the training data into 15 separate files (one per category), like so:

family_training_data.csv:

"true" , "Best pizza ever, we really loved this place, our kids..."
"false" , "The ATV tour was extreme and the nature was beautiful ..."

Neither approach was conclusive; both missed tags most of the time.
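For reference, here is a minimal sketch of how the single-file format can be collapsed into one record per review with all of its tags (the file name and parsing are illustrative; my real data comes from the DB):

    import csv
    from collections import defaultdict

    # Gather every (category, text) row into a single tag list per review.
    tags_by_review = defaultdict(list)
    with open("training_data.csv", newline="") as f:
        for category, text in csv.reader(f):
            tags_by_review[text].append(category)

    # Each review now carries all of its categories, e.g.
    # "Best pizza ever, ..." -> ["food", "family"]
    for text, tags in tags_by_review.items():
        print(tags, text[:40])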


Here are some questions that came up, while I was experimenting:

  1. Some of my reviews are very long (more than 300 words). Should I limit the word count in my training data so that it matches the average review length (80 words)?
  2. Is it better to split the data into 15 training files with a TRUE/FALSE label (i.e., "is this review text of a specific category?"), or to mix all categories in one training file?
  3. How can I train the model to find synonyms or related keywords, so that it tags "The motorbike ride was great" as active even though the training data only had a record for an ATV ride?

I've tried the approaches described above, without any good results.
Q: What training data format would give the best results?

Shlomi Schwartz
  • You've given a very broad set of questions; I think this is beyond the StackOverflow range of application. As it stands, I don't think I can answer this. What specific problem are you trying to solve? What constitutes "good results"? What are your criteria for "best results"? Why do you want to *train* a model to a lexicon, when this is generally a directed task? – Prune Oct 15 '15 at 00:55
  • Thank you for your reply, I'll try to elaborate. The problem I'm trying to solve is classifying reviews into predefined tags. At the moment the results I get are (most of the time) inconclusive, or miss tagging altogether; good results would be a review being tagged correctly 80% of the time. Since I'm no expert in building training data, I came here with many uncertainties. – Shlomi Schwartz Oct 15 '15 at 06:41
  • Please check my edits :) – Shlomi Schwartz Oct 15 '15 at 06:53
  • Regarding your questions 1. and 3., I think it may help to write code to preprocess your training examples and your inputs. Your classification is primarily keyword-based, so programmatically filtering out articles, punctuation, etc., normalizing grammatical case, and potentially also constructing a synonym graph using some existing database (and including the associations in the training samples) will reduce the noise-to-signal ratio. – Igor Raush Oct 19 '15 at 22:06

2 Answers


I'll start with the parts I can answer with the given information. Maybe we can refine your questions from there.

Question 3: You can't train a model to recognize a new vocabulary word without supporting context. It's not just that "motorbike" is missing from the training set; "ride" is missing as well, and the other words in the review do not relate to transportation. The cognitive information you seek is simply not in the data you present.

Question 2: This depends on the training method you're considering. You can give each tag its own feature column with a true/false value. This is functionally equivalent to 15 separate data files, each with a single true/false label. The one-file method gives you the chance to later add some context support between categories.
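To make that equivalence concrete, here is a minimal sketch (assuming a scikit-learn-style toolchain, not necessarily the tool you are using): binarizing the tag lists yields one true/false column per category, and a one-vs-rest wrapper trains one independent binary classifier per column -- the "15 separate files" scheme held in a single place.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import MultiLabelBinarizer

    # Toy stand-ins for your reviews and their tag lists.
    reviews = [
        "Best pizza ever, we really loved this place, our kids too",
        "The ATV tour was extreme and the nature was beautiful",
    ]
    tags = [["food", "family"], ["active", "nature"]]

    # One true/false column per category -- the single-file equivalent
    # of keeping 15 separate true/false training files.
    binarizer = MultiLabelBinarizer()
    y = binarizer.fit_transform(tags)

    # One independent binary classifier per tag column.
    model = make_pipeline(
        TfidfVectorizer(),
        OneVsRestClassifier(LogisticRegression()),
    )
    model.fit(reviews, y)

    print(binarizer.inverse_transform(model.predict(["The zipline tour was wild"])))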

Question 1: The length itself is not particularly relevant, except that cutting out unproductive words will help focus the training -- you won't get nearly as many spurious classifications from incidental correlations. Do you have a way to reduce the size programmatically? Can you apply that same reduction to the new input you want to classify? If not, then I'm not sure it's worth the effort.
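If you do have such a reduction, apply the identical step to the training records and to each new review, so the classifier sees a consistent vocabulary. A minimal sketch (the stop-word list and the 80-word cap are illustrative):

    import re

    # A tiny illustrative stop-word list; in practice use a fuller one.
    STOP_WORDS = {"a", "an", "and", "the", "was", "we", "this", "to", "of"}
    MAX_WORDS = 80  # the average review length mentioned in the question

    def reduce_review(text, max_words=MAX_WORDS):
        """Lowercase, strip punctuation, drop stop words, cap the length."""
        words = re.findall(r"[a-z']+", text.lower())
        kept = [w for w in words if w not in STOP_WORDS]
        return " ".join(kept[:max_words])

    print(reduce_review("The ATV tour was extreme and the nature was beautiful"))
    # -> "atv tour extreme nature beautiful"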


OPEN ISSUES

What empirical evidence do you have that 80% accuracy is possible with the given data? If the training data do not contain the theoretical information needed to accurately tag that data, then you have no chance to get the model you want.

Does your chosen application have enough intelligence to break the review into words? Is there any cognizance of word order or semantics -- and do you need that?

Prune
  • I have no evidence that ~80% is possible; it's just my goal. I was looking at Alchemy (http://www.alchemyapi.com/products/demo/alchemylanguage), especially the taxonomy section, for inspiration. What would be the proper way to add cognitive information to my training data? – Shlomi Schwartz Oct 20 '15 at 12:12
  • There is no *one* proper way; it depends on the cognitive information you want to add and the design of the system you're building. Thanks for the Alchemy link; that's a lovely, sophisticated system. Do realize that this is a showcase piece for a complex, released product. Great inspiration, but a large project. – Prune Oct 20 '15 at 16:55
  • I admit that I feel like this discussion is both misplaced (doesn't belong on SO) and lacks focus. My problem is that I don't know what you want as the outcome of your posting. You've asked several implementation-specific questions, but when I ask about higher-level concepts of this system, I get another question instead of a solid answer. So ... what stage is this project in? What is the objective of the project, and what are your available resources and time line? What do you have in the way of goals, objectives, requirements, and specifications? This helps me give useful feedback. – Prune Oct 20 '15 at 17:01

I've faced similar problems; here are my insights regarding your questions:

  1. According to the Watson Natural Language Classifier documentation, it is best to limit the length of input text to fewer than 60 words, so I guess trimming your long reviews to your 80-word average (or less) will produce better results.
  2. You can go either way, but separate files will produce less ambiguous results.
  3. Creating a synonym graph, as suggested in the comments, would be a good place to start (see the sketch after this list); Watson is aimed at more complex cognitive solutions.
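For item 3, here is a minimal sketch of pulling synonyms and near-synonyms from an existing lexical database (assuming NLTK's WordNet interface; coverage of domain terms like "ATV" is spotty, so you may still need a hand-built synonym table on top):

    # Requires: pip install nltk, then nltk.download("wordnet") once.
    from nltk.corpus import wordnet

    def related_words(word):
        """Collect synonyms and direct hypernyms of a word from WordNet."""
        related = set()
        for synset in wordnet.synsets(word):
            for lemma in synset.lemmas():
                related.add(lemma.name().replace("_", " "))
            for hypernym in synset.hypernyms():
                for lemma in hypernym.lemmas():
                    related.add(lemma.name().replace("_", " "))
        return related

    # WordNet links "motorbike" to "motorcycle" (via its hypernym), so expanding
    # both the training text and the incoming review makes them overlap.
    print(related_words("motorbike"))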

Some other helpful tips from the Watson guidelines:

  • Limit the length of input text to fewer than 60 words.
  • Limit the number of classes to several hundred classes. Support for larger numbers of classes might be included in later versions of the service.
  • When each text record has only one class, make sure that each class is matched with at least 5 - 10 records to provide enough training on that class.
  • It can be difficult to decide whether to include multiple classes for a text. Two common reasons drive multiple classes:
    • When the text is vague, identifying a single class is not always clear.
    • When experts interpret the text in different ways, multiple classes support those interpretations.
  • However, if many texts in your training data include multiple classes, or if some texts have more than three classes, you might need to refine the classes. For example, review whether the classes are hierarchical. If they are hierarchical, include the leaf node as the class.
Roni Gadot