EDIT: I'm trying to classify new user reviews into a predefined set of tags. Each review can have multiple tags associated with it.
I've mapped the user reviews in my DB to 15 categories. The following example shows the review text, the mapped categories, and the reasoning behind the mapping:
USER_REVIEWS | CATEGORIES
"Best pizza ever, we really loved this place, our kids ..." | "food,family"
"The ATV tour was extreme and the nature was beautiful ..." | "active,nature"
pizza:food
our kids:family
The ATV tour was extreme:active
nature was beautiful:nature
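One way to hold this mapping in code is as a set of tags per review, turned into one 0/1 flag per category. This is a minimal sketch, assuming only the four category names that appear in the example (the full list of 15 is not given here); `parse_tags` and `to_indicator` are hypothetical helper names:

```python
# Assumption: only the four tags that appear in the example are listed;
# the remaining categories are not given in the question.
CATEGORIES = ["active", "family", "food", "nature"]


def parse_tags(cell):
    """Split a comma-separated CATEGORIES cell into a set of tag names."""
    return {tag.strip() for tag in cell.split(",") if tag.strip()}


def to_indicator(tags, categories=CATEGORIES):
    """Return one 0/1 flag per category, in the order of `categories`."""
    return [1 if category in tags else 0 for category in categories]


review = "Best pizza ever, we really loved this place, our kids..."
flags = to_indicator(parse_tags("food,family"))

# Print each category next to its flag for this review.
for category, flag in zip(CATEGORIES, flags):
    print(category, flag)
```

Each review then carries a whole row of flags rather than a single label, which is what makes this a multi-label problem rather than a multi-class one.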
EDIT: I tried two layouts for the training data:
The first includes all categories in a single file like so:
"food","Best pizza ever, we really loved this place, our kids..."
"family","Best pizza ever, we really loved this place, our kids..."
The second approach was splitting the training data into 15 separate files, one per category, like so:
family_training_data.csv:
"true" , "Best pizza ever, we really loved this place, our kids..."
"false" , "The ATV tour was extreme and the nature was beautiful ..."
Neither approach gave conclusive results; both missed the correct tags most of the time.
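For reference, both layouts described above can be generated from the same review-to-tags mapping, so they can be compared on identical data. This is a rough sketch using only the standard library; the function names are made up for illustration, and the ATV review's tags follow the phrase-to-category reasoning (active, nature):

```python
import csv
import io

# The two example reviews from the question, each with its list of tags.
REVIEWS = [
    ("Best pizza ever, we really loved this place, our kids...",
     ["food", "family"]),
    ("The ATV tour was extreme and the nature was beautiful ...",
     ["active", "nature"]),
]


def single_file_rows(reviews):
    """First layout: one (category, text) row per tag a review carries."""
    return [(tag, text) for text, tags in reviews for tag in tags]


def per_category_rows(reviews, category):
    """Second layout: one ("true"/"false", text) row per review."""
    return [("true" if category in tags else "false", text)
            for text, tags in reviews]


def to_csv(rows):
    """Serialize rows as fully quoted CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()


print(to_csv(single_file_rows(REVIEWS)))
print(to_csv(per_category_rows(REVIEWS, "family")))
```

In the first layout a review with several tags appears once per tag with a different label each time, while in the second every review appears exactly once in each category file.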
Here are some questions that came up, while I was experimenting:
- Some of my reviews are very long (more than 300 words). Should I cap the word count in my training data so that it matches the average review length (80 words)?
- Is it better to split the data into 15 training files, each with a TRUE/FALSE label (does the review text belong to that specific category?), or to mix all categories in one training file?
- How can I train the model to find synonyms or related keywords, so it can tag "The motorbike ride was great" as "active", although the training data had a record for "ATV ride"?
I've tried the approaches described above, without any good results.
Q: What training data format would give the best results?