
I tagged a dataset of texts with independent categories. When running a CNN classifier in Keras, I get an accuracy of over 90%.

My texts are customer reviews, e.g. "I really liked the camera of this phone." Classes are e.g. "phone camera", "memory", etc.

What I am looking for is whether I can tag the sentences with the categories that appear in them while the classifier marks the entities that indicate the class. Or, more specifically: how can I extract those parts of the input sentence that made a CNN network in Keras opt (i.e. classify) for 1, 2 or more categories?

junkmaster
  • Do you want to assign a label (i.e. good/bad/normal) to entities like camera, memory etc., or just mark a text with categories - i.e. for "I really liked the camera of this phone." there would be the labels "camera", "phone"? – Mikhail Stepanov Dec 18 '18 at 09:39
  • At first, I want to find out which categories are in a sentence. Later, I want to classify several things, such as the sentiment (good/bad/normal) for every category, but possibly also other things such as importance, subjectivity, etc. As there are often several categories in a sentence, I cannot classify an overall sentiment or importance. – junkmaster Dec 18 '18 at 09:46
  • So, I have the same problem with customer review processing. As far as I know, there's no ready-made NN architecture which can do this task. I use text preprocessing (splitting sentences into separate entities), then label the dataset by hand with labels good/bad/norm/neutral etc., then train classifiers. If this approach is suitable for you, I can write it up as an answer with the general pipeline. – Mikhail Stepanov Dec 18 '18 at 09:53
  • That would help. Thanks! Perhaps someone else can help us with another solution later. – junkmaster Dec 18 '18 at 10:03

1 Answer


My pipeline (in general) for a similar task.

I don't use NNs to solve the whole task

First, I don't use NNs directly to label separate entities like "camera", "screen", etc. There are some good approaches which might be useful, like pointer networks or just attention, but they just didn't work in my case.
I guess these architectures don't work well because there is a lot of noise in my dataset, like "I'm so glad I bought this TV" and so on - approximately 75% overall, and the rest of the data is not so clean, either.

Because of this, I take some additional steps:

  1. Split sentences into chunks (sometimes they contain the desired entities)
  2. Label these chunks by hand as "non-useful" (e.g. "I'm so happy/so upset" etc.) and useful: "good camera", "bad phone", etc.
  3. Train classifier(s) to classify this data.

Details about the pipeline

How to "recognize" entities
I just used regexps and part-of-speech tags to split my data. But I work with russian language dataset, so there's not good free syntax parser / library for russian. If you work with english or another language, well-presented in spacy or nltk libraries, you can use it for parsing to separate entities. Also, english grammar is so strict in contrast to russian - it's make your task easier probably.
Anyway, try to start with regexes and parsing.
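
For illustration, here's a minimal sketch of that first pass, assuming English text and spaCy's en_core_web_sm model (the function name and the ADJ+NOUN heuristic are just examples, not my exact setup):

    import spacy

    # assumes the small English model is installed:
    #   python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")

    def candidate_chunks(review):
        """Collect candidate entity chunks: noun chunks plus
        ADJ+NOUN pairs found via part-of-speech tags."""
        doc = nlp(review)
        chunks = [c.text for c in doc.noun_chunks]
        tokens = list(doc)
        for left, right in zip(tokens, tokens[1:]):
            if left.pos_ == "ADJ" and right.pos_ == "NOUN":
                chunks.append(left.text + " " + right.text)
        return chunks

    print(candidate_chunks("I really liked the camera of this phone."))
    # e.g. ['I', 'the camera', 'this phone']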

Vocabularies with keywords for topics like "camera", "battery", ... are very helpful, too.
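
The keyword lookup can be as simple as the sketch below (the vocabularies here are made up for illustration):

    # hypothetical keyword vocabularies, one per category
    CATEGORY_KEYWORDS = {
        "camera":  {"camera", "photo", "lens", "picture"},
        "battery": {"battery", "charge", "charging"},
        "memory":  {"memory", "storage", "ram"},
    }

    def tag_categories(chunk):
        """Return every category whose vocabulary overlaps the chunk's words."""
        words = set(chunk.lower().split())
        return [cat for cat, kws in CATEGORY_KEYWORDS.items() if words & kws]

    print(tag_categories("I really liked the camera of this phone"))  # ['camera']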

Another approach to recognizing entities is topic modeling - PLSA/LDA (gensim rocks) - but it's hard to tune, IMO, because there is a lot of noise in the texts. You'll get a lot of topics like {"happy", "glad", "bought", "family", ...} and so on - but you can try topic modeling anyway.
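
If you want to try it, a minimal gensim LDA sketch looks like this (toy corpus; real reviews would need tokenization and stop-word removal first):

    from gensim import corpora, models

    # toy tokenized corpus; preprocess real reviews before this step
    texts = [
        ["good", "camera", "nice", "photo"],
        ["battery", "lasts", "long"],
        ["happy", "bought", "phone", "family"],  # the noisy kind of review
    ]

    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(t) for t in texts]

    lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
    for topic_id, words in lda.show_topics(num_topics=2, num_words=4):
        print(topic_id, words)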

Also, you can create a dataset with entity labels for each text and train an NN with attention, so you can recognize entities by their high attention weights - but creating such a dataset is very tedious.
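
A minimal Keras sketch of that idea - one learned score per token, softmaxed over the sequence, so high weights point at the tokens that drove the label (all sizes are made up; this is not my production model):

    from tensorflow import keras
    from tensorflow.keras import layers

    # hypothetical sizes for illustration
    vocab_size, maxlen, embed_dim, n_classes = 5000, 50, 64, 5

    inputs = keras.Input(shape=(maxlen,))
    x = layers.Embedding(vocab_size, embed_dim)(inputs)

    # one attention score per token, softmaxed over the sequence
    scores = layers.Dense(1)(x)                          # (batch, maxlen, 1)
    weights = layers.Softmax(axis=1, name="attention")(scores)

    # weighted sum of token embeddings -> sentence vector
    context = layers.Dot(axes=1)([weights, x])           # (batch, 1, embed_dim)
    context = layers.Flatten()(context)

    # sigmoid output: a sentence can belong to 1, 2 or more categories
    outputs = layers.Dense(n_classes, activation="sigmoid")(context)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy")

    # after training, inspect the attention weights per token
    attention_model = keras.Model(inputs, model.get_layer("attention").output)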

Create the dataset and train NNs
I start creating the dataset only when I've got acceptable quality on the "named entities" - because if you change this foundational part later, you will probably have to throw away the dataset and start from scratch.

Better to decide which labels you will use once and then not change them - it's a critical part of the work.

Training NNs on such data is probably the easiest part of the work - just use any good classifier, as you would for whole texts. Even simpler, non-NN classifiers might be useful - use blending, bagging, etc.
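
For example, a quick scikit-learn sketch with TF-IDF features and a soft-voting blend of two simple classifiers (the tiny dataset here is only a placeholder for your hand-labeled chunks):

    from sklearn.ensemble import VotingClassifier
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # placeholder data: hand-labeled chunk texts and their labels
    chunks = ["good camera", "bad phone", "so happy i bought this"]
    labels = ["useful", "useful", "non-useful"]

    model = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),
        VotingClassifier(
            estimators=[
                ("lr", LogisticRegression(max_iter=1000)),
                ("nb", MultinomialNB()),
            ],
            voting="soft",  # average the predicted probabilities
        ),
    )
    model.fit(chunks, labels)
    print(model.predict(["great battery"]))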

Possible troubles
There's a trap - some reviews/features are not so obvious to an NN classifier, or even to a human, like "loud sound" or "gets very hot". They are often context-dependent. So, I use a little help from our team to mark the dataset - each entry was labeled by a group of humans to get better quality. Also, I use context labels - the category of the product - adding context to each entity: "loud sound" carries opposite sentiment for an audio system versus a washing machine, and the model can learn this. In most cases, category labels are easily accessible through databases/web parsing.
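
The context labels can be as crude as prepending a category token to each chunk before vectorizing (the token scheme below is just one way to do it):

    # prepend the product category, so the same phrase can carry
    # different sentiment depending on the product it describes
    def add_context(chunk, category):
        return "[{}] {}".format(category, chunk)

    print(add_context("loud sound", "audio_system"))     # likely positive
    print(add_context("loud sound", "washing_machine"))  # likely negative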

Hope this helps; I also hope someone knows a better approach.

Mikhail Stepanov