
I am using a scikit-learn supervised learning method for text classification. I have a training dataset with input text fields and the categories they belong to. I use a tf-idf + SVM classifier pipeline to create the model. The solution works well for normal test cases, but if a new text contains words that are synonymous with words in the training set, it fails to classify correctly. For example, the word 'run' might be in the training data, but if I use the word 'sprint' at test time, the solution fails to classify correctly.

What is the best approach here? Adding all synonyms for every word in the training dataset doesn't look like a scalable approach to me.

Shamy

1 Answer


You should look into word vectors and dense document embeddings. Right now you are passing scikit-learn a matrix X where each row is a numerical representation of a document in your dataset. You get this representation with tf-idf, but as you noticed it doesn't capture word similarities, and you are also having issues with out-of-vocabulary words.
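To see why, here is a small illustration of that limitation; the two example sentences and the default vectorizer settings are just assumptions for the sketch:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ['I had a good run', 'I had a great sprint']
X = TfidfVectorizer().fit_transform(docs)

# 'run' and 'sprint' occupy different columns, so the only overlap comes from
# the shared word 'had' (single-letter tokens are dropped by the default
# tokenizer) and the tf-idf similarity stays low.
print(cosine_similarity(X[0], X[1]))

Compare that with the spaCy similarity of 0.88 for the same pair of sentences further down.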

A possible improvement is to represent each word with a dense vector of, let's say, dimension 300, in such a way that words with similar meanings are close together in this 300-dimensional space. Fortunately you don't need to build these vectors from scratch (look up gensim word2vec and spaCy). Another good thing is that by using word embeddings pre-trained on a very large corpus like Wikipedia, you are incorporating a lot of linguistic information about the world into your algorithm that you couldn't otherwise infer from your corpus (like the fact that sprint and run are near-synonyms).
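As a quick way to try this out, here is a minimal sketch that pulls pre-trained vectors through gensim's downloader; the 'glove-wiki-gigaword-100' model name is just one readily available option (Google News word2vec vectors or spaCy's bundled vectors work the same way):

import gensim.downloader as api

# Downloads the pre-trained GloVe vectors on first use and returns a KeyedVectors object.
wv = api.load('glove-wiki-gigaword-100')

# Semantically related words end up close together in this space.
print(wv.similarity('run', 'sprint'))
print(wv.most_similar('sprint', topn=3))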

Once you have good semantic numeric representations for words, you need a vector representation for each document. The simplest way is to average the word vectors of all the words in the document.
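A minimal sketch of that averaging step, assuming `wv` is a gensim `KeyedVectors` object like the one loaded above and using a deliberately naive whitespace tokenizer:

import numpy as np

def doc_vector(text, wv):
    # Average the vectors of the in-vocabulary tokens; fall back to a zero
    # vector when none of the words are known.
    tokens = [t for t in text.lower().split() if t in wv]
    if not tokens:
        return np.zeros(wv.vector_size)
    return np.mean([wv[t] for t in tokens], axis=0)

spaCy's `Doc.vector` does essentially this averaging for you, which is what the example below relies on.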

Example pseudocode to get you started:

>>> import spacy

>>> nlp = spacy.load('en')
>>> doc1 = nlp('I had a good run')
>>> doc1.vector
array([  6.17495403e-02,   2.07064897e-02,  -1.56451517e-03,
         1.02607915e-02,  -1.30429687e-02,   1.60102192e-02, ...

Now let's try a different document:

>>> doc2 = nlp('I had a great sprint')
>>> doc2.vector
array([ 0.02453461, -0.00261007,  0.01455955, -0.01595449, -0.01795897,
   -0.02184369, -0.01654281,  0.01735667,  0.00054854, ...

>>> doc2.similarity(doc1)
0.8820845113100807

Note how the vectors are similar (in the sense of cosine similarity) even when the words are different. Because the vectors are similar, a scikit-learn classifier will learn to assign them to the same category. With a tf-idf representation this would not be the case.

This is how you can use these vectors in scikit-learn:

X = [nlp(text).vector for text in corpus]
clf.fit(X, y)
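
For completeness, here is a minimal end-to-end sketch along those lines; the toy corpus, the labels and the 'en_core_web_md' model name are assumptions for illustration (any spaCy model that ships with word vectors will do):

import spacy
from sklearn.svm import LinearSVC

nlp = spacy.load('en_core_web_md')   # a model that includes word vectors

corpus = ['I had a good run this morning',
          'she finished the marathon in record time',
          'the stock market crashed today',
          'interest rates were raised by the central bank']
y = ['sports', 'sports', 'finance', 'finance']

# One dense vector per document (spaCy averages the word vectors for us).
X = [nlp(text).vector for text in corpus]

clf = LinearSVC()
clf.fit(X, y)

# 'sprint' never appears in the training data, but its vector is close to
# 'run', so the classifier should still land on the right category.
print(clf.predict([nlp('I had a great sprint').vector]))   # expected: ['sports']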
elyase
  • I tried your suggestion and was able to reproduce your result. But when I tested with "I had a great drink" it still gave me a high similarity of 0.88. Then I tested the similarity between 'run' and 'sprint' alone and got 0.49. Can you suggest any enhancements to your approach to make it perform better? Thanks – Shamy Oct 11 '16 at 13:48
  • As you noticed, NLP is still an unsolved problem and it will never work perfectly. Still, word vectors are an improvement: with tf-idf the similarity between sprint and run would be 0, which is even worse. You can try Brown clusters or synsets to deal with synonyms, but most SOTA systems use word vectors, and my suggestion would be to keep trying to make them work. This is only the start and there are many things you can improve, like getting better vectors or better phrase representations, but I would suggest you ask a new question because it is a lot of information for a comment. – elyase Oct 11 '16 at 14:49
  • Thanks @elyase I have posted a new question. Can you please answer that? – Shamy Oct 17 '16 at 16:30
  • Averaging the word vectors is also a crude approximation; it is one way to use this representation, but not ideal. There is quite a bit of research in this area on how to get better phrase representations, or even document representations. – Dr VComas Oct 24 '16 at 20:55
  • I disagree with @DrVComas' assertion that word vector averaging is a crude approximation. Don't let yourself be fooled by its simplicity: it has proven competitive time and again, even beating more complex CNN and RNN architectures on some tasks (see for example the [DAN](https://cs.umd.edu/~miyyer/pubs/2015_acl_dan.pdf), [paraphrastic embeddings](https://arxiv.org/abs/1511.08198), [fastText](https://arxiv.org/abs/1607.01759), or more recently the [Siamese CBOW](https://arxiv.org/abs/1606.04640v1) papers). It is of course not a silver bullet, but it is an effective and very strong baseline. – elyase Oct 25 '16 at 22:42
  • The problems you are having with the similarity of *sprint* and *run* most probably stem from the fact that you are using the same vector representation for *run* the verb and *run* the noun (this is called polysemy). The representation for `run|VERB` is probably the winning one because it occurs more frequently. The syntactic aspects of `run|NOUN` get lost and you get a low similarity score. I would suggest that you first annotate the words with POS tags and build embeddings for each distinct token, à la [sense2vec](https://arxiv.org/abs/1511.06388); a rough sketch of this idea follows after these comments. – elyase Oct 25 '16 at 22:51
  • You can take a look [here](https://explosion.ai/blog/sense2vec-with-spacy) for a good implementation using `spacy`. – elyase Oct 25 '16 at 22:54
  • Averaging is one way to use the word representations. Intuitively, a word representation has a meaning; once the average is calculated, that meaning, or at least part of it, is lost. Should it be better than BOW? Sure. Should it be averaged? That is another question; if there is nothing else at hand, it is good enough. The real question is how to better represent a document. – Dr VComas Oct 31 '16 at 19:57
  • @DrVComas, that's the thing: in practice the loss of meaning caused by the averaging doesn't seem to matter that much; there are many factors in play. In theory RNNs should work better because they model order and the compositional nature of language, but that theoretical expectation doesn't translate well to practice. It is still not 100% clear why that's the case, but there is a [recent paper](https://arxiv.org/pdf/1608.04207v2.pdf) from Yoav Goldberg with some interesting results. Paradoxically, word averaging encodes order better than encoder-decoder architectures in some configurations. – elyase Oct 31 '16 at 23:40
  • @elyase it is kind of counterintuitive that so little is lost. Thanks for the papers. – Dr VComas Nov 01 '16 at 01:17
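
A rough sketch of the POS-annotation idea from the comments above, assuming spaCy and gensim (version 4 parameter names) are installed; the `word|POS` token format mirrors sense2vec, and the three toy sentences are only there to show that `run|NOUN` and `run|VERB` become separate vocabulary entries (a real model needs a large corpus):

import spacy
from gensim.models import Word2Vec

nlp = spacy.load('en_core_web_sm')

def pos_tokens(text):
    # Annotate each token with its part of speech, e.g. 'run|NOUN' vs 'run|VERB'.
    return [f'{tok.lemma_.lower()}|{tok.pos_}' for tok in nlp(text) if not tok.is_punct]

sentences = [pos_tokens(t) for t in ['I had a good run',
                                     'They run the company',
                                     'She went for a sprint']]

# Train embeddings over the annotated tokens so each word sense gets its own vector.
model = Word2Vec(sentences, vector_size=50, min_count=1)
print(sorted(model.wv.key_to_index))   # the two senses of 'run' should show up as separate keys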