
I'm beginning to work on my ML course's project, which is to classify a scientific text as being about topic "A" or not. The problem I'm having is that I've been provided with a limited data set. Scientific texts usually make use of complex, irregular words that don't normally exist in pre-trained word2vec models like the Google News or Twitter ones, and these words carry a lot of the meaning of the texts. So I was wondering: what could I do to still use these pre-trained models and predict what the new words mean?

Farhood
  • Look into using support vector machines, which are a good choice for building a binary classifier on text documents. – Tim Biegeleisen Jul 11 '17 at 05:46
  • @TimBiegeleisen I'm currently using SVM, but the problem is that there are lots of important words I'm missing because they don't exist in the pre-trained models; it would give me way more accuracy if I could predict their meaning somehow. – Farhood Jul 11 '17 at 05:49
  • You can train your own word2vec model on your dataset, using the pre-trained model's vectors as the initial values. – Someone Jul 11 '17 at 05:56

1 Answer


So, don't use pre-trained models. Not only will they be missing your domain's words, but even for the words they do share, the senses in which those words are most used in news articles or on Twitter may not match your domain.

It's not hard to train your own word-vectors, or other doc-vectors, using the domain of interest as your training data.
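For instance, here's a minimal sketch of training word vectors directly on a domain corpus with gensim; `domain_texts` and the naive whitespace tokenization are placeholder assumptions for illustration, not part of the answer:

```python
from gensim.models import Word2Vec

# Placeholder corpus: in practice, load and properly tokenize your own scientific texts.
domain_texts = [
    "the spectroscopic survey detected faint quasar emission lines",
    "we fine-tune the perovskite synthesis to improve photovoltaic efficiency",
]
sentences = [text.lower().split() for text in domain_texts]

# vector_size is called `size` in gensim versions before 4.0.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, epochs=50, workers=4)

# Domain-specific words now have vectors, unlike in a news/Twitter model.
print(model.wv["quasar"])
print(model.wv.most_similar("quasar"))
```

With a real corpus you'd raise `min_count` so that rare misspellings and one-off tokens don't pollute the vocabulary.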

A follow-up paper to the original 'Paragraph Vectors' paper, "Document Embedding With Paragraph Vectors", specifically evaluates Paragraph Vectors (in the PV-DBOW variant) in a topic-sensitive way. For pairs of Wikipedia articles with the same editor-assigned category, it checks whether PV-DBOW places that pair closer to each other than to some randomly chosen third article. It performs a similar check on 886,000 arXiv papers.

Even if you have a small dataset, you might be able to use a similar technique. And if the exercise only provides a small dataset, perhaps other public datasets with similar vocabulary can be used to thicken your model.

(The PV-DBOW mode used in the above paper, with word-training added alongside the doc-vector training, is analogous to using the Doc2Vec class in the Python gensim library with the options dm=0, dbow_words=1.)
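A rough sketch of that setup, ending with the SVM mentioned in the comments; the variable names (`tokenized_docs`, `labels`) are placeholders and the parameter values are just illustrative, not recommendations from the paper:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.svm import SVC

# Placeholder data: tokenized_docs is a list of token lists, labels is 1 for
# topic "A" and 0 otherwise. Extra unlabeled domain documents can be appended
# to the training corpus to thicken the model, even without labels.
tagged = [TaggedDocument(words=tokens, tags=[i]) for i, tokens in enumerate(tokenized_docs)]

# dm=0 selects PV-DBOW; dbow_words=1 adds skip-gram word training, as in the paper.
model = Doc2Vec(tagged, dm=0, dbow_words=1, vector_size=100, window=5,
                min_count=2, epochs=40, workers=4)

# Use the learned doc-vectors as features for a binary SVM classifier.
# (model.dv is model.docvecs in gensim versions before 4.0.)
X = [model.dv[i] for i in range(len(labels))]
clf = SVC(kernel="rbf").fit(X, labels)

# New, unseen documents get a vector via inference before classification.
new_vec = model.infer_vector("tokens of a new scientific abstract".split())
print(clf.predict([new_vec]))
```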

gojomo