
I'm classifying dialects using naive Bayes with CountVectorizer. I read a research paper in which the author used a combination of:

bigrams + trigrams + word-marks vocabulary 

By word-marks, he means words that are specific to a certain dialect.

How can I set up this combination in CountVectorizer?

word marks

Here are some examples of word marks. They aren't the ones I actually use, because mine are in Arabic, so I translated them.

word_marks=['love', 'funny', 'happy', 'amazing']

Those are used to classify a text.

Also, in this post: Understanding the `ngram_range` argument in a CountVectorizer in sklearn

There was this answer:

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> v = CountVectorizer(ngram_range=(1, 2), vocabulary={"keeps", "keeps the"})
>>> v.fit_transform(["an apple a day keeps the doctor away"]).toarray()
array([[1, 1]])  # unigram and bigram found

I couldn't understand the output. What does [1, 1] mean here? And how was he able to use `ngram_range` together with `vocabulary`? Aren't the two mutually exclusive?

John Sall
  • Do you want to use a single CountVectorizer to get all three: bigrams, trigrams and the word-marks vocabulary? And do you have a dictionary of words per dialect? Please add an example with input and expected output. – a'- May 10 '19 at 19:58
  • Yes, I want to use all three together, and I do have a dictionary of words per dialect. – John Sall May 11 '19 at 10:42
  • array([[1, 1]]) means that the CountVectorizer found one instance of "keeps" and one instance of "keeps the" in the input sentence. The code only looks for matches in the custom vocabulary. Try changing the input sentence to add more instances of "keeps" and "keeps the" to see how the output counts change. – Adnan S May 12 '19 at 18:07
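
The experiment suggested in the last comment can be sketched as follows. This is only an illustration: the second sentence is made up, and the vocabulary is passed as a list (rather than a set) so that the column order is explicit. `ngram_range` decides which n-grams are generated from the text, and `vocabulary` decides which of those are kept as columns, so the two arguments work together rather than excluding each other.

```python
from sklearn.feature_extraction.text import CountVectorizer

# With a list, column 0 counts "keeps" and column 1 counts "keeps the".
vocab = ["keeps", "keeps the"]
v = CountVectorizer(ngram_range=(1, 2), vocabulary=vocab)

# Original sentence: one "keeps" and one "keeps the".
one = v.fit_transform(["an apple a day keeps the doctor away"]).toarray()

# Made-up sentence: "keeps" appears twice, "keeps the" once.
two = v.transform(["keeps the doctor away and keeps on going"]).toarray()

print(one)
print(two)
```

Each row is one input document and each column is one vocabulary entry, so the numbers are plain occurrence counts.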

1 Answer


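Use the `ngram_range` argument to get bigrams and trigrams. In your case, that would be CountVectorizer(ngram_range=(2, 3)) for bigrams and trigrams only, or CountVectorizer(ngram_range=(1, 3)) if you also want single words (unigrams) as features.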
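
As a rough check of what `ngram_range` selects, the sketch below fits a vectorizer on the sentence from the question and lists the learned features. It assumes the default tokenizer, which drops one-character tokens such as "a". With `(2, 3)` only bigrams and trigrams are produced; `(1, 3)` would add the single words as well.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["an apple a day keeps the doctor away"]

# (2, 3) means: generate n-grams of length 2 and 3 only.
vectorizer = CountVectorizer(ngram_range=(2, 3))
X = vectorizer.fit_transform(docs)

# vocabulary_ maps each n-gram to its column index; sort by that index.
features = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

for name in features:
    print(name)
```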

See the accepted answer to this question for more details.

Please provide an example of "word-marks" for the other part of your question.

You may have to run CountVectorizer twice: once for the n-grams and once for your custom word-mark vocabulary. You can then concatenate the outputs of the two CountVectorizers column-wise to get a single feature set of n-gram counts and custom vocabulary counts. Both outputs are sparse matrices, so use a sparse-aware function such as `scipy.sparse.hstack` for the concatenation. The answer to the above question also explains how to specify a custom vocabulary for this second use of CountVectorizer.
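
A minimal sketch of this two-vectorizer approach is shown below. The texts and the labels "A" and "B" are made-up toy data standing in for dialect samples, `word_marks` is the translated list from the question, and `MultinomialNB` is an assumption about which naive Bayes variant is in use.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data; "A" and "B" stand in for two dialects.
texts = [
    "i love this funny movie",
    "what an amazing happy day",
    "the weather is cold today",
    "he went to the market today",
]
labels = ["A", "A", "B", "B"]

word_marks = ["love", "funny", "happy", "amazing"]

# First vectorizer: bigrams and trigrams learned from the training texts.
ngram_vec = CountVectorizer(ngram_range=(2, 3))
# Second vectorizer: counts only the fixed word-mark vocabulary.
marks_vec = CountVectorizer(vocabulary=word_marks)

X_ngrams = ngram_vec.fit_transform(texts)
X_marks = marks_vec.fit_transform(texts)

# Stack the two sparse matrices side by side into one feature matrix.
X = hstack([X_ngrams, X_marks]).tocsr()

clf = MultinomialNB().fit(X, labels)

# New text must go through the same two vectorizers, in the same order.
new_texts = ["such a funny and amazing story"]
X_new = hstack(
    [ngram_vec.transform(new_texts), marks_vec.transform(new_texts)]
).tocsr()
pred = clf.predict(X_new)
print(pred)
```

The same column-wise stacking can also be done inside a pipeline with scikit-learn's `FeatureUnion`, which avoids repeating the `hstack` call at prediction time.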

Here's a SO answer on concatenating arrays

Adnan S