How to use bigrams + trigrams + word-marks vocabulary in CountVectorizer?

Question

I'm doing text classification with naive Bayes and CountVectorizer to classify dialects. I read a research paper in which the author used a combination of:

bigrams + trigrams + word-marks vocabulary

By word-marks, the author means words that are specific to a certain dialect.

How can I set CountVectorizer's parameters to get this combination?

[image: word marks]

Those are examples of word marks, but they aren't the ones I have, because mine are Arabic. So I translated them:

word_marks=['love', 'funny', 'happy', 'amazing']

Those are used to classify a text.

Also, in this post: Understanding the `ngram_range` argument in a CountVectorizer in sklearn

there was this answer:

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> v = CountVectorizer(ngram_range=(1, 2), vocabulary={"keeps", "keeps the"})
>>> v.fit_transform(["an apple a day keeps the doctor away"]).toarray()
array([[1, 1]])  # unigram and bigram found

I couldn't understand the output. What does [1, 1] mean here? And how was he able to use ngram_range together with vocabulary? Aren't the two mutually exclusive?

Do you want to use a single CountVectorizer to get all three: bigrams, trigrams and the word-marks vocabulary? And do you have a dictionary of words per dialect? Please add an example with input and expected output – a'-, May 10 '19 at 19:58
Yes, I want to use all three together and I do have a dictionary of words per dialect – John Sall, May 11 '19 at 10:42
array([[1, 1]]) means that the CountVectorizer found one instance of "keeps" and one instance of "keeps the" in the input sentence. The code only looks for matches in the custom vocabulary. Try changing the input sentence to add more instances of "keeps" and "keeps the" to see how the output counts change. – Adnan S, May 12 '19 at 18:07
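
The experiment suggested in the last comment can be sketched as follows. This is a minimal illustration, not part of the original thread: the second sentence is made up, and the vocabulary is passed as a list so that the column order is explicit ("keeps" first, then "keeps the").

```python
from sklearn.feature_extraction.text import CountVectorizer

# A fixed vocabulary restricts counting to these entries only.
# ngram_range=(1, 2) is still needed so that the bigram "keeps the"
# is generated from the text and can be matched against the vocabulary.
vocab = ["keeps", "keeps the"]
v = CountVectorizer(ngram_range=(1, 2), vocabulary=vocab)

one = v.fit_transform(["an apple a day keeps the doctor away"]).toarray()
# Made-up sentence with "keeps" twice and "keeps the" once.
two = v.transform(["it keeps the doctor away and keeps me healthy"]).toarray()

print(one)  # [[1 1]]
print(two)  # [[2 1]]
```

Each column is the count of one vocabulary entry, so ngram_range and vocabulary work together rather than excluding each other: the first controls which n-grams are extracted, the second controls which of them are kept.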

Adnan S · Accepted Answer · 2019-05-12T18:09:41.040


Use the `ngram_range` argument to get bigrams and trigrams. In your case, it would be CountVectorizer(ngram_range=(1, 3)), which also keeps unigrams; use ngram_range=(2, 3) if you want bigrams and trigrams only.

See the accepted answer to this question for more details.

Please provide an example of "word-marks" for the other part of your question.

You may have to run CountVectorizer twice - once for n-grams and once for your custom word-mark vocabulary. You can then concatenate the two outputs from the two CountVectorizers to get a single feature set of n-gram counts and custom vocabulary counts. The answer to the above question also explains how to specify a custom vocabulary for this second use of CountVectorizer.

Here's a SO answer on concatenating arrays
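
A minimal sketch of the two-vectorizer approach described above, assuming scikit-learn and SciPy are available. The training texts, labels and test sentence are made-up placeholders; the word marks are the translated examples from the question. The two sparse outputs are joined column-wise with scipy.sparse.hstack.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy data: texts and labels are placeholders for illustration only.
train_texts = [
    "what a funny and amazing day",
    "i love this happy little town",
    "the train was late again today",
    "the report is due next week",
]
train_labels = ["dialect_a", "dialect_a", "dialect_b", "dialect_b"]
word_marks = ["love", "funny", "happy", "amazing"]

# Vectorizer 1: bigrams and trigrams learned from the training texts.
ngram_vec = CountVectorizer(ngram_range=(2, 3))
# Vectorizer 2: counts only the fixed word-mark vocabulary.
marks_vec = CountVectorizer(vocabulary=word_marks)

# Fit on the training data, then stack the two sparse matrices side by side.
X_train = hstack([
    ngram_vec.fit_transform(train_texts),
    marks_vec.fit_transform(train_texts),
]).tocsr()

clf = MultinomialNB().fit(X_train, train_labels)

# For new text, call transform (not fit_transform) and stack in the same order,
# so the columns line up with the ones the classifier was trained on.
new_texts = ["such a happy and funny story"]
X_new = hstack([
    ngram_vec.transform(new_texts),
    marks_vec.transform(new_texts),
]).tocsr()
prediction = clf.predict(X_new)
print(X_train.shape, X_new.shape, prediction)
```

The concatenation happens on the transformed matrices: once after fitting on the training set, and again (with transform only) on anything you want to predict.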

edited May 12 '19 at 18:09

answered May 11 '19 at 00:41

Adnan S


I have edited the post, please see what I added. Also, can you please provide an example of how I can concatenate two vectors after fitting them? – John Sall May 12 '19 at 13:57
Is this possible to do when I have a pipeline? – John Sall May 16 '19 at 09:14
Also, do I concatenate at the fit stage or the transform stage? And how will I be able to predict? – John Sall May 16 '19 at 09:26
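
On the pipeline question in the comments: one possible approach, not given in the original answer, is scikit-learn's FeatureUnion, which runs several transformers on the same input and concatenates their outputs. The sketch below uses made-up texts and labels, and the step names ("features", "ngrams", "marks", "nb") are arbitrary.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, Pipeline

# Placeholder word marks, texts and labels for illustration only.
word_marks = ["love", "funny", "happy", "amazing"]
texts = [
    "what a funny and amazing day",
    "i love this happy little town",
    "the train was late again today",
    "the report is due next week",
]
labels = ["dialect_a", "dialect_a", "dialect_b", "dialect_b"]

model = Pipeline([
    # FeatureUnion concatenates the outputs of both vectorizers column-wise.
    ("features", FeatureUnion([
        ("ngrams", CountVectorizer(ngram_range=(2, 3))),
        ("marks", CountVectorizer(vocabulary=word_marks)),
    ])),
    ("nb", MultinomialNB()),
])

# fit() fits both vectorizers and the classifier;
# predict() applies transform on both and concatenates before classifying.
model.fit(texts, labels)
predicted = model.predict(["i love this funny town"])
print(predicted)
```

With this layout the concatenation is handled inside the pipeline at both the fit and the predict stage, so there is no manual stacking step.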
