N-grams for letter in sklearn

Question

I want to generate n-grams, but letter by letter rather than word by word.

Normal N-grams:

sentence : He want to watch football match

result:
he, he want, want, want to, to, to watch, watch, watch football, football, football match, match

I want to do this but letter by letter:

word : Angela

result:
a, an, n, ng, g, ge, e, el, l, la, a

This is my code using sklearn, but it still works word by word, not letter by letter:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(ngram_range=(1, 100),token_pattern = r"(?u)\b\w+\b")

corpus = ['Angel','Angelica','John','Johnson']

X = vectorizer.fit_transform(corpus)
analyze = vectorizer.build_analyzer()
print(vectorizer.get_feature_names_out())  # get_feature_names() on scikit-learn < 1.0
print(vectorizer.transform(['Angela']).toarray())

score 7 · Accepted Answer · answered Oct 29 '18 at 05:51

CountVectorizer has an 'analyzer' parameter that does what you want.

According to the documentation:

analyzer : string, {‘word’, ‘char’, ‘char_wb’} or callable

Whether the feature should be made of word or character n-grams. Option ‘char_wb’ creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space.

If a callable is passed it is used to extract the sequence of features out of the raw, unprocessed input.
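A small sketch of how the 'char' and 'char_wb' options quoted above differ. The input string 'ab cd' and the ngram_range=(2, 2) setting are assumptions chosen only for illustration; they are not from the question.

```python
from sklearn.feature_extraction.text import CountVectorizer

text = 'ab cd'  # illustrative input, not from the question

# 'char': n-grams slide over the whole string, including the space between words.
char_analyze = CountVectorizer(analyzer='char', ngram_range=(2, 2)).build_analyzer()
char_ngrams = char_analyze(text)

# 'char_wb': each word is padded with a space on both sides,
# and n-grams never span two words.
wb_analyze = CountVectorizer(analyzer='char_wb', ngram_range=(2, 2)).build_analyzer()
wb_ngrams = wb_analyze(text)

print(char_ngrams)
print(wb_ngrams)
```

For single-word inputs such as the names in the question, the main visible difference is the space padding that 'char_wb' adds at the word edges.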

By default it is set to 'word', which you can change.

Just do:

# token_pattern is only used when analyzer='word', so it is dropped here
vectorizer = CountVectorizer(ngram_range=(1, 100),
                             analyzer='char')
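As a rough end-to-end check, here is the question's corpus run through a character-level vectorizer. The ngram_range=(1, 2) setting is an assumption made to match the single letters and letter pairs in the question's expected result; the answer above keeps the original (1, 100).

```python
from sklearn.feature_extraction.text import CountVectorizer

# analyzer='char' builds n-grams from characters; ngram_range=(1, 2)
# keeps single letters and adjacent pairs (an assumption, see above).
vectorizer = CountVectorizer(analyzer='char', ngram_range=(1, 2))

corpus = ['Angel', 'Angelica', 'John', 'Johnson']
X = vectorizer.fit_transform(corpus)

# Input is lowercased by default (lowercase=True).
analyze = vectorizer.build_analyzer()
angela_ngrams = analyze('Angela')
print(angela_ngrams)

# Count vector for an unseen word; n-grams not seen during fit are ignored.
angela_counts = vectorizer.transform(['Angela']).toarray()
print(angela_counts)
```

Note that the analyzer lists all single letters first and then the pairs, so the order differs from the interleaved listing in the question, but the set of n-grams is the same.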
