I have integer-type features in my feature vector that NLTK’s NaiveBayesClassifier
is treating as nominal values.
Context
I am trying to build a language classifier using n-grams. For instance, the bigram 'th' is more common in English than in French.
For each sentence in my training set, I extract features of the form bigram(th): 5,
where 5 (as an example) is the number of times the bigram 'th' appears in the sentence.
When I build a classifier with features like this and check the most informative features, I see that the classifier does not treat the feature values as numbers on a scale. For example, it might consider bigram(ea): 4
as indicating French, bigram(ea): 5
as indicating English, and bigram(ea): 6
as French again. This is quite arbitrary and does not capture the idea that a bigram is simply more common in one language than in the other, which is why I need the integers to be treated as counts rather than as distinct labels.
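To make the observation concrete, here is a minimal sketch of the kind of training and inspection where I see this behaviour; the featuresets and labels below are made up purely for illustration:

import nltk

# Toy featuresets (invented for illustration). NaiveBayesClassifier
# treats each distinct integer value of a feature as its own nominal
# value, so bigram(ea) == 4, 5 and 6 are unrelated categories to it.
train_set = [
    ({'bigram(th)': 5, 'bigram(ea)': 4}, 'English'),
    ({'bigram(th)': 1, 'bigram(ea)': 6}, 'French'),
    ({'bigram(th)': 4, 'bigram(ea)': 5}, 'English'),
    ({'bigram(th)': 0, 'bigram(ea)': 3}, 'French'),
]

classifier = nltk.NaiveBayesClassifier.train(train_set)
classifier.show_most_informative_features(10)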
More thoughts
Of course, I could replace these features with boolean features such as has(th): True
. However, I believe this is a bad idea, because a French sentence with 1 instance of 'th' and an English sentence with 5 instances of 'th' would both end up with the feature has(th): True
, which cannot differentiate them.
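For clarity, this is roughly what that boolean variant would look like; it is only a sketch, the function name is invented, and it reuses the get_ngrams helper from the Feature Extractor section below:

def get_boolean_bigram_features(sentence_tokens):
    # Presence/absence only: a sentence with one 'th' and a sentence
    # with five occurrences of 'th' get exactly the same feature.
    features = {}
    for word in sentence_tokens:
        for ngram in get_ngrams(word, 2):
            features[f'has({ngram})'] = True
    return features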
I also found this relevant link but it did not provide me with the answer.
Feature Extractor
My feature extractor looks like this:
from nltk.util import ngrams

def get_ngrams(word, n):
    # Character n-grams of a word, padded on both sides with '_'
    ngram_tuples = ngrams(word, n, pad_left=True, pad_right=True,
                          left_pad_symbol='_', right_pad_symbol='_')
    return [''.join(ngram_tuple) for ngram_tuple in ngram_tuples]

# Feature extractor
def get_ngram_features(sentence_tokens):
    features = {}
    # Unigrams
    for word in sentence_tokens:
        for ngram in get_ngrams(word, 1):
            features[f'char({ngram})'] = features.get(f'char({ngram})', 0) + 1
    # Bigrams
    for word in sentence_tokens:
        for ngram in get_ngrams(word, 2):
            features[f'bigram({ngram})'] = features.get(f'bigram({ngram})', 0) + 1
    # Trigrams
    for word in sentence_tokens:
        for ngram in get_ngrams(word, 3):
            features[f'trigram({ngram})'] = features.get(f'trigram({ngram})', 0) + 1
    # Quadrigrams
    for word in sentence_tokens:
        for ngram in get_ngrams(word, 4):
            features[f'quadrigram({ngram})'] = features.get(f'quadrigram({ngram})', 0) + 1
    return features
Feature Extraction Example
get_ngram_features(['test', 'sentence'])
Returns:
{'char(c)': 1,
'char(e)': 4,
'char(n)': 2,
'char(s)': 2,
'char(t)': 3,
'bigram(_s)': 1,
'bigram(_t)': 1,
'bigram(ce)': 1,
'bigram(e_)': 1,
'bigram(en)': 2,
'bigram(es)': 1,
'bigram(nc)': 1,
'bigram(nt)': 1,
'bigram(se)': 1,
'bigram(st)': 1,
'bigram(t_)': 1,
'bigram(te)': 2,
'quadrigram(_sen)': 1,
'quadrigram(_tes)': 1,
'quadrigram(ence)': 1,
'quadrigram(ente)': 1,
'quadrigram(est_)': 1,
'quadrigram(nce_)': 1,
'quadrigram(nten)': 1,
'quadrigram(sent)': 1,
'quadrigram(tenc)': 1,
'quadrigram(test)': 1,
'trigram(_se)': 1,
'trigram(_te)': 1,
'trigram(ce_)': 1,
'trigram(enc)': 1,
'trigram(ent)': 1,
'trigram(est)': 1,
'trigram(nce)': 1,
'trigram(nte)': 1,
'trigram(sen)': 1,
'trigram(st_)': 1,
'trigram(ten)': 1,
'trigram(tes)': 1}
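For completeness, this is roughly how these featuresets are fed to the classifier; the tokenized sentences and labels below are placeholders, not my real training data:

import nltk

# Placeholder training data: (tokenized sentence, language) pairs
labeled_sentences = [
    (['the', 'weather', 'is', 'nice'], 'English'),
    (['le', 'temps', 'est', 'beau'], 'French'),
]

train_set = [(get_ngram_features(tokens), language)
             for tokens, language in labeled_sentences]

classifier = nltk.NaiveBayesClassifier.train(train_set)

# Classify a new (placeholder) sentence
print(classifier.classify(get_ngram_features(['une', 'phrase'])))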