
I have integer-type features in my feature vector that NLTK’s NaiveBayesClassifier is treating as nominal values.

Context

I am trying to build a language classifier using n-grams. For instance, the bigram ‘th’ is more common in English than in French.

For each sentence in my training set, I extract features as follows: bigram(th): 5, where 5 (an example value) is the number of times the bigram ‘th’ appeared in the sentence.

When I build a classifier with features like this and check the most informative features, I realize that the classifier does not treat such values as numeric: each count is handled as an unrelated nominal value. For example, it might consider bigram(ea): 4 as French, bigram(ea): 5 as English and bigram(ea): 6 as French again. This is quite arbitrary and does not reflect the logic that a bigram is either more common in English or in French, which is why I need the integers to be treated as such.
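Here is a toy reproduction of what I mean, with made-up featuresets, showing that each distinct count gets its own entry in the model:

import nltk

# Made-up training data: each distinct count of 'ea' is treated as a
# separate nominal value rather than a point on a numeric scale
train = [
    ({'bigram(ea)': 4}, 'french'),
    ({'bigram(ea)': 5}, 'english'),
    ({'bigram(ea)': 6}, 'french'),
]
classifier = nltk.NaiveBayesClassifier.train(train)
classifier.show_most_informative_features()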

More thoughts

Of course, I could replace these features with features such as has(th): True. However, I believe this is a bad idea because both a French sentence with 1 instance of 'th' and an English sentence with 5 instances of 'th' would have the feature has(th): True, which cannot differentiate them. A sketch of this binarized version is below.
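For reference, the binarized alternative would look something like this (building on the extractor shown further down; the helper name is just illustrative):

# Binarized alternative: keeps only presence/absence and loses the counts
def get_binary_ngram_features(sentence_tokens):
    return {name: True for name in get_ngram_features(sentence_tokens)}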

I also found this relevant link but it did not provide me with the answer.

Feature Extractor

My feature extractor looks like this:

from nltk import ngrams

def get_ngrams(word, n):
    # Character n-grams of a word, padded with '_' on both sides
    ngrams_list = []
    ngrams_list.append(list(ngrams(word, n, pad_left=True, pad_right=True, left_pad_symbol='_', right_pad_symbol='_')))
    ngrams_flat_tuples = [ngram for ngram_list in ngrams_list for ngram in ngram_list]
    # Join each n-gram tuple back into a string, e.g. ('t', 'h') -> 'th'
    format_string = ''
    for i in range(0, n):
        format_string += '%s'
    ngrams_list_flat = [format_string % ngram_tuple for ngram_tuple in ngrams_flat_tuples]
    return ngrams_list_flat

# Feature extractor
def get_ngram_features(sentence_tokens):
    features = {}
    # Unigrams
    for word in sentence_tokens:
        word_ngrams = get_ngrams(word, 1)
        for ngram in word_ngrams:
            features[f'char({ngram})'] = features.get(f'char({ngram})', 0) + 1
    # Bigrams
    for word in sentence_tokens:
        word_ngrams = get_ngrams(word, 2)
        for ngram in word_ngrams:
            features[f'bigram({ngram})'] = features.get(f'bigram({ngram})', 0) + 1
    # Trigrams
    for word in sentence_tokens:
        word_ngrams = get_ngrams(word, 3)
        for ngram in word_ngrams:
            features[f'trigram({ngram})'] = features.get(f'trigram({ngram})', 0) + 1
    # Quadrigrams
    for word in sentence_tokens:
        word_ngrams = get_ngrams(word, 4)
        for ngram in word_ngrams:
            features[f'quadrigram({ngram})'] = features.get(f'quadrigram({ngram})', 0) + 1
    return features

Feature Extraction Example

get_ngram_features(['test', 'sentence'])

Returns:

{'char(c)': 1,
 'char(e)': 4,
 'char(n)': 2,
 'char(s)': 2,
 'char(t)': 3,
 'bigram(_s)': 1,
 'bigram(_t)': 1,
 'bigram(ce)': 1,
 'bigram(e_)': 1,
 'bigram(en)': 2,
 'bigram(es)': 1,
 'bigram(nc)': 1,
 'bigram(nt)': 1,
 'bigram(se)': 1,
 'bigram(st)': 1,
 'bigram(t_)': 1,
 'bigram(te)': 2,
 'quadrigram(_sen)': 1,
 'quadrigram(_tes)': 1,
 'quadrigram(ence)': 1,
 'quadrigram(ente)': 1,
 'quadrigram(est_)': 1,
 'quadrigram(nce_)': 1,
 'quadrigram(nten)': 1,
 'quadrigram(sent)': 1,
 'quadrigram(tenc)': 1,
 'quadrigram(test)': 1,
 'trigram(_se)': 1,
 'trigram(_te)': 1,
 'trigram(ce_)': 1,
 'trigram(enc)': 1,
 'trigram(ent)': 1,
 'trigram(est)': 1,
 'trigram(nce)': 1,
 'trigram(nte)': 1,
 'trigram(sen)': 1,
 'trigram(st_)': 1,
 'trigram(ten)': 1,
 'trigram(tes)': 1}
  • Did you try turning the occurrences into nominal features by categorizing them compared to a threshold? (i.e. th<5, th>=5) – KonstantinosKokos Apr 01 '18 at 17:32
  • @KonstantinosKokos That's not ideal because it's difficult to determine the best thresholds for all n-grams – hb20007 Apr 01 '18 at 17:36
  • @KonstantinosKokos Your name sounds Greek. Actually, the exact problem I'm working on involves the Greek language. You can take a look here: https://github.com/hb20007/greek-dialect-classifier/blob/master/3-Building-the-Classifier.ipynb – hb20007 Apr 01 '18 at 17:37
  • That looks fun. I don't know how stuck you are with naive Bayes, but I'm pretty certain other simple learning algorithms such as random forest will do the threshold picking on their own. – KonstantinosKokos Apr 01 '18 at 17:58
  • @KonstantinosKokos Would decision trees work for this kind of problem? The total number of possible uni, bi, tri and quadrigrams in a language is massive. I can't imagine a decision tree being constructed with all of these. – hb20007 Apr 01 '18 at 18:04
  • Applying some sort of prefiltering would definitely help (i.e. same or similar occurrence counts between languages could be dismissed) – KonstantinosKokos Apr 01 '18 at 18:07
  • @KonstantinosKokos That's possible, I could focus on the most discriminating n-grams. I wonder if it's the best way to go about doing this... – hb20007 Apr 01 '18 at 18:10
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/168002/discussion-between-konstantinoskokos-and-hb20007). – KonstantinosKokos Apr 01 '18 at 18:13
  • Could you post your feature extraction function in the question? – alvas Apr 01 '18 at 23:41
  • @alvas Just edited the question to add my feature extractor – hb20007 Apr 02 '18 at 06:54
  • Honestly, that's some nutty extractor =) – alvas Apr 03 '18 at 08:49
  • As well as/instead of featurizing for each possible value n of `bigram(xx): n` in a sentence, just compute the character density: 2 * number of bigrams 'xx' / number of chars in the sentence. And yes, random forests will discover both the features and the threshold values which give maximal improvement in the impurity metric. – smci Apr 03 '18 at 09:39

1 Answer


TL;DR

It's easier to use other libraries for this purpose, e.g. sklearn with a custom analyzer as shown in https://www.kaggle.com/alvations/basic-nlp-with-nltk, i.e. CountVectorizer(analyzer=preprocess_text)

For example:

from io import StringIO
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from nltk import everygrams

def sent_process(sent):
    # Character n-grams (orders 1-4) with '_' marking word boundaries
    return [''.join(ng) for ng in everygrams(sent.replace(' ', '_ _'), 1, 4) 
            if ' ' not in ng and '\n' not in ng and ng != ('_',)]

sent1 = "The quick brown fox jumps over the lazy brown dog."
sent2 = "Mr brown jumps over the lazy fox."
sent3 = 'Mr brown quickly jumps over the lazy dog.'
sent4 = 'The brown quickly jumps over the lazy fox.'

with StringIO('\n'.join([sent1, sent2])) as fin:
    # Override the analyzer totally with our preprocessing function
    count_vect = CountVectorizer(analyzer=sent_process)
    count_vect.fit_transform(fin)
count_vect.vocabulary_  # mapping from n-gram to column index

train_set = count_vect.fit_transform([sent1, sent2])

# To train the classifier (the labels here are just placeholders)
clf = MultinomialNB() 
clf.fit(train_set, ['pos', 'neg']) 

test_set = count_vect.transform([sent3, sent4])
clf.predict(test_set)
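If you also want something like NLTK's show_most_informative_features() for the sklearn model, you can read it off the fitted log-probabilities; a rough sketch (the helper below is not part of sklearn, and get_feature_names_out needs a reasonably recent scikit-learn):

import numpy as np

def most_informative(vectorizer, classifier, class_index=0, n=10):
    # Pair each n-gram with its log-probability under the chosen class
    # and return the n strongest ones
    names = vectorizer.get_feature_names_out()
    top = np.argsort(classifier.feature_log_prob_[class_index])[::-1][:n]
    return [(names[i], classifier.feature_log_prob_[class_index, i]) for i in top]

most_informative(count_vect, clf)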

Cut-away

Firstly, there's really no need to explicitly add the char(...), unigram(...), bigram(...), trigram(...) and quadrigram(...) labels to the feature names.

The features are just dictionary keys, so you can use the actual n-gram tuples as the keys, e.g.

from collections import Counter
from nltk import ngrams, word_tokenize

features = Counter(ngrams(word_tokenize('This is a something foo foo bar foo foo sentence'), 2))

[out]:

>>> features
Counter({('This', 'is'): 1,
         ('a', 'something'): 1,
         ('bar', 'foo'): 1,
         ('foo', 'bar'): 1,
         ('foo', 'foo'): 2,
         ('foo', 'sentence'): 1,
         ('is', 'a'): 1,
         ('something', 'foo'): 1})

As for ngrams of several orders, you can use everygrams(), e.g.

from collections import Counter
from nltk import everygrams, word_tokenize

sent = word_tokenize('This is a something foo foo bar foo foo sentence')
Counter(everygrams(sent, 1, 4))

[out]:

Counter({('This',): 1,
         ('This', 'is'): 1,
         ('This', 'is', 'a'): 1,
         ('This', 'is', 'a', 'something'): 1,
         ('a',): 1,
         ('a', 'something'): 1,
         ('a', 'something', 'foo'): 1,
         ('a', 'something', 'foo', 'foo'): 1,
         ('bar',): 1,
         ('bar', 'foo'): 1,
         ('bar', 'foo', 'foo'): 1,
         ('bar', 'foo', 'foo', 'sentence'): 1,
         ('foo',): 4,
         ('foo', 'bar'): 1,
         ('foo', 'bar', 'foo'): 1,
         ('foo', 'bar', 'foo', 'foo'): 1,
         ('foo', 'foo'): 2,
         ('foo', 'foo', 'bar'): 1,
         ('foo', 'foo', 'bar', 'foo'): 1,
         ('foo', 'foo', 'sentence'): 1,
         ('foo', 'sentence'): 1,
         ('is',): 1,
         ('is', 'a'): 1,
         ('is', 'a', 'something'): 1,
         ('is', 'a', 'something', 'foo'): 1,
         ('sentence',): 1,
         ('something',): 1,
         ('something', 'foo'): 1,
         ('something', 'foo', 'foo'): 1,
         ('something', 'foo', 'foo', 'bar'): 1})

A clean way to extract the features you want:

def sent_vectorizer(sent):
    return [''.join(ng) for ng in everygrams(sent.replace(' ', '_ _'), 1, 4) 
            if ' ' not in ng and ng != ('_',)]
Counter(sent_vectorizer('This is a something foo foo bar foo foo sentence'))

[out]:

Counter({'o': 9, 's': 4, 'e': 4, 'f': 4, '_f': 4, 'fo': 4, 'oo': 4, 'o_': 4, '_fo': 4, 'foo': 4, 'oo_': 4, '_foo': 4, 'foo_': 4, 'i': 3, 'n': 3, 'h': 2, 'a': 2, 't': 2, 'hi': 2, 'is': 2, 's_': 2, '_s': 2, 'en': 2, 'is_': 2, 'T': 1, 'm': 1, 'g': 1, 'b': 1, 'r': 1, 'c': 1, 'Th': 1, '_i': 1, '_a': 1, 'a_': 1, 'so': 1, 'om': 1, 'me': 1, 'et': 1, 'th': 1, 'in': 1, 'ng': 1, 'g_': 1, '_b': 1, 'ba': 1, 'ar': 1, 'r_': 1, 'se': 1, 'nt': 1, 'te': 1, 'nc': 1, 'ce': 1, 'Thi': 1, 'his': 1, '_is': 1, '_a_': 1, '_so': 1, 'som': 1, 'ome': 1, 'met': 1, 'eth': 1, 'thi': 1, 'hin': 1, 'ing': 1, 'ng_': 1, '_ba': 1, 'bar': 1, 'ar_': 1, '_se': 1, 'sen': 1, 'ent': 1, 'nte': 1, 'ten': 1, 'enc': 1, 'nce': 1, 'This': 1, 'his_': 1, '_is_': 1, '_som': 1, 'some': 1, 'omet': 1, 'meth': 1, 'ethi': 1, 'thin': 1, 'hing': 1, 'ing_': 1, '_bar': 1, 'bar_': 1, '_sen': 1, 'sent': 1, 'ente': 1, 'nten': 1, 'tenc': 1, 'ence': 1})
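If you do want to stay with NLTK's NaiveBayesClassifier, the Counter above is just a dict, so it can be passed in directly as a featureset; a sketch with two made-up training sentences (but see the caveat below about how the integer values are treated):

import nltk

# Made-up two-sentence training set, just to show the plumbing
train = [
    (Counter(sent_vectorizer('the quick brown fox')), 'english'),
    (Counter(sent_vectorizer('le renard brun rapide')), 'french'),
]
classifier = nltk.NaiveBayesClassifier.train(train)
classifier.classify(Counter(sent_vectorizer('the lazy dog')))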

In Long

Unfortunately, there's no easy way to change how the NaiveBayesClassifier in NLTK handles feature values; that behaviour is hardcoded.

If we look at https://github.com/nltk/nltk/blob/develop/nltk/classify/naivebayes.py#L185, behind the scenes NLTK is already counting the occurrences of the features.

But note that it counts document frequency, not term frequency, i.e. regardless of how many times a feature value appears in a document, it counts as one. There isn't a clean way to add up the value of each feature without changing the NLTK code, since it's hardcoded to do +=1; see https://github.com/nltk/nltk/blob/develop/nltk/classify/naivebayes.py#L201
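Given that, one pragmatic workaround (echoing the thresholding idea from the comments on the question, not something NLTK provides) is to bin the raw counts into a few coarse nominal buckets before training; the bucket boundaries below are made up and would need tuning:

# Sketch only: thresholds are arbitrary and would need tuning per n-gram
def bin_count(count):
    if count == 0:
        return 'none'
    elif count <= 2:
        return 'few'
    return 'many'

def bin_features(features):
    # Turn {'bigram(th)': 5, ...} into {'bigram(th)': 'many', ...}
    return {name: bin_count(value) for name, value in features.items()}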

  • Indeed, sklearn's `MultinomialNB` classifier works for integer features. I vectorized my data, built the classifier, then checked the most informative features using https://stackoverflow.com/a/11140887/4304516 – hb20007 Apr 06 '18 at 16:41