
I'm doing some work on gender classification for a class. I've been using SVMLight with decent results, but I wanted to try some Bayesian methods on my data as well. My dataset consists of text data, and I've done feature reduction to pare the feature space down to a more reasonable size for some of the Bayesian methods. All of the instances are run through tf-idf and then normalized (through my own code).
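Roughly, the preprocessing amounts to something like the sketch below (using scikit-learn's TfidfTransformer in place of my own weighting code; raw_counts is just a placeholder for my term-count matrix):

import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.preprocessing import normalize

# raw_counts: (n_instances, n_features) matrix of raw term counts (placeholder data)
raw_counts = np.random.randint(0, 5, size=(300, 500))

# tf-idf weighting, then L2-normalize each instance vector
tfidf = TfidfTransformer(norm=None).fit_transform(raw_counts)
X = normalize(tfidf, norm='l2')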

I grabbed the sklearn toolkit because it was easy to integrate with my current codebase, but the results I'm getting from GaussianNB are all of one class (-1 in this case), and the predicted probabilities are all [nan].

I've pasted some relevant code; I don't know if this is enough to go on, but I'm hoping that I'm just overlooking something obvious in using the sklearn API. I have a couple of different feature sets that I've tried pushing through it, all with the same results. The same thing happens when testing on the training set and when using cross-validation. Any thoughts? Could it be that my feature space is simply too sparse for this to work? I have 300-odd instances, most of which have several hundred non-zero features.

import numpy as np
from sklearn.naive_bayes import GaussianNB

class GNBLearner(BaseLearner):
    def __init__(self, featureCount):
        self.gnb = GaussianNB()
        self.featureCount = featureCount

    def train(self, instances, params):
        # Build a dense (n_instances, n_features) design matrix from the
        # sparse (index, value) pairs; feature indices are 1-based.
        X = np.zeros((len(instances), self.featureCount))
        Y = [0] * len(instances)
        for i, inst in enumerate(instances):
            for idx, val in inst.data:
                X[i, idx - 1] = val
            Y[i] = inst.c
        self.gnb.fit(X, Y)

    def test(self, instances, params):
        X = np.zeros((len(instances), self.featureCount))
        for i, inst in enumerate(instances):
            for idx, val in inst.data:
                X[i, idx - 1] = val
        return self.gnb.predict(X)

    def conf_mtx(self, res, test_set):
        # 2x2 confusion matrix; labels are -1/+1, mapped to indices 0/1
        conf = [[0, 0], [0, 0]]
        for r, x in zip(res, test_set):
            print "pred: %d, act: %d" % (r, x.c)
            conf[(x.c + 1) / 2][(r + 1) / 2] += 1
        return conf
flatline
    This is really hard to tell without seeing the data as well, or at least a sample of it. First question, though: are you sure `GaussianNB` is appropriate? Are your features (roughly) Gaussians, i.e. normally distributed? – Fred Foo Apr 26 '13 at 16:08
  • Good question. I'm actually not sure what effect the tf-idf and normalization have on the distribution, but it is quite possibly not Gaussian. I honestly just grabbed this from the toolkit because it handles continuous features, so it may well be a bad choice for the data. I'm still not sure whether that explains the results I'm getting. – flatline Apr 26 '13 at 16:14
  • I'd missed the fact that they're tf-idf vectors. I'll whip up an answer. – Fred Foo Apr 26 '13 at 16:19

1 Answer


GaussianNB is not a good fit for document classification at all, since tf-idf values are non-negative frequencies; use MultinomialNB instead, and maybe try BernoulliNB. scikit-learn comes with a document classification example that, incidentally, does its tf-idf weighting with the built-in TfidfTransformer.
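In your wrapper the swap is essentially a one-liner; a rough sketch (assuming the same dense X, Y and X_test arrays you are already building):

from sklearn.naive_bayes import MultinomialNB, BernoulliNB

# MultinomialNB works directly on non-negative tf-idf weights
clf = MultinomialNB()
clf.fit(X, Y)
pred = clf.predict(X_test)
proba = clf.predict_proba(X_test)  # should no longer be all NaN

# BernoulliNB binarizes the features (presence/absence above a threshold)
clf_b = BernoulliNB(binarize=0.0)
clf_b.fit(X, Y)

Both of these also accept scipy.sparse matrices, so you don't strictly need the dense np.zeros buffer.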

Don't expect miracles, though, as 300 samples is quite small for a training set (although for binary classification, it might just be enough to beat a "most frequent" baseline). YMMV.
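If you want to measure that baseline explicitly, here's a quick sketch with scikit-learn's DummyClassifier (the 'most_frequent' strategy always predicts the majority class):

from sklearn.dummy import DummyClassifier

baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X, Y)
print baseline.score(X_test, Y_test)  # accuracy of always guessing the majority class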

Full disclosure: I'm one of the scikit-learn core devs and the main author of the current MultinomialNB and BernoulliNB code.

Fred Foo
  • Thanks - both the Multinomial and Bernoulli classifiers worked. Re: miracles, you may be surprised, but I'm actually getting quite good results (~84% accuracy), on par with the SVM results. – flatline Apr 26 '13 at 17:43
  • @flatline: that's a good score! Given that you're doing gender classification, I expect the baseline to be just over 50%? – Fred Foo Apr 26 '13 at 18:28
  • The baseline is a bit more biased unfortunately - 58% male - but this is still a better result than I expected at the outset. I don't think I'm going to eke anything else out of it at this point, but you never know. Scikit-learn looks like a really nice package btw, I like it a lot better than Weka so far. MultinomialNB/BernoulliNB at least can handle a much larger feature space than I thought I'd be able to use with Bayesian methods. – flatline Apr 26 '13 at 18:50