
I use the TensorFlow high-level API tf.learn to train and evaluate a DNN classifier for a series of binary text classifications (actually I need multi-label classification, but at the moment I check each label separately). My code is very similar to the tf.learn tutorial:

import tensorflow as tf

# training_set / validation_set hold the feature matrix (.data) and labels (.target).
classifier = tf.contrib.learn.DNNClassifier(
    hidden_units=[10],
    n_classes=2,
    dropout=0.1,
    feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(training_set.data))

# Train for 100 steps, then evaluate accuracy on the held-out validation set.
classifier.fit(x=training_set.data, y=training_set.target, steps=100)
val_accuracy_score = classifier.evaluate(x=validation_set.data, y=validation_set.target)["accuracy"]

The accuracy score varies roughly from 54% to 90% across runs, with the same 21 documents in the validation (test) set every time.

What does this large variation mean? I understand there are some random factors (e.g. dropout), but to my understanding the model should converge towards an optimum.

I use words (lemmas), bi- and trigrams, sentiment scores and LIWC scores as features, so I do have a very high-dimensional feature space, with only 28 training and 21 validation documents. Can this cause problems? How can I consistently improve the results apart from collecting more training data?

Update: To clarify, I generate a dictionary of occurring words and n-grams and discard those that occur only once, so I only use words (n-grams) that actually occur in the corpus. As an illustration, the dictionary step is roughly equivalent to the scikit-learn sketch below (the toy corpus list and the min_df=2 cutoff are just placeholders, not my actual code).
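from sklearn.feature_extraction.text import CountVectorizer

# corpus: a list of raw document strings (placeholder for the real data).
corpus = ["first training document ...", "second training document ..."]

# Unigrams, bigrams and trigrams; min_df=2 drops terms seen in fewer than two
# documents, which approximately mirrors the "discard single occurrences" step.
vectorizer = CountVectorizer(ngram_range=(1, 3), min_df=2)
X = vectorizer.fit_transform(corpus)

print(X.shape)  # (n_documents, n_surviving_n_gram_features)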

Pawit

1 Answer


This has nothing to do with TensorFlow. The dataset is ridiculously small, so you can obtain just about any result. You have 28 + 21 points in a space with an effectively "infinite" number of dimensions: there are around 1,000,000 English words, hence roughly (10^6)^3 = 10^18 possible trigrams. Many of those never occur, and certainly not in your 49 documents, but you still have at least 1,000,000 dimensions. For such a problem, you have to expect huge variance in the results.

How can I consistently improve the results apart from collecting more training data?

You pretty much cannot. This is simply far too small a sample to do any statistical analysis on.

Consequently, the best you can do is change the evaluation scheme: instead of splitting the data 28/21, run 10-fold cross-validation. With ~50 points this means running 10 experiments, each with roughly 45 training documents and 4-5 test documents, and averaging the results. This is the only thing you can do to reduce the variance. However, remember that even with CV, a dataset this small gives you no guarantees about how well your model will actually behave "in the wild" (once applied to never-before-seen data). A minimal sketch of such a scheme with scikit-learn's KFold could look as follows; it assumes the documents are already vectorized into a feature matrix X with labels y, and uses a plain logistic regression instead of the DNN just to keep the example self-contained:
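import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# X: (n_documents, n_features) feature matrix, y: binary labels.
# Assumed to combine all ~49 documents (28 former training + 21 former validation).
kf = KFold(n_splits=10, shuffle=True, random_state=0)

scores = []
for train_idx, test_idx in kf.split(X):
    model = LogisticRegression()
    model.fit(X[train_idx], y[train_idx])
    scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))

# Report the mean and spread across the 10 folds rather than a single split.
print("accuracy: %.3f +/- %.3f" % (np.mean(scores), np.std(scores)))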

lejlot
  • I updated my post: I use reduced word and n-gram spaces (up to 19k feature dimensions depending on which feature sets I use). Still, could you clarify a bit? Should the results not converge by "discarding" (reducing the weights of) irrelevant features or those not correlated with the output labels? Does what you are saying imply that I can use feature selection / dimensionality reduction to get more consistent results? My results are still at least equal to the baseline, which gets about 53% accuracy. – Pawit Sep 10 '16 at 18:49
  • What I am saying is "with that amount of data, nothing can help"; even estimating the score itself on 21 points is invalid, so you cannot even say whether your model is good or bad. The dimensionality of the data only makes it worse, but it does not change the fact that even if these points were in a 100-dimensional space, it would still be far too little data. With 49 points you could probably get some decent statistics with **2** or **3** features. With 10 you could probably say that "it roughly makes some sense". With more than 10 it is like chiromancy. – lejlot Sep 10 '16 at 19:41
  • It took me a while to understand that... Well, I am only building a prototype; I don't actually need _good_ results, I just need some way to test and tweak the code / model, and with these random results I cannot do that. So from what I understand, there are too many ways to fit the model to the training data, some of which work better on the validation data than others, right? – Pawit Sep 11 '16 at 11:40
  • The problem is that with such a small amount of data, you cannot **test**. The richness of the hypothesis set is simply way beyond what you can cover with your test samples, so whatever you get on your "test set" may have nothing to do with real data. In other words, you cannot draw **any conclusions** from such experiments; you will only build false assumptions on top of some arbitrary scores. Gather more data; this is the only way to proceed with statistical analysis. – lejlot Sep 11 '16 at 14:37