
I have written TensorFlow code based on:

http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/

but using precomputed word embeddings from the 300-dimensional GoogleNews word2vec model.

I created my own data from the UCI ML News Aggregator Dataset: I parsed the content of the news articles and created my own labels.

Due to the size of the articles, I use TF-IDF to keep only the top 120 words per article and embed each of those words as a 300-dimensional vector.
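
Roughly, that preprocessing step looks like the following (a simplified sketch of the idea, not my exact pipeline; the GoogleNews path is a placeholder, and the real code, createInputEmbeddedMatrix, is at the end of the post):

    import numpy as np
    import gensim

    # Load the pre-built gensim dictionary, TF-IDF model and GoogleNews vectors
    dictionary = gensim.corpora.Dictionary.load('news_word2vec_smallerDict.dict')
    tf_idf = gensim.models.TfidfModel.load('news_tfIdf_word2vec_All.tfidf_model')
    w2v = gensim.models.KeyedVectors.load_word2vec_format(
        'GoogleNews-vectors-negative300.bin', binary=True)

    def embed_top_k(tokens, k=120, dim=300):
        """Keep the k highest-TF-IDF words of one document and stack their embeddings."""
        bow = dictionary.doc2bow(tokens)
        weights = tf_idf[bow]                            # [(word_id, tfidf_weight), ...]
        top = sorted(weights, key=lambda x: -x[1])[:k]   # top-k words by TF-IDF weight
        mat = np.zeros((k, dim), dtype=np.float32)
        for row, (word_id, _) in enumerate(top):
            word = dictionary[word_id]
            if word in w2v:
                mat[row] = w2v[word]                     # rows stay 0 for OOV words
        return mat                                       # shape [k, 300]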

When I run the CNN I created, it converges to a low overall accuracy of around 38%, regardless of the hyperparameters.

Hyperparameters changed:

Various filter sizes:

I've tried single filters of size 1, 2, and 3, as well as combinations of filters [3,4,5] and [1,3,4].

Learning Rate:

I've varied this from very low to very high; very low rates don't even reach 38%, but anything between 0.0001 and 0.4 converges to it.

Batch Size:

Tried many values between 5 and 100.

Weight and Bias Initialization:

Set the stddev of the weights between 0.01 and 0.4 and the initial bias values between 0 and 0.1. Also tried the Xavier initializer for the conv2d weights.

Dataset Size:

I have only tried two partial data sets: one with 15,000 training examples and one with the 5,000 test examples. In total I have 263,000 examples to train on. Accuracy is the same whether I train and evaluate on the 15,000 training examples or use the 5,000 test examples as the training data (to save testing time).

I've run successful classifications on the 15,000 / 5,000 split using a feed-forward network with a BoW input (93% accuracy), TF-IDF with an SVM (92%), and TF-IDF with Naive Bayes (91.5%), so I don't think the data is the problem.
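
For reference, the TF-IDF + SVM baseline was along these lines (a rough scikit-learn sketch rather than the exact script; train_texts / test_texts etc. stand in for the same 15,000 / 5,000 article split):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC
    from sklearn.pipeline import make_pipeline
    from sklearn.metrics import accuracy_score

    # Fit TF-IDF features and a linear SVM on the training articles,
    # then score on the held-out test articles
    baseline = make_pipeline(TfidfVectorizer(), LinearSVC())
    baseline.fit(train_texts, train_labels)
    print(accuracy_score(test_labels, baseline.predict(test_texts)))  # ~0.92 on my split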

What does this imply? Is the model just a poor model for this task? Is there an error in my work?

I suspect my do_eval function, which computes the accuracy/loss over one epoch of the data, is incorrect:

        def do_eval(data_set,
                label_set,
                batch_size):
            """
            Runs one evaluation against a full epoch of data.
            data_set: the set of embeddings to evaluate
            label_set: the set of labels to evaluate
            """
            true_count = 0  # Counts the number of correct predictions.
            steps_per_epoch = len(label_set) // batch_size
            num_examples = steps_per_epoch * batch_size
            totalLoss = 0
            # Accumulate accuracy and loss over one full epoch, one batch at a time
            for evalStep in xrange(steps_per_epoch):
                input_batch, label_batch = nextBatch(data_set, label_set, batch_size)
                evalAcc, evalLoss = eval_step(input_batch, label_batch)
                true_count += evalAcc * batch_size
                totalLoss += evalLoss
            precision = float(true_count) / num_examples
            print('  Num examples: %d  Num correct: %d  Precision @ 1: %0.04f' % (num_examples, true_count, precision))
            print("Eval Loss: " + str(totalLoss))

The entire model is as follows:

class TextCNN(object):
    """
    A CNN for text classification.
    Uses a convolutional layer, max-pooling and a softmax output layer.
    """

    def __init__(
            self, batchSize, numWords, num_classes,
            embedding_size, filter_sizes, num_filters):

        # Set place holders
        self.input_placeholder = tf.placeholder(tf.float32,[batchSize,numWords,embedding_size,1])
        self.labels = tf.placeholder(tf.int32, [batchSize,num_classes])
        self.pKeep = tf.placeholder(tf.float32)

        # Inference
        '''
        Ready to build conv layers followed by max-pooling layers.
        Each conv layer produces a differently shaped output, so we loop over
        the filter sizes, create a layer for each, and then merge the results.
        '''
        pooled_outputs = []
        for i, filter_size in enumerate(filter_sizes):
            with tf.name_scope("conv-maxpool-%s" % filter_size):
                # Convolution Layer
                filter_shape = [filter_size, embedding_size, 1, num_filters]

                # W: Filter matrix
                W = tf.Variable(tf.truncated_normal(filter_shape,stddev=0.01), name='W')
                b = tf.Variable(tf.constant(0.0,shape=[num_filters]),name="b")


                # VALID padding: narrow convolution (no zero padding, so the filter only slides over full windows)
                # Output height = input_size (numWords here) + 2 * padding (0 here) - filter_size + 1
                conv = tf.nn.conv2d(
                    self.input_placeholder,
                    W,
                    strides=[1, 1, 1, 1],
                    padding="VALID",
                    name="conv")

                # Add the bias b to the convolution output Wx, i.e. Wx + b,
                # then run it through the ReLU activation function
                h = tf.nn.relu(tf.nn.bias_add(conv, b), name='relu')

                # Max-pooling over the outputs
                # Pool over the full output height so only the strongest
                # feature per filter is kept
                # ksize is the size of the pooling window over the input tensor
                pooled = tf.nn.max_pool(
                    h,
                    ksize=[1, numWords - filter_size + 1, 1, 1],
                    strides=[1, 1, 1, 1],
                    padding='VALID',
                    name="pool")

                # Each pooled output is a tensor of shape
                # [batchSize, 1, 1, num_filters], where num_filters is the
                # number of features we wanted pooled
                pooled_outputs.append(pooled)

        # Combine all pooled features
        num_filters_total = num_filters * len(filter_sizes)
        # Concatenate the pooled outputs along dimension 3 (the num_filters / feature dimension)
        self.h_pool = tf.concat(pooled_outputs, 3)
        # Flatten
        self.h_pool_flat = tf.reshape(self.h_pool, [-1, num_filters_total])

        # Add dropout for regularization
        with tf.name_scope("dropout"):
            self.h_drop = tf.nn.dropout(self.h_pool_flat,self.pKeep)

        # Fully connected output layer
        with tf.name_scope("output"):
            W = tf.Variable(tf.truncated_normal([num_filters_total,num_classes],stddev=0.01),name="W")
            b = tf.Variable(tf.constant(0.0,shape=[num_classes]), name='b')
            self.logits = tf.nn.xw_plus_b(self.h_drop, W, b, name='logits')
            self.predictions = tf.argmax(self.logits, 1, name='predictions')

        # Loss
        with tf.name_scope("loss"):
            losses = tf.nn.softmax_cross_entropy_with_logits(labels=self.labels,logits=self.logits, name="xentropy")
            self.loss = tf.reduce_mean(losses)

        # Accuracy
        with tf.name_scope("accuracy"):
            correct_predictions = tf.equal(self.predictions, tf.argmax(self.labels,1))
            self.accuracy = tf.reduce_mean(tf.cast(correct_predictions, "float"), name="accuracy")

##################################################################################################################
# Running the training
# Define various parameters for network

batchSize = 100
numWords = 120
embedding_size = 300
num_classes = 4
filter_sizes = [3,4,5] # slide over the given number of words at a time, i.e. 3 words, 4 words, etc.
num_filters = 126
maxSteps = 5000
initial_learning_rate = 0.001
dropoutRate = 1 # keep probability fed to pKeep (1 = no dropout)


data_set = np.load("/home/kevin/Documents/NSERC_2017/articles/classifyDataSet/TestSmaller_CNN_inputMat_0.npy")
labels_set = np.load("Test_NN_target_smaller.npy")


with tf.Graph().as_default():

    sess = tf.Session()

    with sess.as_default():
        cnn = TextCNN(batchSize=batchSize,
                      numWords=numWords,
                      num_classes=num_classes,
                      num_filters=num_filters,
                      embedding_size=embedding_size,
                      filter_sizes=filter_sizes)

        # Define training operation
        # Pick an optimizer, set its learning rate, and tell it what to minimize

        global_step = tf.Variable(0,name='global_step', trainable=False)
        optimizer = tf.train.AdamOptimizer(initial_learning_rate)
        grads_and_vars = optimizer.compute_gradients(cnn.loss)
        train_op = optimizer.apply_gradients(grads_and_vars, global_step=global_step)

        # Summaries to save for TensorBoard

        # Set directory
        out_dir = "/home/kevin/Documents/NSERC_2017/articles/classifyDataSet/tf_logs/CNN_Embedding/"

        # Loss and accuracy summaries
        loss_summary = tf.summary.scalar("loss",cnn.loss)
        acc_summary = tf.summary.scalar("accuracy", cnn.accuracy)

        # Train summaries
        train_summary_op = tf.summary.merge([loss_summary,acc_summary])
        train_summary_dir = out_dir + "train/"
        train_summary_writer = tf.summary.FileWriter(train_summary_dir, sess.graph)

        # Test summaries
        test_summary_op = tf.summary.merge([loss_summary, acc_summary])
        test_summary_dir = out_dir + "test/"
        test_summary_writer = tf.summary.FileWriter(test_summary_dir, sess.graph)

        # Init all variables

        init = tf.global_variables_initializer()
        sess.run(init)

    ############################################################################################

        def train_step(input_data, labels_data):
            '''
            Single training step
            :param input_data: input
            :param labels_data: labels to train to
            '''
            feed_dict = {
                cnn.input_placeholder: input_data,
                cnn.labels: labels_data,
                cnn.pKeep: dropoutRate
            }
            _, step, summaries, loss, accuracy = sess.run(
                [train_op, global_step, train_summary_op, cnn.loss, cnn.accuracy],
                feed_dict=feed_dict)
            train_summary_writer.add_summary(summaries, step)


    ###############################################################################################

        def eval_step(input_data, labels_data, writer=None):
            """
            Evaluates the model on a single batch of test data.
            """
            feed_dict = {
                cnn.input_placeholder: input_data,
                cnn.labels: labels_data,
                cnn.pKeep: 1.0
            }
            step, summaries, loss, accuracy = sess.run(
                [global_step, test_summary_op, cnn.loss, cnn.accuracy],
                feed_dict)
            if writer:
                writer.add_summary(summaries, step)
            return accuracy, loss

    ###############################################################################

        def nextBatch(data_set, labels_set, batchSize):
            '''
            Get the next batch of data
            :param data_set: entire training or test data set
            :param labels_set: entire training or test label set
            :param batchSize: batch size
            :return: a batch of the data and its corresponding labels
            '''
            # Generate random row indices for the documents
            rand_index = np.random.choice(data_set.shape[0], size=batchSize)

            # Grab the data to give to the feed dicts
            data_batch, labels_batch = data_set[rand_index, :, :], labels_set[rand_index, :]

            # Resize for tensorflow
            data_batch = data_batch.reshape([data_batch.shape[0],data_batch.shape[1],data_batch.shape[2],1])
            return data_batch, labels_batch
 ################################################################################

        def do_eval(data_set,
                label_set,
                batch_size):
            """
            Runs one evaluation against a full epoch of data.
            data_set: the set of embeddings to evaluate
            label_set: the set of labels to evaluate
            """
            true_count = 0  # Counts the number of correct predictions.
            steps_per_epoch = len(label_set) // batch_size
            num_examples = steps_per_epoch * batch_size
            totalLoss = 0
            # Accumulate accuracy and loss over one full epoch, one batch at a time
            for evalStep in xrange(steps_per_epoch):
                input_batch, label_batch = nextBatch(data_set, label_set, batch_size)
                evalAcc, evalLoss = eval_step(input_batch, label_batch)
                true_count += evalAcc * batch_size
                totalLoss += evalLoss
            precision = float(true_count) / num_examples
            print('  Num examples: %d  Num correct: %d  Precision @ 1: %0.04f' % (num_examples, true_count, precision))
            print("Eval Loss: " + str(totalLoss))

    ######################################################################################################
        # Training Loop

        for step in range(maxSteps):
            input_batch, label_batch = nextBatch(data_set, labels_set, batchSize)
            train_step(input_batch, label_batch)

            # Evaluate over the entire data set every 100 steps
            if step % 100 == 0:
                print "On Step : " + str(step) + " of " + str(maxSteps)
                do_eval(data_set, labels_set, batchSize)

The embedding matrix is created before the model runs:

def createInputEmbeddedMatrix(corpusPath, maxWords, svName):
    # Create a [docNum, Words per Art, Embedding Size] matrix to fill

    genDocsPath = "gen_docs_classifyData_smallerTest_TFIDF.npy"
    # corpus = "newsCorpus_word2vec_All_Corpus.mm"
    dictPath = 'news_word2vec_smallerDict.dict'
    tf_idf_path = "news_tfIdf_word2vec_All.tfidf_model"

    gen_docs = np.load(genDocsPath)
    dictionary = gensim.corpora.dictionary.Dictionary.load(dictPath)
    tf_idf = gensim.models.tfidfmodel.TfidfModel.load(tf_idf_path)

    corpus = corpora.MmCorpus(corpusPath)
    numOfDocs = len(corpus)
    embedding_size = 300

    id2embedding = np.load("smallerID2embedding.npy").item()

    # Need to process in batches as takes up a ton of memory

    step = 5000
    totalSteps = int(np.ceil(numOfDocs / step))

    for i in range(totalSteps):
        # inputMatrix = scipy.sparse.csr_matrix([step,maxWords,embedding_size])
        inputMatrix = np.zeros([step, maxWords, embedding_size])
        start = i * step
        end = start + step
        for docNum in range(start, end):
            print "On docNum " + str(docNum) + " of " + str(numOfDocs)
            # Extract the top N words
            topWords, wordVal = tf_idfTopWords(docNum, gen_docs, dictionary, tf_idf, maxWords)
            # doc = corpus[docNum]
            # Need to track the word index and doc index separately
            # (doc index because of the batch processing)
            wordDex = 0
            docDex = 0
            for wordID in wordVal:
                inputMatrix[docDex, wordDex, :] = id2embedding[wordID]
                wordDex += 1
            docDex += 1

        # Save the batch of input data
        # scipy.sparse.save_npz(svName + "_%d"  % i, inputMatrix)
        np.save(svName + "_%d.npy" % i, inputMatrix)


#####################################################################################
  • The wildml blog post you referenced is a binary classification problem, and it seems you are performing a multi-class and/or multi-label classification problem (e.g., one document could have multiple correct labels). The original `cnn.accuracy` metric definition and the loss function may not be suitable for your case. You can quickly check this by keeping only one label per doc and seeing if accuracy improves. – greeness Sep 19 '17 at 23:22
  • Btw, I don't understand your code below: `steps_per_epoch = len(label_set) // batch_size` `num_examples = steps_per_epoch * batch_size`. Why do we want to divide the number of labels to eval by the batch size? Can you please explain a little bit? – greeness Sep 19 '17 at 23:32
  • Thanks for your comments. I found the error in my data set creation function. I was resetting docDex to 0 when I shouldn't have been, and thus only wrote a single article's worth of data. As for the steps per epoch: the length of the labels (number of rows) is the total amount of data I have, so dividing it by the batch size tells me how many batches make up a full epoch. – Kevinj22 Sep 20 '17 at 00:12
  • As for the labels, they are one-hot vectors, i.e. [1, 0, 0, 0], so there is only one class per article. I believe the TensorFlow docs mention this is fine with softmax_cross_entropy_with_logits. As an alternative I could give just a numeric class index indicating which output node should be the max and use sparse_softmax_cross_entropy_with_logits, sketched below. – Kevinj22 Sep 20 '17 at 00:17
  • I see so you were training on only one article of data. .. thanks for the update. – greeness Sep 20 '17 at 00:41
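
To illustrate that sparse-label alternative, the loss would look roughly like this (a sketch only, not what the posted model uses; label_indices is a hypothetical placeholder holding integer class ids in the range 0 to num_classes - 1):

    # Sketch of the sparse-label alternative (integer class ids instead of one-hot labels)
    label_indices = tf.placeholder(tf.int32, [batchSize])
    losses = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=label_indices, logits=cnn.logits, name="xentropy")
    loss = tf.reduce_mean(losses)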

1 Answer


Turns out my error was in the creation of the input matrix.

for i in range(totalSteps):
    # inputMatrix = scipy.sparse.csr_matrix([step,maxWords,embedding_size])
    inputMatrix = np.zeros([step, maxWords, embedding_size])
    start = i * step
    end = start + step
    for docNum in range(start, end):
        print "On docNum " + str(docNum) + " of " + str(numOfDocs)
        # Extract the top N words
        topWords, wordVal = tf_idfTopWords(docNum, gen_docs, dictionary, tf_idf, maxWords)
        # doc = corpus[docNum]
        # Need to track the word index and doc index separately
        # (doc index because of the batch processing)
        wordDex = 0
        docDex = 0
        for wordID in wordVal:
            inputMatrix[docDex, wordDex, :] = id2embedding[wordID]
            wordDex += 1
        docDex += 1

docDex should not have been reset to 0 on every iteration of the document loop; I was effectively overwriting the first row of each batch's input matrix, so the rest of the rows were left as 0's.
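
The fix is to initialize docDex once per batch, outside the document loop (equivalently, docNum - start could be used directly as the row index). Trimmed of the commented-out lines, the corrected loop looks like this:

    for i in range(totalSteps):
        inputMatrix = np.zeros([step, maxWords, embedding_size])
        start = i * step
        end = start + step
        docDex = 0  # row within this batch: reset once per batch, not once per document
        for docNum in range(start, end):
            # Extract the top N words for this document
            topWords, wordVal = tf_idfTopWords(docNum, gen_docs, dictionary, tf_idf, maxWords)
            wordDex = 0
            for wordID in wordVal:
                inputMatrix[docDex, wordDex, :] = id2embedding[wordID]
                wordDex += 1
            docDex += 1  # advance to the next row for the next document
        # Save the batch of input data
        np.save(svName + "_%d.npy" % i, inputMatrix)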
