
This is a tough question, but I might as well try. I'm implementing the architecture from this paper https://arxiv.org/pdf/1503.08895.pdf for language modeling. See page 2 for a diagram, and the top of page 5 for the section on positional or "temporal" encoding. More on positional encoding can be found here, https://arxiv.org/pdf/1706.03762.pdf at the bottom of page 5/top of page 6. (I was directed to that second paper by the authors of the first.)
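For reference, here is roughly what the (non-trainable) sinusoidal encoding from the second paper looks like, if I'm reading it right; my T_A/T_C layers below are the learned alternative from the first paper, so this is just a sketch to show what "positional encoding" means in these papers, not part of my model:

import numpy as np

def sinusoidal_encoding(seq_len, embed_dim):
    # PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    pos = np.arange(seq_len)[:, np.newaxis]      # (seq_len, 1)
    dim = np.arange(embed_dim)[np.newaxis, :]    # (1, embed_dim)
    angles = pos / np.power(10000.0, (2 * (dim // 2)) / float(embed_dim))
    enc = np.zeros((seq_len, embed_dim))
    enc[:, 0::2] = np.sin(angles[:, 0::2])       # even dimensions get sine
    enc[:, 1::2] = np.cos(angles[:, 1::2])       # odd dimensions get cosine
    return enc                                   # shape (seq_len, embed_dim)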

So here's my keras implementation in a nutshell:

from keras.layers import Input, Embedding, Dense, Dropout, Dot, Add, Softmax, Layer
from keras.models import Model
from keras.optimizers import SGD
from keras.initializers import RandomNormal
# SEQ_LEN, EMBED_DIM, VOCAB_SIZE, NUM_LAYERS, DROPOUT_R, BATCH_SIZE are hyperparameters defined elsewhere

word_seq = Input(shape = (SEQ_LEN,), dtype = "int32", name = "word_seq")

query = Input(shape = (EMBED_DIM, ), dtype = "float32", name = "q_input")
#the query for lang. modeling is a constant vector filled with 0.1, as described at the bottom of page 7 in the first linked paper

T_A = Added_Weights(input_dim = (SEQ_LEN, EMBED_DIM))
#Added_Weights is a custom layer I wrote, which I'll post below
#These are the "positional encoding" components

T_C = Added_Weights(input_dim = (SEQ_LEN, EMBED_DIM))

Emb_A = Embedding(output_dim = EMBED_DIM, input_dim = VOCAB_SIZE, input_length = SEQ_LEN, name = "Emb_A")

Emb_C = Embedding(output_dim = EMBED_DIM, input_dim = VOCAB_SIZE, input_length = SEQ_LEN, name = "Emb_C")

int_state_weights = Dense(units = EMBED_DIM, activation = 'linear',
           kernel_initializer=RandomNormal(mean=0., stddev = 0.05, seed = None))

layer_output = query
#the loop uses the output from the previous layer as the query, but the first layer's query is just that constant vector

for i in range(0, NUM_LAYERS - 1):
    memories = Emb_A(word_seq) #these all re-use the weights instantiated earlier.

    memories = T_A(memories)

    memories = Dropout(DROPOUT_R)(memories)

    content = Emb_C(word_seq)

    content = T_C(content)

    mem_relevance = Dot(axes=[1, 2])([layer_output, memories])

    weighted_internal_state = int_state_weights(mem_relevance)

    mem_relevance = Softmax()(mem_relevance)

    content_relevance = Dot(axes=1)([mem_relevance, content])  # weight each piece of content by its probability of being relevant

    layer_output = Add()([content_relevance, weighted_internal_state])

    layer_output = Dropout(DROPOUT_R)(layer_output)

final_output = Dense(units = VOCAB_SIZE, activation ='relu',
                 kernel_initializer=RandomNormal(mean=0., stddev = 0.05, seed = None))(layer_output)

model = Model(inputs = [word_seq, query], outputs = final_output)
model.compile(optimizer = SGD(lr = 0.01, clipnorm = 50.), loss = 'categorical_crossentropy', metrics = ['accuracy'])
model.fit(x = [td_seqs, td_query], y = [td_labels],
      batch_size = BATCH_SIZE, callbacks = [lr_adjust, lr_termination, for_csv], epochs=200, verbose = 1)
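For context, the training arrays are prepared roughly like this (a sketch; tokenized_seqs and next_words are hypothetical placeholders for my integer-encoded inputs and target words):

import numpy as np
from keras.utils import to_categorical

# tokenized_seqs: (num_samples, SEQ_LEN) integer word indices -- hypothetical placeholder
# next_words:     (num_samples,) index of the word to predict  -- hypothetical placeholder
td_seqs = np.array(tokenized_seqs, dtype="int32")
td_query = np.full((len(td_seqs), EMBED_DIM), 0.1, dtype="float32")   # the constant 0.1 query vector
td_labels = to_categorical(next_words, num_classes=VOCAB_SIZE)        # one-hot targets for categorical_crossentropy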

BATCH_SIZE is currently 128. This went well on ~35,000 training samples BEFORE I added the T_A and T_C parts, ending at 96% accuracy. As soon as I implemented T_A and T_C (the positional encoding), training ended at around 10% accuracy and 5.2-ish training loss. I increased the training data by a factor of 10 and didn't see any real improvement. Here's my Added_Weights class:

class Added_Weights(Layer):

    def __init__(self, input_dim, **kwargs):
        super(Added_Weights, self).__init__(**kwargs)
        self.input_dim = input_dim

    def build(self, input_shape):
        # Create a trainable weight variable for this layer.
        self.kernel = self.add_weight(name='kernel',
                                      shape=(self.input_dim[0], self.input_dim[1]),
                                      initializer=RandomNormal(mean=0., stddev=0.05, seed=None),
                                      trainable=True)
        super(Added_Weights, self).build(input_shape)

    def call(self, x, **kwargs):
        return x + self.kernel

    def compute_output_shape(self, input_shape):
        return input_shape
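
As a quick sanity check on the layer itself (a minimal sketch with made-up toy sizes, not part of the model), the kernel does broadcast over the batch dimension the way I intended:

import numpy as np
from keras.layers import Input
from keras.models import Model

toy_seq_len, toy_embed_dim = 4, 3   # hypothetical toy sizes just for this check
inp = Input(shape=(toy_seq_len, toy_embed_dim))
out = Added_Weights(input_dim=(toy_seq_len, toy_embed_dim))(inp)
toy_model = Model(inputs=inp, outputs=out)

x = np.zeros((2, toy_seq_len, toy_embed_dim), dtype="float32")
print(toy_model.predict(x).shape)   # (2, 4, 3) -- each sample gets the same randomly initialized kernel added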

I am agonizing over why this won't work, after reading both of these awesome papers explicitly stating that it SHOULD work. If anyone can manage to help with this, that would be amazing.

  • I don't see how this implements positional encoding mentioned in the paper. You are just adding weights. – nuric May 17 '18 at 21:16
  • That could very well be the problem, but what exactly is different in the papers? Both of them sound like they're just adding weights. – Sean Paulsen May 18 '18 at 21:54
