
There are a few parameters in the config below that I don't fully understand, and I run into errors particularly when I change max_len, hidden_size or embedding_size.

config = {
    "max_len": 64,
    "hidden_size": 64,
    "vocab_size": vocab_size,
    "embedding_size": 128,
    "n_class": 15,
    "learning_rate": 1e-3,
    "batch_size": 32,
    "train_epoch": 20
}

I get an error:

"ValueError: Cannot feed value of shape (32, 32) for Tensor 'Placeholder:0', which has shape '(?, 64)'"

The TensorFlow graph below is what I have trouble understanding. Is there a way to work out how max_len, hidden_size and embedding_size need to relate to each other (and to the input data) so that I avoid the error above?

        embeddings_var = tf.Variable(tf.random_uniform([self.vocab_size, self.embedding_size], -1.0, 1.0),
                                     trainable=True)
        batch_embedded = tf.nn.embedding_lookup(embeddings_var, self.x)
        # multi-head attention
        ma = multihead_attention(queries=batch_embedded, keys=batch_embedded)
        # FFN(x) = LN(x + point-wisely NN(x))
        outputs = feedforward(ma, [self.hidden_size, self.embedding_size])
        outputs = tf.reshape(outputs, [-1, self.max_len * self.embedding_size])
        logits = tf.layers.dense(outputs, units=self.n_class)

        self.loss = tf.reduce_mean(
            tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=self.label))
        self.prediction = tf.argmax(tf.nn.softmax(logits), 1)

        # optimization
        loss_to_minimize = self.loss
        tvars = tf.trainable_variables()
        gradients = tf.gradients(loss_to_minimize, tvars, aggregation_method=tf.AggregationMethod.EXPERIMENTAL_TREE)
        grads, global_norm = tf.clip_by_global_norm(gradients, 1.0)

        self.global_step = tf.Variable(0, name="global_step", trainable=False)
        self.optimizer = tf.train.AdamOptimizer(learning_rate=self.learning_rate)
        self.train_op = self.optimizer.apply_gradients(zip(grads, tvars), global_step=self.global_step,
                                                       name='train_step')
        print("graph built successfully!")
HumanTorch

1 Answer


max_len is the length (in tokens) of the longest sentence/document in your training set. It is the second dimension of your input tensor (the first being the batch dimension).

Each sentence will be padded (or truncated) to this length. Attention models need a predefined maximum length because each token position gets its own attention weight.
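
Concretely, the error in the question says a batch of shape (32, 32) was fed into a placeholder of shape (?, 64): the data was padded to only 32 tokens while max_len is 64, so either pad to 64 or set max_len to the length you actually pad to. A minimal padding sketch (the numpy helper and its names are my own illustration, not part of the question's code):

import numpy as np

MAX_LEN = 64  # must match config["max_len"], i.e. the placeholder shape (?, 64)

def pad_batch(token_id_lists, max_len=MAX_LEN, pad_id=0):
    # Pad (or truncate) every sequence of token ids to exactly max_len.
    batch = np.full((len(token_id_lists), max_len), pad_id, dtype=np.int32)
    for i, ids in enumerate(token_id_lists):
        ids = ids[:max_len]            # drop tokens beyond max_len
        batch[i, :len(ids)] = ids      # left-align tokens, the rest stays padding
    return batch

x_batch = pad_batch([[5, 8, 2], [7, 1, 9, 4]])
print(x_batch.shape)  # (2, 64) -- the second dimension now matches the placeholder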

hidden_size is the size of the hidden RNN cell, i.e. the width of what is output at each time step; it can be set to more or less anything. (In your graph there is no RNN: it is used as the inner width of the feedforward block.)

embedding_size defines the dimensionality of the token representation (e.g. 300 is standard for word2vec, 1024 for large BERT embeddings, etc.).
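
To see how the three parameters interact in this particular graph, here is a stripped-down shape sketch. It replaces multihead_attention and feedforward with plain dense layers that keep the same output shapes, so it is only an approximation of the question's code, not a drop-in replacement:

import tensorflow as tf  # TF 1.x style, as in the question

max_len, hidden_size, embedding_size = 64, 64, 128
vocab_size, n_class = 10000, 15  # vocab_size is just an example value here

x = tf.placeholder(tf.int32, [None, max_len])                     # (?, 64) token ids
emb = tf.Variable(tf.random_uniform([vocab_size, embedding_size], -1.0, 1.0))
h = tf.nn.embedding_lookup(emb, x)                                # (?, 64, 128)
# stand-in for multihead_attention + feedforward: keeps (?, max_len, embedding_size)
h = tf.layers.dense(tf.layers.dense(h, hidden_size, tf.nn.relu), embedding_size)
h = tf.reshape(h, [-1, max_len * embedding_size])                 # (?, 64 * 128) = (?, 8192)
logits = tf.layers.dense(h, n_class)                              # (?, 15)

hidden_size and embedding_size only change the sizes of weights inside the graph, so you can vary them freely; max_len is the only one that has to agree with how the input batches are padded.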

Szymon Maszke
  • How would one input an embedded vector output with a (,1024) shape taken from BERT into this attention-based model aside from just changing the embedding_size parameter? – HumanTorch Apr 16 '19 at 11:36
  • It should be enough, I suppose; if you encounter an error during this operation, open a new issue, as I'm on mobile and can't test code right now. – Szymon Maszke Apr 16 '19 at 11:53
  • Added a new issue here https://stackoverflow.com/questions/55709025/how-do-i-pass-bert-embeddings-into-an-attention-based-model @Szymon Maszke – HumanTorch Apr 16 '19 at 13:12