
Refer to this post for the background of this problem: Does the TensorFlow embedding_attention_seq2seq method implement a bidirectional RNN Encoder by default?

I am working on the same model and want to replace the unidirectional LSTM layer with a bidirectional layer. I realize I have to use static_bidirectional_rnn instead of static_rnn, but I am getting an error due to a mismatch in tensor shapes.

I replaced the following line:

encoder_outputs, encoder_state = core_rnn.static_rnn(encoder_cell, encoder_inputs, dtype=dtype)

with the line below:

encoder_outputs, encoder_state_fw, encoder_state_bw = core_rnn.static_bidirectional_rnn(encoder_cell, encoder_cell, encoder_inputs, dtype=dtype)

That gives me the following error:

InvalidArgumentError (see above for traceback): Incompatible shapes: [32,5,1,256] vs. [16,1,1,256] [[Node: gradients/model_with_buckets/embedding_attention_seq2seq/embedding_attention_decoder/attention_decoder/Attention_0/add_grad/BroadcastGradientArgs = BroadcastGradientArgs[T=DT_INT32, _device="/job:localhost/replica:0/task:0/cpu:0"](gradients/model_with_buckets/embedding_attention_seq2seq/embedding_attention_decoder/attention_decoder/Attention_0/add_grad/Shape, gradients/model_with_buckets/embedding_attention_seq2seq/embedding_attention_decoder/attention_decoder/Attention_0/add_grad/Shape_1)]]

I understand that the outputs of the two methods are different, but I do not know how to modify the attention code to account for that. How do I send both the forward and backward states to the attention module? Do I concatenate both the hidden states?

Daniel

1 Answer


From the error message, I can see that the shapes of two tensors somewhere don't match: one batch dimension is 32 and the other is 16. I suppose this is because the outputs of the bidirectional RNN are twice the size of those of the unidirectional one, and the code that follows is not adjusted for that.
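As a rough illustration of that size difference, here is a minimal sketch (hypothetical sizes, using the tf.contrib.rnn API of TF 1.x rather than the core_rnn import in your code): each output of the bidirectional RNN is the forward and backward outputs concatenated along the last axis, so it is twice as wide as a unidirectional output.

    import tensorflow as tf

    batch_size, num_steps, input_dim, cell_size = 16, 5, 64, 256
    inputs = [tf.placeholder(tf.float32, [batch_size, input_dim])
              for _ in range(num_steps)]

    with tf.variable_scope("uni"):
        cell = tf.contrib.rnn.BasicLSTMCell(cell_size)
        outputs, state = tf.contrib.rnn.static_rnn(cell, inputs, dtype=tf.float32)
        print(outputs[0].shape)  # (16, 256)

    with tf.variable_scope("bi"):
        cell_fw = tf.contrib.rnn.BasicLSTMCell(cell_size)
        cell_bw = tf.contrib.rnn.BasicLSTMCell(cell_size)
        outputs, state_fw, state_bw = tf.contrib.rnn.static_bidirectional_rnn(
            cell_fw, cell_bw, inputs, dtype=tf.float32)
        print(outputs[0].shape)  # (16, 512): fw and bw outputs concatenated

Since embedding_attention_seq2seq builds its attention_states from the encoder outputs, the attention/decoder side has to account for that doubled size, or the encoder state has to be projected back down, as in the snippet below.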

How do I send both the forward and backward states to the attention module? Do I concatenate both the hidden states?

You can refer to this code:

  def _reduce_states(self, fw_st, bw_st):
    """Add to the graph a linear layer to reduce the encoder's final FW and BW state into a single initial state for the decoder. This is needed because the encoder is bidirectional but the decoder is not.
    Args:
      fw_st: LSTMStateTuple with hidden_dim units.
      bw_st: LSTMStateTuple with hidden_dim units.
    Returns:
      state: LSTMStateTuple with hidden_dim units.
    """
    hidden_dim = self._hps.hidden_dim
    with tf.variable_scope('reduce_final_st'):

      # Define weights and biases to reduce the cell and reduce the state
      w_reduce_c = tf.get_variable('w_reduce_c', [hidden_dim * 2, hidden_dim], dtype=tf.float32, initializer=self.trunc_norm_init)
      w_reduce_h = tf.get_variable('w_reduce_h', [hidden_dim * 2, hidden_dim], dtype=tf.float32, initializer=self.trunc_norm_init)
      bias_reduce_c = tf.get_variable('bias_reduce_c', [hidden_dim], dtype=tf.float32, initializer=self.trunc_norm_init)
      bias_reduce_h = tf.get_variable('bias_reduce_h', [hidden_dim], dtype=tf.float32, initializer=self.trunc_norm_init)

      # Apply linear layer
      old_c = tf.concat(axis=1, values=[fw_st.c, bw_st.c]) # Concatenation of fw and bw cell
      old_h = tf.concat(axis=1, values=[fw_st.h, bw_st.h]) # Concatenation of fw and bw state
      new_c = tf.nn.relu(tf.matmul(old_c, w_reduce_c) + bias_reduce_c) # Get new cell from old cell
      new_h = tf.nn.relu(tf.matmul(old_h, w_reduce_h) + bias_reduce_h) # Get new state from old state
      return tf.contrib.rnn.LSTMStateTuple(new_c, new_h) # Return new cell and state
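In your setting, the encoder_state_fw and encoder_state_bw returned by static_bidirectional_rnn would play the roles of fw_st and bw_st here (assuming the encoder cell is an LSTM cell with state_is_tuple=True, so the final states are LSTMStateTuples), and the returned state is what you would pass on as the decoder's initial state.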
Lerner Zhang
  • This seems to be what I was looking for. Let me try it and update if this works. Thank you. – Leena Shekhar Jul 16 '17 at 15:09
  • This seems to work, but I have a question: Why can't I simply double the size of the decoding cell rather than projecting the encoding cell states to half the size? I see that this will reduce the number of parameters in the model, but am I not going to lose information due to the projection that I am doing? – Leena Shekhar Jul 18 '17 at 16:42
  • @LeenaShekhar Doubling the size of the decoder cell is also practical. Here it is better to merge the two states of the bidirectional encoder into one (so that the encoder and decoder have the same cell size and you avoid the shape error), which is done by performing a projection like the one above separately for c and h; a sketch of both options follows these comments. – Lerner Zhang Nov 02 '17 at 06:04
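For concreteness, here is a minimal sketch (hypothetical names and sizes, TF 1.x) of the two options discussed in the comments above: (a) keep the decoder at hidden_dim and project the concatenated forward/backward state down with something like _reduce_states, or (b) build the decoder cell with 2 * hidden_dim so the concatenated [fw; bw] state can be used directly.

    import tensorflow as tf

    hidden_dim = 256

    # Option (a): decoder cell of size hidden_dim; the bidirectional encoder's
    # final states are projected down to hidden_dim (as in _reduce_states above).
    decoder_cell_a = tf.contrib.rnn.LSTMCell(hidden_dim)

    # Option (b): decoder cell of size 2 * hidden_dim; its initial state is simply
    # the concatenation of the forward and backward final states.
    decoder_cell_b = tf.contrib.rnn.LSTMCell(2 * hidden_dim)

    def concat_states(fw_st, bw_st):
        """Concatenate fw/bw LSTMStateTuples for a 2 * hidden_dim decoder (option b)."""
        return tf.contrib.rnn.LSTMStateTuple(
            tf.concat([fw_st.c, bw_st.c], axis=1),
            tf.concat([fw_st.h, bw_st.h], axis=1))

Option (b) keeps all of the encoder's final-state information at the cost of a larger decoder, while option (a) keeps the decoder small but introduces a learned projection.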