I am a bit confused about the use of the function tf.matmul() in TensorFlow. My question might be more about the theory of deep learning, though. Say you have an input X and a weight matrix W (assuming zero bias); I want to compute WX as the output, which could be done with tf.matmul(W, X). However, in the tutorial MNIST for beginners this is reversed and tf.matmul(X, W) is used instead. On the other hand, in the next tutorial, TensorFlow Mechanics 101, tf.matmul(W, X) is used. Since the matrix sizes matter for multiplication, I wonder if someone can clarify this issue.

3 Answers
I think you must be misreading the mechanics 101 tutorial - or could you point to the specific line?
In general, for a network layer, I think of the inputs "flowing through" the weights. To represent that, I write tf.matmul(Inputs, Weights) to produce the output of that layer. That output may then have a bias b added to it, and the result of that fed into a nonlinear function such as a relu, and then into another tf.matmul as the input for the next layer.
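As a rough sketch of that flow (the shapes and numbers below are made up for illustration, not taken from either tutorial), one such layer in TensorFlow might look like:
import tensorflow as tf

x  = tf.constant([[1., 2., 3., 4., 5.]])        # inputs, shape [1, 5]
W1 = tf.constant([[.5, .6], [.7, .8], [.9, .1],
                  [.2, .3], [.4, .5]])          # weights, shape [5, 2]
b1 = tf.constant([0.1, 0.2])                    # bias, shape [2]

hidden = tf.nn.relu(tf.matmul(x, W1) + b1)      # inputs flow through the weights -> shape [1, 2]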
Second, remember that the Weights matrix may be sized to produce multiple outputs. That's why it's a matrix, not just a vector. For example, if you wanted two hidden units and you had five input features, you would use a shape [5, 2] weight matrix, like this (shown in numpy for ease of exposition - you can do the same thing in tensorflow):
import numpy as np
a = np.array([1, 2, 3, 4, 5])
W = np.array([[.5, .6], [.7, .8], [.9, .1], [.2, .3], [.4, .5]])
>>> np.dot(a, W)
array([ 7.4, 6.2])
This has the nice behavior that if you then add a batch dimension to a, it still works:
a = np.array([[1, 2, 3, 4, 5],
              [6, 7, 8, 9, 10]])
>>> np.dot(a, W)
array([[  7.4,   6.2],
       [ 20.9,  17.7]])
This is exactly what you're doing when you use tf.matmul to go from input features to hidden units, or from one layer of hidden units to another.
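The same batched computation sketched in TensorFlow, with the same made-up numbers as the numpy example above:
import tensorflow as tf

a = tf.constant([[1., 2., 3., 4., 5.],
                 [6., 7., 8., 9., 10.]])   # shape [2, 5]: a batch of two examples
W = tf.constant([[.5, .6], [.7, .8], [.9, .1],
                 [.2, .3], [.4, .5]])      # shape [5, 2]

out = tf.matmul(a, W)                      # shape [2, 2], same values as np.dot(a, W)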

- Thanks for the answer but I am still confused. We need to compute Weights * Inputs, so why not tf.matmul(Weights, Inputs)? tf.matmul(a, W) produces a * W instead of W * a. – sergulaydore Jul 01 '16 at 10:47
- I think of it this way: Imagine you have 5 activations coming in to your weight matrix, and you want there to be 2 outputs from this computation. Your "input size" to the layer is 5, and your "output" size from the layer is 2. Further, you have a batch size B. I find a natural representation of this is that your input is `[B, 5]` with the first dimension being the batch. If you set up your weight matrix as a `[5x2]` matrix, then you can multiply any batch size in: `[B x 5] * [5 x 2] --> [B, 2]`. You could, of course, transpose both matrices and multiply `W_t*a_t`. – dga Jul 02 '16 at 14:46
- Unfortunately, as discussed here: http://stackoverflow.com/a/34908326/281545 np.dot does not match tf.matmul semantics - in particular both operands must be matrices. Any workarounds? – Mr_and_Mrs_D Apr 07 '17 at 14:43
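Regarding that last comment: one possible workaround (my own sketch, not something from this thread) is to promote the vector to a rank-2 tensor before calling tf.matmul, for example with tf.expand_dims:
import tensorflow as tf

a = tf.constant([1., 2., 3., 4., 5.])           # shape [5], a plain vector
W = tf.constant([[.5, .6], [.7, .8], [.9, .1],
                 [.2, .3], [.4, .5]])           # shape [5, 2]

out = tf.matmul(tf.expand_dims(a, 0), W)        # make a a [1, 5] matrix first -> shape [1, 2]
out = tf.squeeze(out)                           # back to shape [2] if you want a vector again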
I don't know much about TensorFlow, but intuitively I feel that the confusion is about the data representation of the input. When you say you want to multiply an input X with a weight W, I think what you mean is that you want to multiply each dimension (feature) by its corresponding weight and take the sum. So if you have an input x with, say, m dimensions, you should have a weight vector w with m values (m+1 if you consider the bias).
Now, if you choose to represent the different training instances as rows of a matrix X, you would have to perform X * w; if instead you choose to represent them as columns, you would do w^T * X.
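A small numpy sketch of those two layouts (toy numbers of my own, not from the question):
import numpy as np

X_rows = np.arange(12.).reshape(3, 4)     # 3 instances as rows, 4 features each
w = np.array([.1, .2, .3, .4])            # one weight per feature
out_rows = np.dot(X_rows, w)              # X * w   -> shape (3,), one output per instance

X_cols = X_rows.T                         # instances as columns, shape (4, 3)
out_cols = np.dot(w, X_cols)              # w^T * X -> shape (3,); the transpose is implicit for a 1-D array

print(np.allclose(out_rows, out_cols))    # True: same numbers, different layout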

- If you want multiple training instances in a batch, you need to use `tf.batch_matmul`, which internally treats the first dimension as a batch dimension. Remember that weights can be a matrix, not a vector: You may produce multiple outputs based upon different weights of the input features. I've updated my answer to point this out. – dga Dec 16 '15 at 18:19
- Thanks @jMathew. I think you are right. I was assuming that the input should be represented as (n_Features x n_Samples) but it seems to be the other way around in most of the examples. dga, it has nothing to do with whether or not W is a vector or we are feeding batches. – sergulaydore Jul 01 '16 at 10:55
Elaborating more on the answer given by @jMathew.
It all depends on how you represent your feature vector x:
- Representation such that rows = instances and columns = features: Here, let's say the dimension of x is m x n; then we have m instances and n features. For this, the weight matrix W would be of shape n x z, where z is the number of neurons in that layer. To multiply x and W we have to do x * W so that the shapes of the two operands match for a legal matrix multiplication operation to happen.
- Representation such that rows = features and columns = instances: Here, we will have to do W^T * x in order to ensure that the shapes of the two operands match for a legal matrix multiplication operation to happen. (Both cases are checked in the small sketch below.)
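A quick shape check for both cases, with toy sizes I picked (m = 3 instances, n = 4 features, z = 2 neurons):
import numpy as np

m, n, z = 3, 4, 2
x = np.random.rand(m, n)                # rows = instances, columns = features
W = np.random.rand(n, z)                # one column of weights per neuron

h_rows = np.dot(x, W)                   # case 1: x * W   -> shape (m, z)
h_cols = np.dot(W.T, x.T)               # case 2: W^T * x -> shape (z, m)

print(np.allclose(h_rows, h_cols.T))    # True: same activations, transposed layout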
