This question is a tough one: how can I feed a neural network a dynamic input?

Answering this question would certainly help advance modern deep learning into applications beyond computer vision and speech recognition. I will explain the problem further for readers who are new to neural networks.

Let's take this simple example for instance:

Say you need to know the probability of winning, losing or drawing in a game of "tic-tac-toe".

So my input could be a [3,3] matrix representing the state (1-You, 2-Enemy, 0-Empty):

[2. 1. 0.]  
[0. 1. 0.] 
[2. 2. 1.]

Let's assume we already have a previously trained hidden layer, a [3,1] matrix of weights:

[1.5]  
[0.5]  
[2.5]

So if our layer simply performs a matrix multiplication of the two, y(x) = x·W (skipping any activation function), we get this [3,1] matrix as the output:

[2. 1. 0.]     [1.5]     [3.5]
[0. 1. 0.]  *  [0.5]  =  [0.5]
[2. 2. 1.]     [2.5]     [6.5]

Even without a softmax function, you can tell that the highest score (and therefore the highest probability) corresponds to a draw.
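
To make this concrete, here is the same forward pass as a small NumPy sketch (the weight values are just the illustrative ones above, not real trained weights):

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    x = np.array([[2., 1., 0.],
                  [0., 1., 0.],
                  [2., 2., 1.]])   # board state: 1-You, 2-Enemy, 0-Empty
    W = np.array([[1.5],
                  [0.5],
                  [2.5]])          # the illustrative "trained" weights

    y = x @ W                      # (3,3) @ (3,1) -> (3,1)
    print(y.ravel())               # [3.5 0.5 6.5]
    print(softmax(y.ravel()))      # the draw gets by far the highest probability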

But what if I want this same neural network to work for a 5x5 game of tic-tac-toe?

It has the same logic as the 3x3 game, it's just bigger, so the neural network should be able to handle it.

We would have something like:

[2. 1. 0. 2. 0.]
[0. 2. 0. 1. 1.]     [1.5]     [?]
[2. 1. 0. 0. 1.]  *  [0.5]  =  [?]                           IMPOSSIBLE
[0. 0. 2. 2. 1.]     [2.5]     [?]
[2. 1. 0. 2. 0.]

But this multiplication is impossible to compute: a [5,5] matrix cannot be multiplied by a [3,1] matrix. We would have to expand the first layer and/or add new ones, and then RETRAIN the network, because the untrained weights (initialized with 0 in this case) would make it fail, like so:

     input            1st Layer        output1
[2. 1. 0. 2. 0.]     [0.  0. 0.]     [6.5 0. 0.]
[0. 2. 0. 1. 1.]     [1.5 0. 0.]     [5.5 0. 0.]
[2. 1. 0. 0. 1.]  *  [0.5 0. 0.]  =  [1.5 0. 0.]
[0. 0. 2. 2. 1.]     [2.5 0. 0.]     [6.  0. 0.]
[2. 1. 0. 2. 0.]     [0.  0. 0.]     [6.5 0. 0.]

   2nd Layer           output1      final output
                     [6.5 0. 0.]
                     [5.5 0. 0.]
[0. 0. 0. 0. 0.]  *  [1.5 0. 0.]  =  [0. 0. 0.]                POSSIBLE
                     [6.  0. 0.]
                     [6.5 0. 0.]

Because we expanded the first layer with zeros and added a new layer of zero weights, our result is obviously inconclusive. If we apply a softmax function, we see that the neural network returns a 33.3% chance for every possible outcome. We would need to train it again.
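
A quick NumPy check of this failure mode (the padded weights are exactly those shown above; the original [3,1] weights simply cannot multiply a [5,5] input, which is the "IMPOSSIBLE" step):

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    x = np.array([[2., 1., 0., 2., 0.],
                  [0., 2., 0., 1., 1.],
                  [2., 1., 0., 0., 1.],
                  [0., 0., 2., 2., 1.],
                  [2., 1., 0., 2., 0.]])

    # x @ np.array([[1.5], [0.5], [2.5]])  -> ValueError: (5,5) vs (3,1)

    # Expanded 1st layer: the trained column padded with zeros, plus two
    # untrained zero columns, as in the example above.
    w1 = np.array([[0.0, 0., 0.],
                   [1.5, 0., 0.],
                   [0.5, 0., 0.],
                   [2.5, 0., 0.],
                   [0.0, 0., 0.]])
    w2 = np.zeros((1, 5))        # new, untrained 2nd layer

    out1 = x @ w1                # only the first column is non-zero
    final = (w2 @ out1).ravel()  # [0. 0. 0.]
    print(softmax(final))        # [0.333 0.333 0.333] -> uninformative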

Obviously we want to create generic neural networks that can adapt to different input sizes, but I haven't found a solution to this problem yet! So I thought maybe Stack Overflow could help; thousands of heads think better than one. Any ideas?

  • As with image-based approaches, you would train one fixed-size NN and pre-scale your input (like image resizing). This is so common that it's likely the best working approach for now. Also keep in mind that the 1/2 coding for players 1 and 2 is far from the best approach (-1/1, or even multiple planes, would be better). One more remark, since it seems you don't have much experience with NNs and game AIs: the 3-value output is questionable. Normally there are two approaches: a value network or a policy network. The former outputs ONE value describing the score (from the current player's perspective), the latter a PDF over moves. – sascha May 12 '16 at 16:46
  • Yeah, but there are examples like this tic-tac-toe one where we can't resize the input without losing information. That works well only for images, where a few pixels won't matter. We need a better way to do this. As for the remarks about the example: yes, it probably isn't the best option; I just pieced it together quickly to help explain. – Gabriel Dias Rezende Martins May 12 '16 at 17:38

1 Answer

There are solutions for Convolutional Neural Networks apart from just resizing the input to a fixed size.

Spatial Pyramid Pooling allows you to train and test CNNs with variable-sized images. It does this by introducing a dynamic pooling layer, where the input can be of any size and the output is of a fixed size, which can then be fed to the fully connected layers.

The pooling is very simple: one defines a number of regions in each dimension (say 7x7), and then the layer splits each feature map into a 7x7 grid of non-overlapping regions and does max-pooling on each region, outputting a 49-element vector. This can also be applied at multiple scales.
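
A minimal sketch of one such pooling level in NumPy (the 7x7 bin count is just the illustrative value above, spatial_pyramid_pool is a made-up helper name, and real SPP concatenates several bin sizes):

    import numpy as np

    def spatial_pyramid_pool(feature_map, bins=7):
        # Max-pool an H x W feature map into a fixed bins x bins grid.
        # Assumes H >= bins and W >= bins; region boundaries are computed
        # with linspace here for brevity (the SPP paper uses ceil/floor
        # window sizes, which amounts to the same idea).
        h, w = feature_map.shape
        rows = np.linspace(0, h, bins + 1).astype(int)
        cols = np.linspace(0, w, bins + 1).astype(int)
        out = np.empty((bins, bins))
        for i in range(bins):
            for j in range(bins):
                out[i, j] = feature_map[rows[i]:rows[i+1],
                                        cols[j]:cols[j+1]].max()
        return out.ravel()       # always a (bins*bins,) vector

    for size in (9, 13):         # any input size >= bins works
        fmap = np.random.rand(size, size)
        print(spatial_pyramid_pool(fmap).shape)   # (49,) both times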

– Dr. Snoopy
  • I have used max-pooling before while training image classification CNNs, but the thing is, there are cases where we require ALL the data from the input... we can't just max-pool an entire region and lose lots of data... For instance: each number in the "tic-tac-toe" problem above is crucial for its classification. If I simply remove some numbers, the state would be completely different, and ruined... – Gabriel Dias Rezende Martins May 12 '16 at 18:11
  • @GabrielDiasRezendeMartins Max-Pooling is performed on convolutional feature maps, not on the input. The network learns how to interpret such pooling outputs. – Dr. Snoopy May 12 '16 at 18:15
  • And what would the feature maps be for an example like this one? I know that's how it works for images, but this is only a state matrix, and each number counts. – Gabriel Dias Rezende Martins May 12 '16 at 18:42
  • @GabrielDiasRezendeMartins An image is just a matrix; the feature maps are learned (through filters) by the Convolutional Neural Network. This kind of network can be applied to any kind of matrix input that has local structure. You have to try it to see if it works. – Dr. Snoopy May 12 '16 at 18:50
  • Are you implying that I apply convolution to the state, into a 3rd dimension, and then max-pool it into a fixed shape? I hardly think it would work. Let's say the 5x5 turns into a 3x3. How can the NN know that this 3x3 state actually represents a 5x5 instead of just being a 3x3? This 3x3 matrix of integers will just be processed as if it were a 3x3 to begin with; the original state is lost. If we take the 4 cells in any corner of the 5x5 (e.g. 0, 1 and 1, 0) and turn them into a single cell, the corner of a 3x3 (e.g. 1), all that info is lost and it is now just a single play instead of 4. – Gabriel Dias Rezende Martins May 12 '16 at 19:03
  • The network learns its own representation of the state; you cannot predict beforehand what it will learn, because that depends entirely on the training data. Just try it: it works pretty well for image classification and other tasks, and yours isn't very different. – Dr. Snoopy May 12 '16 at 21:59
  • @GabrielDiasRezendeMartins You don't turn a 5x5 matrix into a 3x3 matrix with a convolution + max-pooling. If you did, information would inevitably be lost, just as you say. Instead, you always create multiple channels for every convolution. For example, if you use 16 channels, a 5x5 matrix turns into a 3x3x16 tensor. The network will learn how to represent the 5x5 matrix with multiple (16) instances of the reduced 3x3 state. – BlueSun Apr 23 '17 at 15:06
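
A shape-only NumPy sketch of what BlueSun describes (the filter values are random stand-ins for learned weights):

    import numpy as np

    rng = np.random.default_rng(0)
    state = rng.integers(0, 3, size=(5, 5)).astype(float)  # a 5x5 board
    filters = rng.standard_normal((16, 3, 3))              # 16 learned 3x3 filters

    # 'valid' convolution: each 3x3 filter slides over the 5x5 board,
    # producing one 3x3 feature map; 16 filters give a 3x3x16 tensor.
    out = np.empty((3, 3, 16))
    for k in range(16):
        for i in range(3):
            for j in range(3):
                out[i, j, k] = np.sum(state[i:i+3, j:j+3] * filters[k])
    print(out.shape)   # (3, 3, 16)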