I'm trying to create a simple LSTM using DeepLearning4J, with 2 input features and a time series length of 1. However, I'm running into a strange issue: after training the network, feeding in test data yields the same, arbitrary result regardless of the input values. My code is shown below.

(UPDATED)

import org.deeplearning4j.nn.api.OptimizationAlgorithm;
import org.deeplearning4j.nn.conf.BackpropType;
import org.deeplearning4j.nn.conf.MultiLayerConfiguration;
import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
import org.deeplearning4j.nn.conf.layers.LSTM;
import org.deeplearning4j.nn.conf.layers.RnnOutputLayer;
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.deeplearning4j.nn.weights.WeightInit;
import org.nd4j.linalg.activations.Activation;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.dataset.DataSet;
import org.nd4j.linalg.dataset.api.preprocessor.DataNormalization;
import org.nd4j.linalg.dataset.api.preprocessor.NormalizerMinMaxScaler;
import org.nd4j.linalg.factory.Nd4j;
import org.nd4j.linalg.learning.config.Adam;
import org.nd4j.linalg.lossfunctions.LossFunctions;

public class LSTMRegression {
    public static final int inputSize = 2,
                            lstmLayerSize = 4,
                            outputSize = 1;
    
    public static final double learningRate = 0.0001;

    public static void main(String[] args) {
        int miniBatchSize = 99;
        
        MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
                .miniBatch(false)
                .optimizationAlgo(OptimizationAlgorithm.STOCHASTIC_GRADIENT_DESCENT)
                .updater(new Adam(learningRate))
                .list()
                .layer(0, new LSTM.Builder().nIn(inputSize).nOut(lstmLayerSize)
                        .weightInit(WeightInit.XAVIER)
                        .activation(Activation.TANH).build())
//                .layer(1, new LSTM.Builder().nIn(lstmLayerSize).nOut(lstmLayerSize)
//                        .weightInit(WeightInit.XAVIER)
//                        .activation(Activation.SIGMOID).build())
//                .layer(2, new LSTM.Builder().nIn(lstmLayerSize).nOut(lstmLayerSize)
//                        .weightInit(WeightInit.XAVIER)
//                        .activation(Activation.SIGMOID).build())
                .layer(1, new RnnOutputLayer.Builder(LossFunctions.LossFunction.MSE)
                        .weightInit(WeightInit.XAVIER)
                        .activation(Activation.IDENTITY)
                        .nIn(lstmLayerSize).nOut(outputSize).build())
                
                .backpropType(BackpropType.TruncatedBPTT)
                .tBPTTForwardLength(miniBatchSize)
                .tBPTTBackwardLength(miniBatchSize)
                .build();
        
        final var network = new MultiLayerNetwork(conf);
        final DataSet train = getTrain();
        final INDArray test = getTest();
        
        final DataNormalization normalizer = new NormalizerMinMaxScaler(0, 1);
//                                          = new NormalizerStandardize();
        
        normalizer.fitLabel(true);
        normalizer.fit(train);

        normalizer.transform(train);
        normalizer.transform(test);
        
        network.init();
        
        for (int i = 0; i < 100; i++)
            network.fit(train);
        
        final INDArray output = network.output(test);
        
        normalizer.revertLabels(output);
        
        System.out.println(output);
    }
    
    public static INDArray getTest() {
        double[][][] test = new double[][][]{
            {{20}, {203}},
            {{16}, {183}},
            {{20}, {190}},
            {{18.6}, {193}},
            {{18.9}, {184}},
            {{17.2}, {199}},
            {{20}, {190}},
            {{17}, {181}},
            {{19}, {197}},
            {{16.5}, {198}},
            ...
        };
        
        INDArray input = Nd4j.create(test);
        
        return input;
    }
    
    public static DataSet getTrain() {
        double[][][] inputArray = {
            {{18.7}, {181}},
            {{17.4}, {186}},
            {{18}, {195}},
            {{19.3}, {193}},
            {{20.6}, {190}},
            {{17.8}, {181}},
            {{19.6}, {195}},
            {{18.1}, {193}},
            {{20.2}, {190}},
            {{17.1}, {186}},
            ...
        };
        
        double[][] outputArray = {
                {3750},
                {3800},
                {3250},
                {3450},
                {3650},
                {3625},
                {4675},
                {3475},
                {4250},
                {3300},
                ...
        };
        
        INDArray input = Nd4j.create(inputArray);
        INDArray labels = Nd4j.create(outputArray);
        
        return new DataSet(input, labels);
    }
}

Here's an example of the output:

(UPDATED)

00:06:04.554 [main] WARN  o.d.nn.multilayer.MultiLayerNetwork - Cannot do truncated BPTT with non-3d inputs or labels. Expect input with shape [miniBatchSize,nIn,timeSeriesLength], got [99, 2, 1] and labels with shape [99, 1]
00:06:04.554 [main] WARN  o.d.nn.multilayer.MultiLayerNetwork - Cannot do truncated BPTT with non-3d inputs or labels. Expect input with shape [miniBatchSize,nIn,timeSeriesLength], got [99, 2, 1] and labels with shape [99, 1]
00:06:04.555 [main] WARN  o.d.nn.multilayer.MultiLayerNetwork - Cannot do truncated BPTT with non-3d inputs or labels. Expect input with shape [miniBatchSize,nIn,timeSeriesLength], got [99, 2, 1] and labels with shape [99, 1]
00:06:04.555 [main] WARN  o.d.nn.multilayer.MultiLayerNetwork - Cannot do truncated BPTT with non-3d inputs or labels. Expect input with shape [miniBatchSize,nIn,timeSeriesLength], got [99, 2, 1] and labels with shape [99, 1]
00:06:04.555 [main] WARN  o.d.nn.multilayer.MultiLayerNetwork - Cannot do truncated BPTT with non-3d inputs or labels. Expect input with shape [miniBatchSize,nIn,timeSeriesLength], got [99, 2, 1] and labels with shape [99, 1]
00:06:04.555 [main] WARN  o.d.nn.multilayer.MultiLayerNetwork - Cannot do truncated BPTT with non-3d inputs or labels. Expect input with shape [miniBatchSize,nIn,timeSeriesLength], got [99, 2, 1] and labels with shape [99, 1]
00:06:04.555 [main] WARN  o.d.nn.multilayer.MultiLayerNetwork - Cannot do truncated BPTT with non-3d inputs or labels. Expect input with shape [miniBatchSize,nIn,timeSeriesLength], got [99, 2, 1] and labels with shape [99, 1]
00:06:04.555 [main] WARN  o.d.nn.multilayer.MultiLayerNetwork - Cannot do truncated BPTT with non-3d inputs or labels. Expect input with shape [miniBatchSize,nIn,timeSeriesLength], got [99, 2, 1] and labels with shape [99, 1]
00:06:04.555 [main] WARN  o.d.nn.multilayer.MultiLayerNetwork - Cannot do truncated BPTT with non-3d inputs or labels. Expect input with shape [miniBatchSize,nIn,timeSeriesLength], got [99, 2, 1] and labels with shape [99, 1]
00:06:04.555 [main] WARN  o.d.nn.multilayer.MultiLayerNetwork - Cannot do truncated BPTT with non-3d inputs or labels. Expect input with shape [miniBatchSize,nIn,timeSeriesLength], got [99, 2, 1] and labels with shape [99, 1]

[[[3198.1614]], 

 [[2986.7781]], 

 [[3059.7017]], 

 [[3105.3828]], 

 [[2994.0127]], 

 [[3191.4468]], 

 [[3059.7017]], 

 [[2962.4341]], 

 [[3147.4412]], 

 [[3183.5991]]]

So far I've tried changing a number of hyperparameters, including the updater (previously Adam), the activation function in the hidden layers (previously ReLU), and the learning rate, none of which fixed the issue.

Thank you.

– Twisted Tea

1 Answer

This is always either a tuning issue or an input data issue. In your case, your input data is the problem.

You almost always need to normalize your input data, or your network won't learn anything. The same is true for your outputs: your output labels should also be normalized.
Snippets below:

    //Normalize data, including labels (fitLabel=true)
    NormalizerMinMaxScaler normalizer = new NormalizerMinMaxScaler(0, 1);
    normalizer.fitLabel(true);
    normalizer.fit(trainData);              //Collect training data statistics

    normalizer.transform(trainData);
    normalizer.transform(testData);

Here's how to revert:

    //Revert data back to original values for plotting
    normalizer.revert(trainData);
    normalizer.revert(testData);
    normalizer.revertLabels(predicted);

There are different kinds of normalizers; the one above just scales everything to the range 0 to 1. Sometimes NormalizerStandardize can be better here: it normalizes the data by subtracting the mean and dividing by the standard deviation. That will be something like this:

    NormalizerStandardize myNormalizer = new NormalizerStandardize();
    myNormalizer.fitLabel(true);
    myNormalizer.fit(sampleDataSet);

Afterwards your network should train normally.
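
Putting those pieces together, the order of operations looks roughly like this. This is a minimal sketch, not a drop-in: it assumes trainData and testData are DataSet objects and network is an initialized MultiLayerNetwork, as in the question.

    // Normalize features and labels using statistics from the training set only
    NormalizerMinMaxScaler normalizer = new NormalizerMinMaxScaler(0, 1);
    normalizer.fitLabel(true);               // labels are scaled too
    normalizer.fit(trainData);               // collect min/max from the training data
    normalizer.transform(trainData);
    normalizer.transform(testData);

    // Train on the normalized data
    for (int i = 0; i < 100; i++) {
        network.fit(trainData);
    }

    // Predict on normalized test features, then undo the label scaling
    INDArray predicted = network.output(testData.getFeatures());
    normalizer.revertLabels(predicted);      // back to the original label range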

Edit: If that doesn't work: because of the size of your dataset, note that DL4J also has a knob (I explained this in my comment below) that is normally true, where we assume your data is a minibatch drawn from a larger dataset. On most reasonable problems (read: not 10 data points) this works; otherwise the training can be all over the place. We can turn off the minibatch assumption with:

    ComputationGraphConfiguration conf = new NeuralNetConfiguration.Builder()
            .miniBatch(false)

The same option is available on MultiLayerNetwork configurations as well (the single-layer sketch further below sets this flag too).

Also of note: your architecture is vastly overkill for what is a VERY small, unrealistic problem for DL. DL usually requires a lot more data to work properly; that is why you see layers stacked multiple times. For a problem like this I would suggest reducing the number of layers to 1.

At each layer, what's essentially happening is a form of information compression. When your number of data points is small, you eventually lose signal through the network once you've saturated it, and subsequent layers tend not to learn very well in that case.
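
As a concrete starting point, a minimal single-layer version of the question's configuration might look like the sketch below. It reuses the constants from the question (inputSize, lstmLayerSize, outputSize, learningRate) and sets miniBatch(false) as discussed above; treat it as illustrative, not a guaranteed fix.

    MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
            .miniBatch(false)                               // no minibatch gradient scaling
            .updater(new Adam(learningRate))
            .list()
            .layer(0, new LSTM.Builder()                    // single recurrent layer
                    .nIn(inputSize).nOut(lstmLayerSize)
                    .weightInit(WeightInit.XAVIER)
                    .activation(Activation.TANH).build())
            .layer(1, new RnnOutputLayer.Builder(LossFunctions.LossFunction.MSE)
                    .weightInit(WeightInit.XAVIER)
                    .activation(Activation.IDENTITY)
                    .nIn(lstmLayerSize).nOut(outputSize).build())
            .build();

    MultiLayerNetwork network = new MultiLayerNetwork(conf);
    network.init();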

– Adam Gibson
  • thanks for the reply. Unfortunately the outputs are still almost identical after normalizing inputs using both of the above methods. I'll update the code and output to show the result with normalization. – Twisted Tea Feb 01 '23 at 03:04
  • So, let's move on to tuning, and also caveat some things. Not all data has signal. Randomly specified data that doesn't have any inherent signal isn't guaranteed to work. One other thing to consider is that you are suffering from what I like to call the "toy problem" effect. In ML, gradients are usually normalized by the batch size. You aren't really solving a real problem, so you end up with worse training due to that normalization. What you want is to avoid that, since you are training on the whole thing. Let me update my answer a bit with the solution to that. – Adam Gibson Feb 01 '23 at 03:52
  • I updated my comment with another thing you can try. – Adam Gibson Feb 01 '23 at 03:57
  • yeah my training set is tiny, I tried to shorten my data for the question just to show how it was formatted, but my training set is still only ~100 examples. The data should be fine, it's from a statistics training set for linear regression. I tried reducing the layers down to one and turning off minibatching, but the results still have very little deviation from each other. Not too sure if that's expected; my only experience with neural nets has been MLPs, but the results were much more accurate for the same training set. I updated my code & output above. – Twisted Tea Feb 01 '23 at 04:59
  • One other thing; I increased the number of epochs to see if the accuracy increased, and noticed a strange warning concerning the shape of my data occurs every cycle. I attached it in the output (I believe the "99" the warning refers to is my total number of training examples). – Twisted Tea Feb 01 '23 at 05:09
  • Try my update tips and let me know what you find. – Adam Gibson Feb 01 '23 at 05:32
  • Try reducing your number of layers to 1 as well. – Adam Gibson Feb 01 '23 at 05:36
  • Using one layer did seem to increase the range of the outputs significantly, though it's still pretty small and severely underfits the data. I edited my code and output with the results (note I did also try a sigmoid activation function as well on that one layer, with similar results). Again, not too sure if I should expect this from such a small training set, but I did have better results with an MLP, which fit the same training set with much better accuracy. Thanks again for all the help, let me know if you have any other ideas. – Twisted Tea Feb 01 '23 at 17:45
  • Ah apologies. Please leave the output layer on there. Change the identity function to an alternative activation function *only* on the first layer. Try tanh or sigmoid. Let me know what you find from that. – Adam Gibson Feb 01 '23 at 21:45
  • my bad, I updated that output and code. Along with tanh I tried RELU & sigmoid, they all performed similarly. – Twisted Tea Feb 02 '23 at 00:55
  • Change your updater to ADAM and reduce your learning rate. – Adam Gibson Feb 02 '23 at 12:20
  • yea no luck with Adam & lowering the learning rate (0.001 & 0.0001) – Twisted Tea Feb 02 '23 at 22:04