
I have a CNN that I define like this:

inputs = keras.Input(shape=(1024,1))
x = inputs

# 1st convolutional block
x = keras.layers.Conv1D(16, kernel_size=3, name='Conv_1')(x)
x = keras.layers.LeakyReLU(0.1)(x)
x = keras.layers.MaxPool1D(2, name='MaxPool_1')(x)

x = keras.layers.Flatten(name='Flatten')(x)

# Classifier
x = keras.layers.Dense(64, name='Dense_1')(x)
x = keras.layers.ReLU(name='ReLU_dense_1')(x)
x = keras.layers.Dropout(0.2)(x)
x = keras.layers.Dense(64, name='Dense_2')(x)
x = keras.layers.ReLU(name='ReLU_dense_2')(x)

I train it in one Google Colab session, then load the trained model and use Keras's model.predict(dataarr) to predict results with it.

The problem is that I would like to run predictions on large quantities of data, but the data is saved in .txt files that become very big (>8 GB), so Google Colab doesn't have enough RAM to open the files and read all of the data into a single array.

What's the best way of handling this? I'm producing the data in C++, and while I'm not an expert, it must be possible to write the data out in binary and convert it back when I read it. Is this a sensible option? Or is there a way of getting Keras to predict in batches, given that each set of 1024 lines in the .txt file is independent of the next set?

Beth Long
  • how can your input be greater than 8GB when your input size is set to `(1024,1)`? What are you trying to predict? Have you tried working with batches? – Ruli Aug 19 '20 at 18:27
  • @Ruli As far as I understand, Input(shape) means that the input to the model has exactly 1024 neurons, as each data point that I'm passing to the model is represented in 1024 lines in the .txt file. What the model does is to take waveforms made of 1024 numbers each, and predict how many particles hit a detector to create that waveform. I haven't tried predicting in batches because I don't know how to in keras, and I don't even know if it's the most intelligent solution – Beth Long Aug 19 '20 at 21:17

1 Answer


So what is an input shape?

From the Keras documentation:

shape: A shape tuple (integers), not including the batch size. For instance, shape=(32,) indicates that the expected input will be batches of 32-dimensional vectors. Elements of this tuple can be None; 'None' elements represent dimensions where the shape is not known.

What does it mean? Your input layer keras.Input(shape=(1024,1)) says that you are going to input samples of 1024 one-dimensional values, so 1024 values per sample. As you understand correctly, there are 1024 neurons in the input layer. A single neuron, however, doesn't work with a sequence of inputs (i.e. lines); it combines the inputs from the previous layer with its weights, or takes a single value as input, and every further value provided is just another independent evaluation. A convolutional layer, on the other hand, is a specific type of NN layer: it uses filters and tries to find patterns in the data, and it always expects data of the same shape, such as equally sized images or equally long portions of a signal.
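
For illustration, a minimal sketch of what that shape means in practice; here `model` is assumed to be the trained network from the question, and the batch of zeros is just dummy data:

import numpy as np

# Keras prepends the batch dimension, so a model built with
# keras.Input(shape=(1024, 1)) expects arrays of shape (n_samples, 1024, 1).
dummy_batch = np.zeros((8, 1024, 1), dtype=np.float32)  # 8 dummy waveforms
predictions = model.predict(dummy_batch)                # one prediction row per waveform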

If you want to provide data with an inconsistent shape, you have two options:

  1. Split the data into batches that fit the input shape, choosing a reasonable batch size that fits in your RAM. This might, however, lead to information loss, since your data might have continuity that is lost when it is split.
  2. Use another type of neural network suited to sequential data: recurrent neural networks such as LSTMs. These networks take an encoded char/word/value as a single input and process it through the network while partially memorizing the data. LSTM nets are widely used for text classification and do not require input of a static size the way most NNs do. If your data is built from a set of keys (natural text, source code, etc.), you should also think about encoding it through a hash map, if you haven't done so yet; you save space, and it is much more natural for a NN to work with numerical data.

As a side note, unless you have an extremely powerful machine, you simply don't want to train/test/run a NN on data this huge (assuming you have multiple files of that size); the time complexity of training on data of that size is too high and you might never get a trained model.

EDIT (after further explanation from the OP):

The above still applies in general, but not in this case; I'm leaving it there as it might be helpful to somebody else.

As for the OP's problem, batch loading should still be applied. The RAM won't get any larger, so splitting the dataset into chunks is needed. Loading, say, 100 or 1000 lines at once should not strain the RAM as much; you should experiment to find the limits of your machine. You can use the following code to load the lines:

with open("log.txt") as infile:
    for line in infile:
        do_something_with(line)

The file is closed after it has been processed, and the lines are freed from memory by the garbage collector. You can stack the lines into an ndarray and pass them to the predict() method. You can also provide a batch_size if you are not predicting a single sample.
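
For instance, a minimal sketch of that stacking step; the 1024-value event length and the 100-event batch size are assumptions drawn from the question, and `model` is the already-trained network:

import numpy as np

# Sketch: collect 1024-line events into an ndarray and predict batch by batch,
# so the whole file never has to fit in memory at once.
events, values = [], []
with open("log.txt") as infile:
    for line in infile:
        values.append(float(line))
        if len(values) == 1024:          # one complete event collected
            events.append(values)
            values = []
        if len(events) == 100:           # enough events for one batch
            batch = np.asarray(events, dtype=np.float32).reshape(-1, 1024, 1)
            predictions = model.predict(batch, batch_size=100)
            events = []
if events:                               # predict the leftover events, if any
    batch = np.asarray(events, dtype=np.float32).reshape(-1, 1024, 1)
    predictions = model.predict(batch)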

EDIT 2:

What you really need to do here is load n lines at a time (there is an existing Stack Overflow thread where this is done). You open the file and load it in chunks of n; in the example below, on sample data, I have chosen chunks of 2, but you can use whatever number you need, e.g. 1000.

from itertools import zip_longest
import numpy as np

n = 2  # or whatever chunk size you want
with open("file.txt", 'rb') as f:
    for n_lines in zip_longest(*[f] * n, fillvalue=b''):
        # decode the loaded byte strings to regular strings
        arr = np.char.decode(np.array(n_lines), encoding='utf-8')
        print(arr)

The data I have used in the sample file is as follows:

1dsds
2sdas
3asdsa
4asdsaad
5asdsaad
6dww
7vcvc
8uku
9kkk1

I have chosen an odd line count and a chunk size of 2, so you can see that the last chunk is padded with empty data. The output is the following:

['1dsds\n' '2sdas\n']
['3asdsa\n' '4asdsaad\n']
['5asdsaad\n' '6dww\n']
['7vcvc\n' '8uku\n']
['9kkk1' '']

This code loads 2 lines at a time; you can then remove the newlines, if needed, with [s.replace('\n', '') for s in arr].

To actually use the data returned, use yield and iterate over this function:

from itertools import zip_longest
import numpy as np

def batcher(filename: str):
    n = 2  # or whatever chunk size you want
    with open(filename, 'rb') as f:
        for n_lines in zip_longest(*[f] * n, fillvalue=b''):
            # decode the loaded byte strings to regular strings
            arr = np.char.decode(np.array(n_lines), encoding='utf-8')
            # drop the padding added to the last, incomplete chunk
            arr = arr[arr != '']
            # convert the strings to floats so Keras can work with them
            yield arr.astype(float)

for batch_i, arr in enumerate(batcher("file.txt")):
    out = model.predict(arr.reshape( your_shape_comes_here ))
    # do what you need with the predictions
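
As a follow-up sketch for the waveform data in the question (setting n = 1024 inside batcher and reshaping to (-1, 1024, 1) are assumptions based on the question, not something stated above):

# Sketch only: with n = 1024 inside batcher, each yielded arr holds exactly one
# event of 1024 floats, reshaped here to (batch, timesteps, channels).
for batch_i, arr in enumerate(batcher("file.txt")):
    out = model.predict(arr.reshape(-1, 1024, 1))
    # do what you need with the predictions
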
Ruli
  • Perhaps I haven't been clear. Each "event" that I want to evaluate is made up of 1024 data points. Therefore, 1024 input neurons to my network is the correct number. The data is not sequential, one event is entirely independent from the next, therefore I do not need an LSTM. My CNN is already trained, I just want to know the most efficient way of using it to evaluate a very large data sample – Beth Long Aug 20 '20 at 09:01
  • @BethLong I have edited the answer, my bad for misunderstanding what you do/need, you simply should predict on batches, that is a standard method for working with large datasets – Ruli Aug 20 '20 at 09:27
  • I still don't understand how the model.predict() line will look with my data in batches. Do I have to do model.predict(one_event) line by line for the infile, where one_event will be a [1,1024] array representing one event to be predicted? Will this be a lot easier or more efficient than using a different file type instead of txt? – Beth Long Aug 25 '20 at 14:07
  • @BethLong if your data is of shape `[1,1024]`, you can build 3D array of shape `[n,1,1024]` where n is number of samples - size of batch - and then feed this 3D array to the network by `model.predict(array, batch_size=n)`, the output will then be of shape `[n,1]` since you predict just a class label, first prediction to the first vector and so on, you can build such an array through the code I added in the answer above – Ruli Aug 26 '20 at 07:17
  • would you mind providing a MWE for opening a large file, splitting it into batches & using the batches in prediction? I'm not an expert in Python so I'm not sure how to build this 3D array or how to use it in my NN. Do I have to do the model.predict() inside the for loop? Thanks a lot in advance for your help! – Beth Long Sep 02 '20 at 13:29
  • @BethLong sorry for delay, I have already provided the code, you can open file as shown and then build for example array of 100 lines – Ruli Sep 08 '20 at 19:17
  • Hi Ruli, I'm looking back at this and I still don't understand how to use the code snippets that you've provided to read in a file in batches and predict based off these batches. I would be extremely grateful if you could provide a minimum working example – Beth Long Nov 01 '20 at 13:33
  • 1
    @BethLong will soon try to provide the code you need – Ruli Nov 01 '20 at 14:52
  • @BethLong all done, this should work when you implement it in your model – Ruli Nov 01 '20 at 19:34
  • I'm currently getting the error "Cast string to float is not supported" when I do the model.predict. My file is entirely floats, there are no strings, not even any empty lines. I assume this is a problem with the batcher function, is that correct? How can I solve this issue? – Beth Long Nov 05 '20 at 16:20
  • I fixed it by adding astype(np.float) to the yield. Unfortunately StackOverflow won't let me edit the answer. It now runs, but it's very slow. I wonder if there's a better solution – Beth Long Nov 05 '20 at 16:56
  • 1
    @BethLong I have done my best and probably can't help you anymore, if you appreciate my help, consider accepting and/or upvoting the answer. If you think there might be more efficient way to achieve this, you might consider asking a new question with current code targeting increase of efficiency. You might have more luck than in this 2 months old question. – Ruli Nov 06 '20 at 08:52