So what is an input shape?
From the Keras documentation:
shape: A shape tuple (integers), not including the batch size. For instance, shape=(32,) indicates that the expected input will be batches of 32-dimensional vectors. Elements of this tuple can be None; 'None' elements represent dimensions where the shape is not known.
What does this mean? Your input layer keras.Input(shape=(1024,1))
says that every input sample will consist of 1024 one-dimensional values, i.e. 1024 values per sample. As you correctly understand, there are 1024 neurons in the input layer. A single neuron, however, does not work with a sequence of inputs (i.e. lines); it either combines the inputs from the neurons of the previous layer with its weights, or takes a single value as input. Every further value provided (as from the sequence) is just another independent evaluation. Read more about neurons here. A convolutional layer, on the other hand, is a specific type of NN: it uses filters and tries to find patterns in the provided data, always expecting data of the same shape, such as equally sized images or portions of a signal.
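A minimal sketch of what such a fixed input shape implies (the Conv1D filter count, kernel size and output layer below are arbitrary placeholders, not anything from the question):

from tensorflow import keras
from tensorflow.keras import layers

# every sample fed to this model must have exactly 1024 time steps with 1 feature each;
# only the batch dimension (how many samples per batch) stays flexible
inputs = keras.Input(shape=(1024, 1))
x = layers.Conv1D(filters=16, kernel_size=3, activation='relu')(inputs)
x = layers.GlobalMaxPooling1D()(x)
outputs = layers.Dense(1, activation='sigmoid')(x)  # placeholder output layer
model = keras.Model(inputs, outputs)
model.summary()  # shows (None, 1024, 1) -- None is the batch dimension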
In case you want to provide data with an inconsistent shape, you have two options:
- Split the data into batches that fit the input shape and choose a reasonable batch size that fits into your RAM; this might, however, lead to information loss, since your data might have a continuity that is lost when it is split
- Use another type of neural network suited for sequential data - recurrent neural networks such as LSTM. These networks take an encoded char/word/value as a single input and process it through the network while partially memorizing the data. LSTM networks are widely used for text classification and do not require input of a static size, as most NNs do. You should also think about encoding your data through a hash map (if you have not done so already) if you work with data built from a set of keys, such as natural text, source code, etc. You save space, and it is far more intuitive for an NN to work with numerical data. A rough sketch of this option follows after this list.
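The sketch below assumes your lines are already encoded as integer token ids; the vocabulary size and layer sizes are placeholders chosen only for illustration:

from tensorflow import keras
from tensorflow.keras import layers

# shape=(None,) leaves the sequence length open -- different batches may have different lengths
inputs = keras.Input(shape=(None,))
x = layers.Embedding(input_dim=10000, output_dim=64)(inputs)  # 10000 = assumed vocabulary size
x = layers.LSTM(64)(x)                                        # processes the sequence with partial memory
outputs = layers.Dense(1, activation='sigmoid')(x)            # placeholder output layer
model = keras.Model(inputs, outputs)

Within a single batch the sequences still have to be padded to a common length, but each batch can use a different length.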
As a side note, in case you don't have an extremely powerful machine, you simply don't want to train/test/execute an NN with data this huge (assuming you have multiple files of such size); the time complexity of training on data of this size is too high and you might never get your trained model.
EDIT
After further explanation from OP:
The above still applies, but not in this case; I am leaving it there as it might be helpful to somebody else.
Regarding the OP's problem, batch loading should still be applied. RAM won't get any larger than it is, so splitting the dataset into chunks is needed. Loading e.g. 100 or 1000 lines at once should not strain the RAM as much - you should experiment to find out where the limits of your machine are. You can use the following code to load the lines:
with open("log.txt") as infile:
for line in infile:
do_something_with(line)
The file will be closed after it has been processed, and the lines will be freed from memory by the garbage collector. You can stack the lines into an ndarray and pass them to the predict() method. You need to provide batch_size as well if you are not predicting a single sample.
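As a rough illustration only (the comma-separated format and the 1024-values-per-line assumption are mine, not from the question; model is the trained Keras model):

import numpy as np

samples = []
with open("log.txt") as infile:
    for line in infile:
        # hypothetical parsing: each line is assumed to hold 1024 comma-separated numbers,
        # i.e. one complete sample for the (1024, 1) input shape
        samples.append(np.array(line.split(','), dtype=float))

batch = np.stack(samples).reshape(-1, 1024, 1)     # (n_samples, 1024, 1)
predictions = model.predict(batch, batch_size=32)  # batch_size matters when predicting many samples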
EDIT 2:
What you really need to do here is to load n lines at a time; a thread where this is done is here. You open the file and load it in chunks of n lines. In the example below, run on sample data, I have chosen chunks of 2, but you can use whatever number you need, e.g. 1000.
from itertools import zip_longest
import numpy as np

n = 2  # or whatever chunk size you want
with open("file.txt", 'rb') as f:
    for n_lines in zip_longest(*[f]*n, fillvalue=b''):
        arr = np.char.decode(np.array(n_lines), encoding='utf_8')
        print(arr)
The data I have used in the sample file are as follows:
1dsds
2sdas
3asdsa
4asdsaad
5asdsaad
6dww
7vcvc
8uku
9kkk1
I have chosen an odd line count and 2 as the chunk size, so you can see that the last chunk is padded with empty data; the output is the following:
['1dsds\n' '2sdas\n']
['3asdsa\n' '4asdsaad\n']
['5asdsaad\n' '6dww\n']
['7vcvc\n' '8uku\n']
['9kkk1' '']
This code loads 2 lines at a time; you can then remove the newlines, if needed, with [s.replace('\n', '') for s in arr].
To successfully use the returned data, use yield and iterate over this function:
from itertools import zip_longest
import numpy as np

def batcher(filename: str):
    n = 2  # or whatever chunk size you want
    with open(filename, 'rb') as f:
        for n_lines in zip_longest(*[f]*n, fillvalue=b''):
            # decode the loaded byte arrays to strings
            arr = np.char.decode(np.array(n_lines), encoding='utf_8')
            yield arr.astype(float)  # np.float is deprecated; plain float (float64) works

for batch_i, arr in enumerate(batcher("file.txt")):
    out = model.predict(arr.reshape( your_shape_comes_here ))
    # do what you need with the predictions