I have a .csv file with 36000 columns (values 0 or 1) and 26500 lines, representing the input of my training set, and a second .csv file with 1 column and 26500 lines, representing the output (0 or 1).
I use sklearn (splitting my data 80/20 into train/test sets) to train my model and validate it.
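Concretely, the split looks like this (X and y below stand for the two .csv files loaded as arrays; the names are only for illustration):

from sklearn.model_selection import train_test_split

# X: 26500 x 36000 matrix of 0/1 features, y: 26500 labels (0 or 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)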
First question: I don't know how to choose the algorithm best suited to this problem.
from sklearn.neural_network import MLPClassifier

clf = MLPClassifier(solver='lbfgs', alpha=1e-4, hidden_layer_sizes=(5, 5), random_state=1)
I tried that one, for example. But how can I know that it's the best one? I can't try all the algorithms; it would take too long.
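To make the first question concrete, this is the kind of per-candidate comparison I would have to repeat for every algorithm (the two classifiers below are only examples, not choices I have settled on, and X_train / y_train come from the split above):

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Every candidate needs its own cross-validation run, which is what takes so long.
for candidate in (LogisticRegression(max_iter=1000), RandomForestClassifier(n_estimators=100)):
    scores = cross_val_score(candidate, X_train, y_train, cv=5)
    print(type(candidate).__name__, scores.mean())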
Second question: I ran into a memory problem. I had to split my input into 14 different files (2000 lines per file) because I couldn't open the entire file with Python:
import csv

# Read one chunk file row by row
with open(file_i, 'rt') as csvfile:
    lecteur = csv.reader(csvfile, delimiter=',')
    for ligne in lecteur:
        # ...
So now I can open each file and build a list with my data, but it is impossible to train the model on all 14 files because of the error "OOM when allocating tensor of shape".
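For reference, import_input (used in the training loop below) is my helper that reads one chunk; the body shown here is only a sketch of what it does, and the file name pattern is just an illustration:

import csv

def import_input(i):
    # Read chunk file number i (2000 lines of comma-separated 0/1 values) into a list of rows.
    rows = []
    with open('input_%d.csv' % i, 'rt') as csvfile:  # illustrative file name pattern
        lecteur = csv.reader(csvfile, delimiter=',')
        for ligne in lecteur:
            rows.append([int(x) for x in ligne])
    return rows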
I tried to use the parameter warm_start=True because I read that it lets the model reuse the solution of the previous call to fit as the starting point for the next one.
def train(clf, means, output):
    clf.fit(means, output)
    return clf

for i in range(0, 14):
    means = import_input(i)
    output = import_output(i, means)
    clf = MLPClassifier(solver='lbfgs', alpha=1e-4, hidden_layer_sizes=(30000, 5),
                        random_state=1, warm_start=True)
    clf = train(clf, means, output)
But it doesn't work.
Third question: once I have found the algorithm best suited to my problem, how can I find the best parameters? I tried to use a genetic algorithm for this, but I ran into the same memory problem when I tried to run the 20 generations.
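To make the third question concrete, this is the kind of exhaustive search I have in mind (GridSearchCV is only shown here as an illustration of the parameter search, not the genetic algorithm I actually tried, and the grid values are arbitrary):

from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

param_grid = {
    'alpha': [1e-5, 1e-4, 1e-3],
    'hidden_layer_sizes': [(5, 5), (10, 10), (50,)],
}
search = GridSearchCV(MLPClassifier(solver='lbfgs', random_state=1), param_grid, cv=3)
search.fit(X_train, y_train)
print(search.best_params_)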