
I have a .csv file with 36,000 columns (each value 0 or 1) and 26,500 rows, representing the input of my training set, and a second .csv file with 1 column and 26,500 rows, representing the output (0 or 1).

I use sklearn to train my model and validate it, splitting my data 80/20 into train/test sets.

First question: I don't know how to choose the best algorithm for this problem.

clf = MLPClassifier(solver='lbfgs',alpha=1e-4, hidden_layer_sizes=(5, 5), random_state=1)

I tried that one, for example. But how can I know that it's the best one? I can't try all the algorithms; that would take too long.

Second question: I ran into a memory problem. I had to split my file into 14 different files (2,000 rows per file) because I couldn't open the entire file with Python:

import csv

with open(file_i, 'rt') as csvfile:
    lecteur = csv.reader(csvfile, delimiter=',')
    for ligne in lecteur:
        # ...

So now I can open each file and build a list with my data, but it is impossible to train the model on all 14 files because of the error "OOM when allocating tensor of shape".

I tried to use the parameter warm_start=True, because I read that it allows the model to reuse the solution of the previous call to fit.

def train(clf, means, output):
    clf.fit(means, output)
    return clf

for i in range(0, 14):
    means = import_input(i)
    output = import_output(i, means)
    clf = MLPClassifier(solver='lbfgs', alpha=1e-4, hidden_layer_sizes=(30000, 5), random_state=1, warm_start=True)
    clf = train(clf, means, output)

But it doesn't work.

Third question: once I find the best algorithm for my problem, how can I find its best parameters? I tried to use a genetic algorithm for this, but I got the same memory problem when trying to generate the 20 generations.

Matthieu Veron
  • you can use a **randomly chosen** part of your dataset (let's say 5 to 10 percent) and test a lot of different algorithms on it. Take the best (or top 5) algorithm(s) you have and run them with your complete set. The same goes for parameter optimization: try different parameter combinations with a small part of your dataset (sklearn brings grid search for that task). – Florian H Feb 08 '18 at 10:50
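The subsampling idea from the comment above could be sketched like this; the synthetic data and the two classifiers are illustrative stand-ins for the real files and whatever algorithms you want to compare:

```python
# Sketch: compare several algorithms on a random ~10% subsample first,
# then rerun only the best ones on the full dataset.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(1000, 50))   # stand-in for the real 0/1 matrix
y = rng.randint(0, 2, size=1000)

# Draw a random 10% of the rows without replacement.
sample = rng.choice(len(X), size=len(X) // 10, replace=False)
X_small, y_small = X[sample], y[sample]

for clf in (LogisticRegression(max_iter=1000), BernoulliNB()):
    scores = cross_val_score(clf, X_small, y_small, cv=3)
    print(type(clf).__name__, scores.mean())
```

Cross-validating on the subsample keeps each candidate's run cheap; only the winners need to see all 26,500 rows.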

1 Answer


You can use a Pipeline combined with GridSearchCV to select the best parameters and algorithm based on a scoring criterion. In particular, for feature selection you can use the classes under scikit-learn's dimensionality reduction or feature selection modules here. An example of their use can be found here.
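A minimal sketch of what that combination looks like; SelectKBest with chi2 is just one possible feature-selection step, LogisticRegression one possible estimator, and the data below is synthetic:

```python
# Pipeline = feature selection step + classifier; GridSearchCV then
# cross-validates every parameter combination and keeps the best.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(1)
X = rng.randint(0, 2, size=(300, 40))    # stand-in for the real 0/1 matrix
y = rng.randint(0, 2, size=300)

pipe = Pipeline([
    ('select', SelectKBest(chi2)),               # keep the k best columns
    ('clf', LogisticRegression(max_iter=1000)),
])

# Parameters are addressed as <step name>__<parameter name>.
param_grid = {
    'select__k': [10, 20, 40],
    'clf__C': [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```

Because the grid is cross-validated, `search.best_params_` gives you the parameter combination directly, which also addresses the third question without a hand-rolled genetic algorithm.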