
I'm trying to train a CatBoostClassifier with around 22 GB of data in a CSV file that has around 50 columns. I tried loading all the data at once into a pandas DataFrame but couldn't. Is there any way I could train the model with multiple chunks of DataFrames in CatBoost?

Mathan Kumar
  • Did you try the method recommended to you [here](https://github.com/catboost/catboost/issues/152)? – SiHa Nov 03 '17 at 11:42
  • @SiHa That worked up to a point. However, the classifier process got killed during model.fit(), which depends on various factors like data size (number of rows), iteration count, and depth. The ideal solution would be running in a distributed environment, which is in their pipeline. – Mathan Kumar Nov 03 '17 at 12:38

2 Answers


CatBoost incremental fit for huge data files.

You can train your model incrementally, as long as you train on CPU and pass init_model as a fit parameter. Here is an example of how to do that:

from catboost import CatBoostClassifier
import pandas as pd
from sklearn.model_selection import train_test_split

clf = CatBoostClassifier(task_type="CPU",
                         iterations=2000,
                         learning_rate=0.2,
                         max_depth=1)

# Stream the big CSV in chunks instead of loading it all at once
chunks = pd.read_csv('BigDataFile.csv', chunksize=100000)
for i, ds in enumerate(chunks):
    W = ds.values
    X = W[:, :-1].astype(float)  # all columns except the last are features
    Y = W[:, -1].astype(int)     # the last column is the label
    del W
    if i == 0:
        # First chunk: hold out a fixed validation set and train from scratch
        X_train, X_val, Y_train, Y_val = train_test_split(X, Y,
                                                          train_size=0.80,
                                                          random_state=1234)
        del X, Y
        clf.fit(X_train, Y_train,
                eval_set=(X_val, Y_val))
    else:
        # Later chunks: continue training from the model saved in the previous step
        clf.fit(X, Y,
                eval_set=(X_val, Y_val),
                init_model='model.cbm')
    clf.save_model('model.cbm')  # save the model so it is loaded in the next step

And you are good to go. This only works on CPU. Do not use a snapshot file or use_best_model. After the initial step, the model file will be loaded and training will continue incrementally for as long as you have data left.
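
Once all chunks are processed, 'model.cbm' holds the final incrementally trained model. A minimal usage sketch (the random matrix is just a placeholder for real feature rows, assuming 49 feature columns as in the code above):

import numpy as np
from catboost import CatBoostClassifier

clf = CatBoostClassifier()
clf.load_model('model.cbm')       # the file written by save_model above

X_new = np.random.rand(5, 49)     # placeholder rows with the same feature layout
preds = clf.predict(X_new)        # predicted class labels
probs = clf.predict_proba(X_new)  # class probabilities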

  • For the sake of completeness - batched training will also do the job. Slightly more coding, but can use GPUs. https://catboost.ai/en/docs/concepts/python-usages-examples#batch-training – Anatoly Alekseev Jan 12 '22 at 11:46
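
For reference, here is a minimal sketch of the baseline-based batch training that comment points to, with synthetic data standing in for the real chunks; Pool, baseline, and RawFormulaVal are standard CatBoost API, but see the linked docs for the exact recipe:

import numpy as np
from catboost import CatBoostClassifier, Pool

rng = np.random.default_rng(0)
X1, y1 = rng.normal(size=(1000, 5)), rng.integers(0, 2, size=1000)
X2, y2 = rng.normal(size=(1000, 5)), rng.integers(0, 2, size=1000)

# Train on the first chunk (task_type="GPU" is also possible with this approach)
model1 = CatBoostClassifier(iterations=200, verbose=False)
model1.fit(X1, y1)

# Continue boosting on the next chunk, starting from model1's raw scores
baseline = model1.predict(X2, prediction_type="RawFormulaVal")
model2 = CatBoostClassifier(iterations=200, verbose=False)
model2.fit(Pool(X2, label=y2, baseline=baseline))

# Combined raw score on new data = sum of the two models' raw scores
X_new = rng.normal(size=(10, 5))
raw = (model1.predict(X_new, prediction_type="RawFormulaVal")
       + model2.predict(X_new, prediction_type="RawFormulaVal"))
prob = 1.0 / (1.0 + np.exp(-raw))  # sigmoid for binary class probability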

I am not sure, but you can try the save_snapshot and snapshot_file options of the model. Their purpose is to make it possible to continue learning if training was interrupted.

model = CatBoostClassifier(iterations=50,
                           save_snapshot=True,
                           snapshot_file='model_binary_snapshot.model',
                           random_seed=42)

It will save the model under 'model_binary_snapshot.model', and you can reload it and continue learning.

model2 = CatBoostClassifier()
model2.load_model('model_binary_snapshot.model')
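
For what it's worth, the snapshot is normally consumed by repeating the same fit call: if snapshot_file already exists, training resumes from it. A minimal sketch with synthetic data (the file name is the one from the answer):

import numpy as np
from catboost import CatBoostClassifier

rng = np.random.default_rng(42)
X, y = rng.normal(size=(500, 4)), rng.integers(0, 2, size=500)

model = CatBoostClassifier(iterations=50,
                           save_snapshot=True,
                           snapshot_file='model_binary_snapshot.model',
                           random_seed=42)
model.fit(X, y)  # if a previous run was interrupted, this call resumes from the snapshot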
youpi
  • Snapshot cannot be used for incremental learning. It is meant for resuming an interrupted training, not for incremental training. The data is not the same, so it will bark at you. – Sep 14 '20 at 17:02