
Background: The training set has 100M rows and about 50 columns. I have already downcast every column to the smallest dtype that fits, but the DataFrame still takes 8-10 GB when loaded.

Training runs on AWS EC2 instances (one with 36 CPUs and 72 GB RAM, another with 16 CPUs and 128 GB RAM).

Problems:

1. When I load the data into a pandas DataFrame and train XGBoost with the default config, memory usage soon explodes.
2. I also tried a Dask DataFrame with a distributed client and `dask_ml.xgboost`. It ran a bit longer, but I got worker-failure warnings and progress stalled.

So, is there a way to estimate how much RAM I need to make sure it is enough?
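As a rough starting point, a lower bound can be worked out from the row and column counts alone. The sketch below assumes XGBoost holds feature values as 32-bit floats (so the downcast dtypes in the DataFrame do not carry over), and it uses a hypothetical `copies` factor to stand in for the extra copies made while converting and partitioning the data. Both helper names are made up for illustration, and the result is only a floor, not a measurement of what training actually needs.

```python
def estimate_training_bytes(n_rows, n_cols, bytes_per_value=4, copies=1):
    """Lower-bound estimate for a dense numeric matrix.

    bytes_per_value=4 assumes float32 storage.
    copies is a hypothetical multiplier for intermediate copies
    (e.g. the source DataFrame plus the converted training matrix).
    """
    return n_rows * n_cols * bytes_per_value * copies


def to_gib(n_bytes):
    """Convert a byte count to GiB."""
    return n_bytes / 2**30


# Figures from the question: 100M rows, about 50 columns.
one_copy = estimate_training_bytes(100_000_000, 50)
two_copies = estimate_training_bytes(100_000_000, 50, copies=2)
print(f"one float32 copy:   {to_gib(one_copy):.1f} GiB")
print(f"two float32 copies: {to_gib(two_copies):.1f} GiB")
```

Under these assumptions a single float32 copy is already about twice the size of the 8-10 GB DataFrame, before counting gradients, histograms, or per-worker overhead. Measuring peak memory on a small row sample and scaling up is likely to be more reliable than this arithmetic.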

Here is the code:

import pandas as pd
import dask.dataframe as ddf
import dask_ml.xgboost as dxgb

# Load the whole training set into pandas, then split it into Dask partitions.
train = pd.read_parquet('train_latest', engine='pyarrow')
train = ddf.from_pandas(train, npartitions=72)

# feats (list of feature columns) and label (target column) are defined elsewhere.
X, y = train[feats], train[label]
X_train, y_train, X_test, y_test = make_train_test(X, y)  # customized function to divide train/test

model = dxgb.XGBClassifier(n_estimators=1000,
                           verbosity=1,
                           n_jobs=-1,
                           max_depth=10,
                           learning_rate=0.1)
model.fit(X_train, y_train)
Argos.LEE
  • Don't know about estimating RAM usage, but with >1 billion data points you should be looking at special-purpose solutions. Perhaps [vaex.ml](https://vaex.readthedocs.io/en/latest/index.html#) would be useful, e.g. [this tutorial](https://towardsdatascience.com/ml-impossible-train-a-1-billion-sample-model-in-20-minutes-with-vaex-and-scikit-learn-on-your-9e2968e6f385) and [accompanying notebook](https://nbviewer.jupyter.org/github/vaexio/vaex-examples/blob/master/medium-nyc-taxi-data-ml/vaex-taxi-ml-article.ipynb) – jared_mamrot Dec 02 '20 at 02:50
  • Well, I tried to edit the question but failed... Anyway, I more or less solved this by using the `external memory` option offered by xgb. However, when I then tried external memory + GPU training (an instance with a 12 GiB NVIDIA card), it didn't work out. The kernel soon restarts and my GPU log shows a surge in GPU usage, so I guess I need a more powerful GPU instance... – Argos.LEE Dec 02 '20 at 05:14
