
I am trying to use LightGBM as a classifier. My data are saved in multiple CSV files, but I can't find a way to use multiple files directly as input.

I have considered combining all the data into one big numpy array, but my computer doesn't have enough memory. How can I use LightGBM with multiple data files when the available memory is limited?

Yuxiao Xu

3 Answers


Sample.

You shouldn't ever need to use your entire dataset (except in certain edge cases) if you sample correctly.

I use a database that has over 230M records, but I usually select only a random sample of anywhere from 1k to 100k rows to build the model.

Also, you might as well split your data into training, test, and validation sets; that will help cut down the size per file. A rough per-file sampling sketch follows.
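A minimal sketch of this idea, assuming pandas, scikit-learn, and lightgbm are installed; the data/ directory, the 'label' column, and the 5% sampling fraction are placeholders:

```python
import glob
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import train_test_split

# Keep only a small random fraction of each CSV instead of the full files
# (the path pattern and the 0.05 fraction are illustrative).
parts = []
for path in glob.glob("data/*.csv"):
    df = pd.read_csv(path)
    parts.append(df.sample(frac=0.05, random_state=42))

sample = pd.concat(parts, ignore_index=True)
X = sample.drop(columns="label")   # assumes a 'label' target column
y = sample["label"]

# Train/validation split on the sampled data only.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

clf = lgb.LGBMClassifier()
clf.fit(X_tr, y_tr, eval_set=[(X_val, y_val)])
```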

bbennett36

I guess that you are using Python.

What is the size of your data (number of rows × number of columns)?

LightGBM needs to load the data in memory for training. But if you haven't done so yet, you can choose a suitable datatype for every column of your data.

Using dtypes such as 'uint8' / 'uint16' can considerably reduce the memory footprint and help you load everything into memory; a quick sketch is shown below.
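For instance, a minimal pandas sketch (the column names, dtypes, and file name are hypothetical; adapt them to your own schema):

```python
import pandas as pd

# Declare compact dtypes per column so pandas doesn't default to int64/float64.
# Column names and types below are only placeholders.
dtypes = {
    "user_id": "uint32",
    "clicks": "uint16",
    "age": "uint8",
    "score": "float32",
}

df = pd.read_csv("part_01.csv", dtype=dtypes)
print(df.memory_usage(deep=True).sum() / 1e6, "MB")
```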

Florian Mutel
  • The data shape is 1.7 million × 512, all float values, so they can't be converted to uint8. Actually, I am considering the practical application of LightGBM. In an experiment the amount of data is fixed, and we can solve this problem by adding more memory. However, in applications the data size keeps growing, so the lack of memory will always be a problem. Making use of large data files might be the final solution. – Yuxiao Xu Apr 28 '18 at 00:42
  • Then I see two options: 1) downsampling of your data, usually not taking all instances of the majority class; it lets you control the size of your training dataset, and if you want to use all the data you can still build multiple models (cf. [example code for xgboost](https://www.kaggle.com/snowdog/xgb-sandwich/code)); 2) use another ML algorithm designed to be trained batch by batch with the data kept on disk (for example scikit-learn estimators with a 'partial_fit' method, or [FM-FTRL](https://www.kaggle.com/anttip/talkingdata-wordbatch-fm-ftrl-lb-0-9769)); a sketch of option 2 follows these comments. – Florian Mutel Apr 30 '18 at 14:04
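A minimal sketch of the second option, using scikit-learn's SGDClassifier (a linear model, not LightGBM) and reading each CSV in chunks; the path pattern, chunk size, 'label' column, and class labels are placeholders:

```python
import glob
import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier()            # linear classifier that supports partial_fit
classes = np.array([0, 1])       # all class labels must be known up front

# Stream the CSVs chunk by chunk so the full dataset never sits in memory.
for path in glob.glob("data/*.csv"):
    for chunk in pd.read_csv(path, chunksize=100_000):
        X = chunk.drop(columns="label").to_numpy(dtype="float32")
        y = chunk["label"].to_numpy()
        clf.partial_fit(X, y, classes=classes)
```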

You might want to categorize your features and then one-hot-encode them. LightGBM works well with sparse features such as one-hot-encoded ones thanks to its EFB (Exclusive Feature Bundling), which significantly improves its computational efficiency. Moreover, you definitely get rid of the floating-point parts of the numbers.

Think of categorization like this: say the values of one numerical feature vary between 36 and 56; you can digitize it with bin edges [36, 36.5, 37, ..., 55.5, 56] or [40, 45, 50, 55] to make it categorical. It's up to your expertise and imagination. You can refer to scikit-learn for one-hot encoding; it has a built-in function for that (see the sketch below).
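A minimal sketch of that binning plus one-hot step, assuming numpy and scikit-learn; the coarse bin edges match the example above and the feature values are made up:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up feature values and the coarse bin edges from the example above.
values = np.array([37.2, 41.9, 48.0, 55.3])
edges = np.array([40, 45, 50, 55])

# np.digitize maps each value to the index of the bin it falls into.
bins = np.digitize(values, edges).reshape(-1, 1)   # -> [[0], [1], [2], [4]]

# One-hot encode the bin indices into a sparse matrix LightGBM can consume.
encoder = OneHotEncoder(handle_unknown="ignore")
one_hot = encoder.fit_transform(bins)
print(one_hot.toarray())
```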

PS: With a numerical feature, always inspect its statistical properties; you can use pandas' describe(), which summarizes the mean, max, min, std, etc.

Ugur MULUK