
What is the purpose of lightgbm.Dataset() as per the docs when I can use the sklearn API to feed the data and train a model?

Any real-world examples explaining the usage of lightgbm.Dataset() would be interesting to see.

1 Answer


LightGBM uses a few techniques to speed up training which require preprocessing one time before training starts.

The most important of these is bucketing continuous features into histograms. When LightGBM searches for splits to add to a tree, it evaluates only the boundaries of these histogram bins, which greatly reduces the number of candidate splits.
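To make the idea concrete, here is a minimal numpy sketch (not LightGBM's actual implementation) of how binning shrinks the candidate-split search space. The quantile-based edges and the synthetic feature are illustrative assumptions; LightGBM's default `max_bin` of 255 is taken from its documentation.

```python
import numpy as np

# A hypothetical continuous feature with many distinct values.
rng = np.random.default_rng(42)
feature = rng.normal(size=10_000)

# Without binning, every distinct value is a possible split threshold.
n_candidates_exact = len(np.unique(feature))

# With histogram binning (LightGBM's default is max_bin=255), only the
# bin boundaries are candidate thresholds, regardless of dataset size.
max_bin = 255
bin_edges = np.quantile(feature, np.linspace(0.0, 1.0, max_bin + 1))
n_candidates_binned = len(bin_edges) - 1

print(n_candidates_exact, "candidate splits reduced to", n_candidates_binned)
```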

I think this picture from "What Makes LightGBM Fast?" describes it well:

[Figure: diagram from "What Makes LightGBM Fast?" showing continuous feature values being bucketed into histogram bins before split search]

The Dataset object in the library is where this preprocessing happens. Histograms are created one time, and then don't need to be calculated again for the rest of training.

You can get some more information about what happens in the Dataset object by looking at the parameters that control the Dataset, available at https://lightgbm.readthedocs.io/en/latest/Parameters.html#dataset-parameters. Some examples of other tasks:

  • optimization for sparse features
  • filtering out features that are not splittable

when I can use the sklearn API to feed the data and train a model

The lightgbm.sklearn interface is intended to make it easy to use LightGBM alongside other libraries like xgboost and scikit-learn. It takes in data in formats like scipy sparse matrices, pandas data frames, and numpy arrays to be compatible with those other libraries. Internally, LightGBM constructs a Dataset from those inputs.

James Lamb
  • The last paragraph was something I was looking for. – Suffer Surfer Feb 03 '21 at 17:53
  • Was there any advantage to using lightgbm.Dataset() compared to lightgbm.sklearn? – didi Feb 16 '22 at 03:38
  • Every time you perform training with the estimators in `lightgbm.sklearn`, `lightgbm` will need to re-do the work of constructing a `Dataset` from your raw data (e.g. `scipy` and `numpy` matrices). If you instead use `lgb.Dataset()` then train with `lgb.train()`, you'll only need to do that Dataset construction work one time. Once you've constructed a `Dataset`, you can delete the raw data, freeing a possibly-significant amount of memory. That isn't an option if you use the estimators in `lightgbm.sklearn`. – James Lamb Feb 16 '22 at 05:34