What is the purpose of lightgbm.Dataset() as per the docs when I can use the sklearn API to feed the data and train a model?
Any real-world examples explaining the usage of lightgbm.Dataset() would also be helpful.
LightGBM uses a few techniques to speed up training which require preprocessing one time before training starts.
The most important of these is bucketing continuous features into histograms. When LightGBM searches for splits to add to a tree, it only evaluates the boundaries of these histogram bins, which greatly reduces the number of candidate splits to consider.
I think the explanation in "What Makes LightGBM Fast?" describes this histogram-based split finding well.
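As a rough illustration of the idea (a toy numpy sketch, not LightGBM's actual binning code), compare the number of raw thresholds a split search would have to scan with the number of bin boundaries left after bucketing:

import numpy as np

# Toy illustration only: bucket one continuous feature into a fixed number
# of histogram bins, then treat only the bin boundaries as candidate splits
# instead of every distinct feature value.
rng = np.random.default_rng(0)
feature = rng.normal(size=10_000)

n_bins = 255  # LightGBM's default max_bin is 255
bin_edges = np.quantile(feature, np.linspace(0, 1, n_bins + 1))

print(f"distinct values to scan without binning: {np.unique(feature).size}")
print(f"candidate split points with binning:     {bin_edges.size - 2}")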
The Dataset object in the library is where this preprocessing happens. Histograms are created one time, and then don't need to be calculated again for the rest of training.
You can get more information about what happens in the Dataset object by looking at the parameters that control it, available at https://lightgbm.readthedocs.io/en/latest/Parameters.html#dataset-parameters. Some examples of other tasks handled at Dataset construction: bundling sparse features together (enable_bundle) and mapping categorical feature values into bins (categorical_feature).
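For example, here is a minimal sketch of the native API, where the one-time binning happens when the Dataset is built; the toy data and parameter values are just placeholders:

import numpy as np
import lightgbm as lgb

# Placeholder toy data; a pandas DataFrame or scipy sparse matrix also works.
rng = np.random.default_rng(42)
X = rng.normal(size=(1_000, 10))
y = rng.normal(size=1_000)

# Dataset-level parameters (e.g. max_bin) control the one-time preprocessing.
train_set = lgb.Dataset(X, label=y, params={"max_bin": 63})

# Binning happens once when the Dataset is constructed; boosting then reuses
# those histograms for every tree it grows.
booster = lgb.train(
    {"objective": "regression", "num_leaves": 31, "verbose": -1},
    train_set,
    num_boost_round=50,
)
print(booster.num_trees())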
"when I can use the sklearn API to feed the data and train a model"
The lightgbm.sklearn interface is intended to make it easy to use LightGBM alongside other libraries like xgboost and scikit-learn. It takes in data in formats like scipy sparse matrices, pandas data frames, and numpy arrays to be compatible with those other libraries. Internally, LightGBM constructs a Dataset from those inputs.
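A minimal sketch of the same workflow through the sklearn interface (again with placeholder toy data): fit() builds the Dataset for you, and the trained Booster is exposed via the booster_ attribute:

import numpy as np
from lightgbm import LGBMRegressor

rng = np.random.default_rng(42)
X = rng.normal(size=(1_000, 10))
y = rng.normal(size=1_000)

# fit() accepts numpy arrays, pandas DataFrames, or scipy sparse matrices and
# constructs a lightgbm.Dataset internally before training.
model = LGBMRegressor(n_estimators=50, num_leaves=31)
model.fit(X, y)

print(model.predict(X[:5]))

# The underlying Booster is the same kind of object lgb.train() returns.
print(model.booster_.num_trees())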