
I am trying to save and load lightgbm datasets using the save_binary method.

The following seems to work for the saving part:

import numpy as np
import lightgbm as lgb

data = lgb.Dataset(np.array([[1,2],[12,2]]))
data.save_binary('test.bin')

But so far, I have not been able to load the dataset back. Does anyone have an idea how I should proceed here?

Many thanks!

DvdG
  • _How_ are you trying to load the dataset? Please provide a [mcve] and explain what exactly doesn't work and how you're expecting it to work. – ForceBru Aug 17 '21 at 09:51
  • Thanks for your input. Please read the question: I'm not facing something that doesn't work, as you suggest, but simply asking whether there is an efficient way to load a lightgbm dataset saved with 'save_binary' (see the reproducible example). – DvdG Aug 17 '21 at 10:22

1 Answer


Short Answer

You can create a new Dataset from a file created with .save_binary() by passing a path to that file to the data argument of lgb.Dataset().

Try this example with Python 3.7, numpy==1.21.0, scikit-learn==0.24.1, and lightgbm==3.2.1.

import lightgbm as lgb
from numpy.testing import assert_equal
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)

# construct a Dataset from arrays in memory
dataset_in_mem = lgb.Dataset(
    data=X,
    label=y
)
dataset_in_mem.construct()

# save that dataset to a file
dataset_in_mem.save_binary('test.bin')

# create a new Dataset from that file
dataset_from_file = lgb.Dataset(data="test.bin")
dataset_from_file.construct()

# confirm that the Datasets are the same
print("--- X ---")
print(f"num rows: {X.shape[0]}")
print(f"num features: {X.shape[1]}")

print("--- in-memory dataset ---")
print(f"num rows: {dataset_in_mem.num_data()}")
print(f"num features: {dataset_in_mem.num_feature()}")

print("--- dataset from file ---")
print(f"num rows: {dataset_from_file.num_data()}")
print(f"num features: {dataset_from_file.num_feature()}")

# check that labels are the same
assert_equal(dataset_in_mem.label, y)
assert_equal(dataset_from_file.label, y)
This prints:

--- X ---
num rows: 569
num features: 30
--- in-memory dataset ---
num rows: 569
num features: 30
--- dataset from file ---
num rows: 569
num features: 30

Description

LightGBM training requires some pre-processing of raw data, such as binning continuous features into histograms and dropping features that are unsplittable. This pre-processing is done one time, in the "construction" of a LightGBM Dataset object.
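As a minimal sketch of that one-time step (the max_bin value below is just the library default, included only to illustrate that binning parameters belong to construction):

import lightgbm as lgb
import numpy as np

X = np.random.rand(100, 5)
y = np.random.rand(100)

# construction is lazy: the binning work happens when .construct()
# is called (or when the Dataset is first used for training)
dataset = lgb.Dataset(data=X, label=y, params={"max_bin": 255})
dataset.construct()  # triggers the one-time pre-processing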

In the Python package (lightgbm), it's common to create a Dataset from arrays in memory. If you then want to re-use that Dataset many times (for example, for hyperparameter tuning) without repeating the construction work, you can construct it once and save it to a file with .save_binary().

When you later want to re-create that Dataset, you can pass the filepath to the data argument of lgb.Dataset(), as shown in the sample code above.
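As an illustration of that re-use pattern, here is a minimal sketch (assuming the test.bin file created by the example above, which holds the breast-cancer data): each tuning iteration loads the already-constructed Dataset instead of re-binning the raw arrays.

import lightgbm as lgb

# try a few candidate learning rates, re-using the pre-constructed Dataset
for learning_rate in [0.05, 0.1, 0.2]:
    dataset = lgb.Dataset(data="test.bin")
    booster = lgb.train(
        params={
            "objective": "binary",  # the breast-cancer labels are 0/1
            "learning_rate": learning_rate,
            "verbosity": -1
        },
        train_set=dataset,
        num_boost_round=10
    )
    print(f"learning_rate={learning_rate}: {booster.num_trees()} trees")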

NOTE: The Dataset object stored to disk will not include your raw data. So, in the sample code above, dataset_from_file.data is None. This is done for efficiency: once LightGBM has created its own "constructed" representation of the training data, it no longer needs the raw data.
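Continuing from the sample code above, a quick way to confirm this:

# the raw feature matrix is not stored in (or restored from) the binary file
print(dataset_from_file.data)  # None

# but the constructed Dataset is still fully usable for training
booster = lgb.train(
    params={"objective": "binary", "verbosity": -1},
    train_set=dataset_from_file,
    num_boost_round=5
)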

James Lamb