Convert Python xgboost DMatrix to numpy ndarray or pandas DataFrame

Question

I'm following an xgboost example from their main repository: https://github.com/dmlc/xgboost/blob/master/demo/guide-python/basic_walkthrough.py#L64

In this example they read the files directly into a DMatrix:

dtrain = xgb.DMatrix('../data/agaricus.txt.train')
dtest = xgb.DMatrix('../data/agaricus.txt.test')

I looked at the DMatrix code, and there seems to be no way to take a quick look at how the data is structured, as we normally do in pandas with pandas.DataFrame.head().

The xgboost documentation mentions that we can convert a numpy.ndarray to an xgboost.DMatrix. Can we somehow convert it back, from xgboost.DMatrix to numpy.ndarray or perhaps a pandas DataFrame? I don't see a way to do this in their code, but perhaps someone knows one?

Or is there a way to take a quick look at what the data in an xgboost.DMatrix looks like?

Thanks in advance, Howard

This is possible with the dmatrix2np package; you can see the code here: github.com/aporia-ai/dmatrix2np — Nimrod Carmel, Sep 02 '21 at 16:53

score 6 · Answer 1 · answered Nov 04 '16 at 19:20

To elaborate on @jcaine's answer, you can use sklearn to load the files, then convert them to ordinary numpy arrays:

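from sklearn.datasets import load_svmlight_file

# load_svmlight_file returns a tuple (X, y): X is a scipy.sparse CSR matrix, y is a numpy array of labels
train_data = load_svmlight_file('demo/data/agaricus.txt.train')
X = train_data[0].toarray()  # dense numpy array of features
y = train_data[1]            # labels

I haven't found a way to convert directly from a DMatrix to numpy arrays yet.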
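Since the question asks for something like pandas.DataFrame.head(), here is a minimal sketch that wraps the loaded data in a DataFrame. The function name svmlight_to_dataframe and the "f0", "f1", ... column names are made up for illustration. Passing zero_based=True is a choice made here so that column numbers match the indices written in the file; whether that agrees with how xgboost numbers the features is an assumption worth checking. Note that entries absent from the file show up as 0.0 after toarray().

```python
import pandas as pd
from sklearn.datasets import load_svmlight_file


def svmlight_to_dataframe(path):
    """Load a libsvm/svmlight text file into a dense pandas DataFrame."""
    # zero_based=True keeps the feature indices exactly as written in the file.
    X, y = load_svmlight_file(path, zero_based=True)
    # Entries not present in the file become 0.0 in the dense array.
    columns = [f"f{i}" for i in range(X.shape[1])]
    df = pd.DataFrame(X.toarray(), columns=columns)
    # Put the label in the first column so it is visible in head().
    df.insert(0, "label", y)
    return df


# Usage (path taken from the answer above):
# df = svmlight_to_dataframe('demo/data/agaricus.txt.train')
# print(df.head())
```

This goes through the file rather than through the DMatrix object, so it only helps when the original libsvm file is still available.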

score 2 · Answer 2 · answered Jun 01 '16 at 16:36

Howard,

I believe that xgb.DMatrix assumes the libsvm data format when it is given a file path. You can load this data into a sparse CSR matrix using scikit-learn's load_svmlight_file: http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_svmlight_file.html

You can then partition the response variable and the features using the example at the bottom of the page.
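For a quick look at the raw data without any extra libraries, it helps to know that a libsvm file has one row per line: the label first, then space-separated index:value pairs for the features that are present. The sketch below parses the first few lines in plain Python. The helper names parse_libsvm_line and peek_libsvm and the sample lines are made up for illustration, and the parser ignores optional extras such as comments or qid fields.

```python
def parse_libsvm_line(line):
    """Parse one 'label idx:val idx:val ...' line into (label, {idx: val})."""
    parts = line.split()
    label = float(parts[0])
    features = {}
    for token in parts[1:]:
        index, value = token.split(":", 1)
        features[int(index)] = float(value)
    return label, features


def peek_libsvm(lines, n=5):
    """Return the first n non-empty rows from an iterable of libsvm lines."""
    rows = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        rows.append(parse_libsvm_line(line))
        if len(rows) >= n:
            break
    return rows


# Illustrative sample lines, not taken from the agaricus files.
sample = ["1 3:1 10:1 11:1", "0 1:1 9:0.5"]
rows = peek_libsvm(sample, n=2)
for label, features in rows:
    print(label, features)

# With a real file, pass the open file object instead:
# with open('../data/agaricus.txt.train') as f:
#     print(peek_libsvm(f))
```

Features that do not appear on a line are simply absent from the dictionary, which mirrors how the sparse format leaves them out.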

score 0 · Answer 3 · answered Jan 08 '23 at 20:46

The package dmatrix2np should do exactly that. From their docs:

from dmatrix2np import dmatrix_to_numpy

converted_np_array = dmatrix_to_numpy(dmatrix)

If you don't have missing values, then I think the following should also work:

dmatrix.get_data().toarray()

The issue with missing values is that they will be treated as zeros instead of missing when you do that.
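One possible workaround for the zeros-versus-missing issue is to start from an array full of NaN and copy in only the entries that the sparse matrix actually stores. This is a sketch under two assumptions: that get_data() returns a scipy.sparse CSR matrix, and that entries not stored in it correspond to missing values in the DMatrix. Both are worth verifying for your xgboost version. The helper name csr_to_dense_with_nan is made up for illustration, and the example uses a hand-built CSR matrix rather than a real DMatrix.

```python
import numpy as np
from scipy.sparse import csr_matrix


def csr_to_dense_with_nan(csr):
    """Densify a CSR matrix, using NaN (not 0) for entries that are not stored."""
    dense = np.full(csr.shape, np.nan, dtype=np.float64)
    coo = csr.tocoo()  # exposes row/column indices of the stored entries
    dense[coo.row, coo.col] = coo.data
    return dense


# Hand-built example: an explicitly stored 0.0 stays 0.0,
# while positions that are not stored become NaN.
data = np.array([1.0, 0.0, 3.0])
row_idx = np.array([0, 0, 1])
col_idx = np.array([0, 2, 1])
example = csr_matrix((data, (row_idx, col_idx)), shape=(2, 3))
dense = csr_to_dense_with_nan(example)
print(dense)

# Assumed usage with a DMatrix (not run here):
# dense = csr_to_dense_with_nan(dmatrix.get_data())
```

Building the NaN-filled array first keeps a genuine zero distinguishable from an absent entry, which plain toarray() cannot do.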
