I'm following an xgboost example from their main GitHub repository: https://github.com/dmlc/xgboost/blob/master/demo/guide-python/basic_walkthrough.py#L64

In this example, they read the files directly into a DMatrix:

dtrain = xgb.DMatrix('../data/agaricus.txt.train')
dtest = xgb.DMatrix('../data/agaricus.txt.test')

I looked at the DMatrix code, and there seems to be no way to take a quick look at how the data is structured, as we normally do in pandas with pandas.DataFrame.head().

The xgboost documentation mentions that we can convert a numpy.ndarray to an xgboost.DMatrix. Can we somehow convert it back, from an xgboost.DMatrix to a numpy.ndarray or perhaps a pandas DataFrame? I don't see a way to do this in their code, but perhaps someone knows one?

Or is there a way to take a quick look at what the data in an xgboost.DMatrix looks like?

Thanks in advance, Howard

howard

3 Answers

To elaborate on @jcaine's answer, you can use sklearn to load the files, then convert them to ordinary numpy arrays:

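from sklearn.datasets import load_svmlight_file

# load_svmlight_file returns a tuple: (features as a scipy CSR matrix, labels as a 1-D array)
train_data = load_svmlight_file('demo/data/agaricus.txt.train')
X = train_data[0].toarray()  # densify the sparse features into a numpy.ndarray
y = train_data[1]            # labels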

I haven't found a way to convert directly from a DMatrix to numpy arrays yet.
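To get something closer to pandas.DataFrame.head(), the arrays loaded this way can be wrapped in a DataFrame. This is only a minimal sketch, assuming pandas is installed: the three-row sample, the temporary file, and the f0..f2 column names are made up for illustration and stand in for agaricus.txt.train. I pass zero_based=True here because the made-up indices start at 0; check which convention your own file uses.

```python
import os
import tempfile

import pandas as pd
from sklearn.datasets import load_svmlight_file

# Made-up stand-in for agaricus.txt.train: "<label> <index>:<value> ..." per line.
sample = "1 0:1 2:1\n0 1:1 2:1\n1 0:1 1:1\n"

with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as fh:
    fh.write(sample)
    path = fh.name

try:
    # zero_based=True: feature indices in the sample start at 0.
    X_sparse, y = load_svmlight_file(path, zero_based=True)
finally:
    os.remove(path)

# Densify the features and give the columns illustrative names.
columns = [f"f{i}" for i in range(X_sparse.shape[1])]
df = pd.DataFrame(X_sparse.toarray(), columns=columns)
df.insert(0, "label", y)

print(df.head())
```

Note that toarray() builds a dense array, so for a wide or very large file it may be safer to slice a few rows of the sparse matrix first (for example X_sparse[:5]) and densify only those.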

Peter

Howard,

I believe that xgb.DMatrix assumes the libsvm data format. You can get this data into a sparse CSR matrix using scikit-learn's load_svmlight_file: http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_svmlight_file.html

You can then partition the response variable and the features using the example at the bottom of the page.
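If you just want to see how a libsvm-style file is laid out, each line is a label followed by sparse index:value pairs. Here is a rough sketch of a parser using only the standard library; parse_libsvm_line is a hypothetical helper name and the two sample rows are made up, not copied from the agaricus files. It ignores optional extras such as comments or qid fields.

```python
def parse_libsvm_line(line):
    """Split one libsvm-format line into (label, {feature_index: value})."""
    label_text, *pairs = line.split()
    features = {}
    for pair in pairs:
        index_text, value_text = pair.split(":")
        features[int(index_text)] = float(value_text)
    return float(label_text), features


# Made-up rows in the "<label> <index>:<value> ..." layout.
sample_lines = [
    "1 3:1 10:1 11:1",
    "0 3:1 9:1 20:1",
]

rows = [parse_libsvm_line(line) for line in sample_lines]
for label, features in rows:
    print(label, features)
```

Any feature index that does not appear on a line is simply not stored for that row, which is why the data loads naturally into a sparse matrix.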

jcaine

The package dmatrix2np should do exactly that. From their docs:

from dmatrix2np import dmatrix_to_numpy

converted_np_array = dmatrix_to_numpy(dmatrix)

If you don't have missing values, then I think the following should also work:

dmatrix.get_data().toarray()

The issue with missing values is that this conversion treats them as zeros rather than as missing.
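One possible workaround is to fill a dense array with NaN first and then copy in only the entries that the sparse matrix actually stores. The sketch below uses a small hand-built scipy CSR matrix as a stand-in for the return value of dmatrix.get_data(); csr_to_dense_with_nan is a hypothetical helper name, and it assumes that every entry absent from the CSR matrix is a missing value rather than a true zero, which may not hold for your data.

```python
import numpy as np
from scipy import sparse


def csr_to_dense_with_nan(csr):
    """Densify a CSR matrix, marking entries that are not stored as NaN."""
    dense = np.full(csr.shape, np.nan)
    coo = csr.tocoo()
    dense[coo.row, coo.col] = coo.data
    return dense


# Hypothetical 2x3 stand-in for the matrix returned by dmatrix.get_data().
csr = sparse.csr_matrix(
    ([1.0, 2.0, 3.0], ([0, 0, 1], [0, 2, 1])), shape=(2, 3)
)

print(csr.toarray())               # unstored entries become 0.0
print(csr_to_dense_with_nan(csr))  # unstored entries stay NaN
```

With a real DMatrix you would pass dmatrix.get_data() in place of the hand-built csr.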

Yann Dubois