
I'm trying to train an XGBoost model that also has categorical variables. I'd like to avoid one-hot encoding, and I saw that this is now possible using enable_categorical=True. I formatted my dataframe, but when I try to generate the DMatrix I get the error below. I've also attached a very simple example that reproduces the error.

import xgboost as xgb
import numpy as np
import pandas as pd

test = pd.DataFrame({'out': ["a","b"],'features': [np.array(["house","horse","something", "NA" ]), np.array(["house","NA","NA", "NA" ]) ]})

X_train = test['features'].to_json()
y_train = test['out'].to_json()

xgb.DMatrix(X_train, label=y_train)

Then I get this warning/error:

[14:32:11] WARNING: ../src/data/data.cc:868: No format parameter is provided in input uri.  Choosing default parser in dmlc-core.  Consider providing a uri parameter like: filename?format=csv
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.8/site-packages/xgboost/core.py", line 620, in inner_f
    return func(**kwargs)
  File "/usr/local/lib/python3.8/site-packages/xgboost/core.py", line 743, in __init__
    handle, feature_names, feature_types = dispatch_data_backend(
  File "/usr/local/lib/python3.8/site-packages/xgboost/data.py", line 964, in dispatch_data_backend
    return _from_uri(data, missing, feature_names, feature_types)
  File "/usr/local/lib/python3.8/site-packages/xgboost/data.py", line 880, in _from_uri
    _check_call(_LIB.XGDMatrixCreateFromFile(c_str(data),
  File "/usr/local/lib/python3.8/site-packages/xgboost/core.py", line 279, in _check_call
    raise XGBoostError(py_str(_LIB.XGBGetLastError()))
xgboost.core.XGBoostError: [14:32:11] ../src/data/data.cc:874: Encountered parser error:
[14:32:11] ../dmlc-core/src/io/local_filesys.cc:86: LocalFileSystem.GetPathInfo: {"0":["house","horse","something","NA"],"1":["house","NA","NA","NA"]} error: No such file or directory
Stack trace:
  [bt] (0) /usr/local/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x83a293) [0x7f01eb5ea293]
  [bt] (1) /usr/local/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x83c13c) [0x7f01eb5ec13c]
  [bt] (2) /usr/local/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x822929) [0x7f01eb5d2929]
  [bt] (3) /usr/local/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x822e1e) [0x7f01eb5d2e1e]
  [bt] (4) /usr/local/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x812ca6) [0x7f01eb5c2ca6]
  [bt] (5) /usr/local/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x81312e) [0x7f01eb5c312e]
  [bt] (6) /usr/local/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x7f2210) [0x7f01eb5a2210]
  [bt] (7) /usr/local/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x7d4141) [0x7f01eb584141]
  [bt] (8) /usr/local/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x214294) [0x7f01eafc4294]


Stack trace:
  [bt] (0) /usr/local/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x20b233) [0x7f01eafbb233]
  [bt] (1) /usr/local/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0xfc7ad) [0x7f01eaeac7ad]
  [bt] (2) /usr/local/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(XGDMatrixCreateFromFile+0xdf) [0x7f01eaef762f]
  [bt] (3) /usr/lib/x86_64-linux-gnu/libffi.so.7(+0x6d1d) [0x7f0248f58d1d]
  [bt] (4) /usr/lib/x86_64-linux-gnu/libffi.so.7(+0x6289) [0x7f0248f58289]
  [bt] (5) /usr/local/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(_ctypes_callproc+0x336) [0x7f0248f75477]
  [bt] (6) /usr/local/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(+0xae9e) [0x7f0248f70e9e]
  [bt] (7) /usr/local/bin/../lib/libpython3.8.so.1.0(_PyObject_MakeTpCall+0x87) [0x7f024f004437]
  [bt] (8) /usr/local/bin/../lib/libpython3.8.so.1.0(_PyEval_EvalFrameDefault+0x41f7) [0x7f024f031d07]

Does anyone have suggestions on how it can be solved? Is the format ok?

-----
numpy               1.23.5
pandas              2.0.2
xgboost             1.7.5
-----
Python 3.8.16 (default, May 23 2023, 14:26:40) [GCC 10.2.1 20210110]
Linux-5.10.104-linuxkit-x86_64-with-glibc2.2.5
-----

EDIT: I didn't give much context in my original question, but I'd prefer not to split the features into separate columns because of the data itself. The features are not ordered, so the same feature could appear in column 1 in some rows and in column N in others. I imagined this problem could be overcome if all the features were part of the same array. Could something similar be achieved for categorical values? I tried X_train = np.vstack(test['features'].apply(lambda x: x.astype('category'))) but then I get the error ValueError: could not convert string to float: 'chain' when building the DMatrix. Is training on an array achievable?

Lu_Ste

1 Answer


The data parameter can take one of the following:

  • os.PathLike
  • string (path to a file)
  • numpy.array
  • scipy.sparse
  • pd.DataFrame
  • dt.Frame
  • cudf.DataFrame
  • cupy.array
  • dlpack

but not a JSON string.
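
As a rough illustration (reusing the test frame from the question), to_json() returns a plain Python string, and a string passed as data is interpreted as a file path, which is why the traceback ends in "No such file or directory". Stacking the arrays instead gives a 2-D NumPy array, one of the accepted types, although its string values would still need encoding before XGBoost can use them:

```python
import numpy as np
import pandas as pd

# Same toy frame as in the question
test = pd.DataFrame({
    "out": ["a", "b"],
    "features": [
        np.array(["house", "horse", "something", "NA"]),
        np.array(["house", "NA", "NA", "NA"]),
    ],
})

# to_json() serialises the column to text; it is no longer tabular data
X_json = test["features"].to_json()
print(type(X_json).__name__)  # str, so DMatrix would treat it as a path

# Stacking the per-row arrays keeps the data as a 2-D array instead
X_array = np.vstack(test["features"].to_list())
print(X_array.shape)  # (2, 4)
```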

So you have to pass a DataFrame or a NumPy array instead. The values must also be numeric (or of category dtype), so convert them first. Try something like:

X_train = (pd.DataFrame(np.vstack(test['features']))
             .replace('NA', np.nan)
             .add_prefix('feat_')
             .apply(lambda x: pd.factorize(x)[0]))
y_train = pd.factorize(test['out'])[0]

dmat = xgb.DMatrix(X_train, label=y_train)

Output:

>>> dmat
<xgboost.core.DMatrix at 0x7f225b6483d0>

>>> X_train
   feat_0  feat_1  feat_2  feat_3
0       0       0       0      -1
1       0      -1      -1      -1

>>> y_train
array([0, 1])
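
One caveat with the per-column pd.factorize above: the same integer can stand for different strings in different columns (here 0 means "house" in feat_0 but "horse" in feat_1). A possible variation, sketched below under the assumption that positional columns are kept, gives every column one shared category dtype so the codes agree across columns. The names stacked, vocab, shared and X_cat are illustrative only:

```python
import numpy as np
import pandas as pd

# Same toy frame as in the question
test = pd.DataFrame({
    "out": ["a", "b"],
    "features": [
        np.array(["house", "horse", "something", "NA"]),
        np.array(["house", "NA", "NA", "NA"]),
    ],
})

stacked = np.vstack(test["features"].to_list())

# One vocabulary for all columns, excluding the "NA" placeholder
vocab = sorted(set(stacked.ravel().tolist()) - {"NA"})
shared = pd.CategoricalDtype(categories=vocab)

wide = pd.DataFrame(stacked).add_prefix("feat_")
wide = wide.mask(wide == "NA")   # turn the "NA" placeholder into real NaN
X_cat = wide.astype(shared)      # every column uses the same categories

print(X_cat.dtypes)
print(X_cat.apply(lambda col: col.cat.codes))  # -1 marks missing values
```

A frame like X_cat is the kind of input that xgb.DMatrix(X_cat, label=y_train, enable_categorical=True) is meant for. In XGBoost 1.7 this support is still marked experimental, so it is worth checking the documentation for which tree methods accept it.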

EDIT:

"I'd prefer not to split features into different columns because of the data itself. The reason for this is that the features are not ordered, so I could have the same feature in column1 at times and columnN in other cases."

XGBoost is a decision-tree-based algorithm, so each feature needs its own column. Instead of pd.factorize, you can use pd.get_dummies, which creates one indicator column per feature value:

X_train = (pd.get_dummies(test['features'].explode()
             .loc[lambda x: x != 'NA']).astype(int)
             .groupby(level=0).max())
print(X_train)

# Output
   horse  house  something
0      1      1          1
1      0      1          0
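
To check the claim that feature order does not matter with this encoding, here is a small sketch on made-up rows (the features series below is hypothetical, not the asker's data). Rows 0 and 1 hold the same features in a different order and end up with identical indicator rows. A row containing only "NA" disappears after the filter, so the sketch uses reindex to bring it back as all zeros:

```python
import pandas as pd

# Hypothetical rows: 0 and 1 contain the same features in a different order
features = pd.Series([
    ["house", "horse", "NA"],
    ["horse", "NA", "house"],
    ["something", "NA", "NA"],
    ["NA", "NA", "NA"],
])

exploded = features.explode()
exploded = exploded[exploded != "NA"]   # drop the placeholder entries

X = (pd.get_dummies(exploded).astype(int)
       .groupby(level=0).max()                   # one row per original row
       .reindex(features.index, fill_value=0))   # restore all-"NA" rows

print(X)
print(X.loc[0].tolist() == X.loc[1].tolist())    # True: order is irrelevant
```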
Corralien
  • Thanks, Corralien. Your code worked. I guess I didn't give much context in the original question. I edited it to add some more info on why I'd need to keep the features column in an array-like structure. Would you be able to help? – Lu_Ste Jun 06 '23 at 16:18
  • Or if XGBoost cannot cope with lists, could you suggest any other algorithm? – Lu_Ste Jun 06 '23 at 16:53
  • I updated my answer in response to your edit. If you use `pd.get_dummies`, your dataframe will always be consistent, regardless of the order of your features. Let me know if it worked (or not) :) – Corralien Jun 06 '23 at 18:57