
Python's implementation of XGBClassifier does not accept the characters [, ] or < in feature names.

If that occurs, it raises the following:

ValueError('feature_names may not contain [, ] or <')

The obvious solution would be to pass the equivalent NumPy arrays and get rid of the column names altogether, but if the developers haven't done that, there must be a reason.

What use does XGBoost have for the feature names, and what is the downside of simply passing it NumPy arrays instead of Pandas DataFrames?

Edit: this is not a question about workarounds (those are obvious and stated in the question), but about why it is implemented this way.

sapo_cosmico
  • How did you manage to face that problem ? :] Only comment I see by them is: `# prohibit to use symbols may affect to parse. e.g. []<`. You can also just delete the headers from your pandas DF – Eran Moshe Feb 07 '18 at 09:39
  • I solved it by using a `.values` in the fit, predict, predict_proba, etc. In fact I created a wrapper that does that so I can keep the interface and pass Pandas DataFrames at will. However I'm wondering what I'm missing by not using Pandas. They muse use the column names for something, right? – sapo_cosmico Feb 07 '18 at 12:06
  • If you referring to xgboost, I don't think so. When I'm dumping a model to a .txt file they rename the "so called headers" to numbers. So the splits in the trees is like: if f[1] < 0.05 then ... (no strings. only integers) – Eran Moshe Feb 07 '18 at 12:13
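The wrapper mentioned in the comments above might look roughly like this (a hypothetical sketch, not sapo_cosmico's actual code): it converts Pandas objects to NumPy arrays before delegating to the wrapped estimator, so bracket-containing column names never reach XGBoost.

```python
import numpy as np
import pandas as pd


class ArrayOnlyEstimator:
    """Hypothetical wrapper: strips Pandas metadata (including column
    names) before delegating to the wrapped estimator."""

    def __init__(self, estimator):
        self.estimator = estimator

    @staticmethod
    def _to_array(X):
        # .values drops the column names, so xgboost never sees [, ] or <
        if isinstance(X, (pd.DataFrame, pd.Series)):
            return X.values
        return np.asarray(X)

    def fit(self, X, y, **kwargs):
        self.estimator.fit(self._to_array(X), self._to_array(y), **kwargs)
        return self

    def predict(self, X):
        return self.estimator.predict(self._to_array(X))
```

Used e.g. as `ArrayOnlyEstimator(XGBClassifier()).fit(df[features], df['target'])`. The cost, as the comments above suggest, is that the feature names are lost, so model dumps and importance plots will show generated names like f0, f1, … instead of your column names.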

5 Answers


I know it's late, but I'm writing this answer for other folks who might face this issue. This error typically happens when your column names contain the symbols [, ] or <. Here is an example:

import pandas as pd
import numpy as np
from xgboost.sklearn import XGBRegressor

# test input data with string, int, and symbol-included columns 
df = pd.DataFrame({'0': np.random.randint(0, 2, size=100),
                   '[test1]': np.random.uniform(0, 1, size=100),
                   'test2': np.random.uniform(0, 1, size=100),
                   3: np.random.uniform(0, 1, size=100)})

target = df.iloc[:, 0]
predictors = df.iloc[:, 1:]

# basic xgb model
xgb0 = XGBRegressor(objective='reg:squarederror')  # 'reg:linear' is deprecated
xgb0.fit(predictors, target)

The code above will throw an error:

ValueError: feature_names may not contain [, ] or <

But if you remove those square brackets from '[test1]' then it works fine. Below is a generic way of replacing [, ] or < in your column names:

import re
import pandas as pd
import numpy as np
from xgboost.sklearn import XGBRegressor
regex = re.compile(r"\[|\]|<", re.IGNORECASE)

# test input data with string, int, and symbol-included columns 
df = pd.DataFrame({'0': np.random.randint(0, 2, size=100),
                   '[test1]': np.random.uniform(0, 1, size=100),
                   'test2': np.random.uniform(0, 1, size=100),
                   3: np.random.uniform(0, 1, size=100)})

df.columns = [regex.sub("_", col) if any(x in str(col) for x in set(('[', ']', '<')))
              else col for col in df.columns.values]

target = df.iloc[:, 0]
predictors = df.iloc[:, 1:]

# basic xgb model
xgb0 = XGBRegressor(objective='reg:squarederror')  # 'reg:linear' is deprecated
xgb0.fit(predictors, target)

For more, see the corresponding check in xgboost's core.py: xgboost/core.py. That is the check whose failure raises this error.
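The check in question can be sketched as follows (a simplified re-creation for illustration, not the actual xgboost source):

```python
def check_feature_names(feature_names):
    """Simplified sketch of xgboost's feature-name validation
    (not the actual core.py code)."""
    bad_chars = {'[', ']', '<'}
    if any(c in str(name) for name in feature_names for c in bad_chars):
        raise ValueError('feature_names may not contain [, ] or <')
    return feature_names
```

As the source comment quoted in the question's comments suggests, these symbols appear to be forbidden because the text dump of a trained model uses them itself (split conditions look like [f1<0.5]), so names containing them could confuse anything that parses the dump.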

Abhimanu Kumar

This is another regex solution.

import re

regex = re.compile(r"\[|\]|<", re.IGNORECASE)

X_train.columns = [regex.sub("_", col) if any(x in str(col) for x in set(('[', ']', '<')))
                   else col for col in X_train.columns.values]
Yaqi Li

Yet another solution:

X.columns = X.columns.str.translate("".maketrans({"[":"{", "]":"}","<":"^"}))

If you're interested in seeing which are the culprits:

X.columns[X.columns.str.contains(r"[\[\]<]")]

Gabi Lee

Just use to_numpy() to generate a NumPy array:

from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
clf = XGBClassifier(random_state=42)

# convert both to NumPy arrays here
clf.fit(X_train.to_numpy(), y_train.to_numpy())
JAdel

Here is the simplest solution:

Just use str.replace(pattern, replacement) on your data's columns. The pattern defines the symbols you want to change ([^a-zA-Z0-9] matches every character that is not a letter or digit), and the replacement is the symbol you want to put in their place.

Example:

X_train.columns = X_train.columns.str.replace('[^a-zA-Z0-9]', '_', regex=True)

It worked fine while fitting XGBRegressor models.

  • thanks, the question wasn't really looking for a work-around (a .values or .to_numpy does that), but rather about why they implemented it this way – sapo_cosmico Jul 26 '23 at 11:38