I am building an isolation forest with scikit-learn in Python and am running into the error 'ValueError: Number of features of the model must match the input. Model n_features is 24 and input n_features is 1.' I am trying to predict 'pages' from various size features. The actual data set I am working with has 300 rows and 14 columns: the first and second are irrelevant, then there are 10 'size' columns, then the 'pages' column, so I am not sure why it says the input has 1 feature. My code is below. Thank you!

X = df.iloc[:, 2:12].values
y = df['pages']
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=0)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
from sklearn.ensemble import IsolationForest
model = IsolationForest(n_estimators = 100, max_samples = 'auto', contamination = 'auto')
model.fit(df[['pages']])
df['anomaly']=model.fit_predict(df[['size2','size3','size4', 'size5','size6','size7','size8','size9','size10','size11']])
df['anomaly']= model.predict(df[['pages']])
print(model.predict(X_test))
print(df.head(10))
anomaly = df.loc[df['anomaly']==-1]
anomaly_index = list(anomaly.index)
print(anomaly)

1 Answer

First, are you aware that you cannot use IsolationForest for your objective to "predict pages from various size features"? It is an unsupervised anomaly detector: as the docs state, it returns the anomaly score of each sample and always ignores any target that is passed to the model.
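
To illustrate that the target is ignored, here is a minimal sketch on synthetic data (the arrays `X` and `y` are made-up stand-ins for the size features and 'pages', not the asker's data). Two forests with the same seed are fitted, one with `y` and one without, and their labels are compared:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))        # synthetic stand-in for the 10 size columns
y = rng.integers(1, 500, size=300)    # synthetic stand-in for 'pages'

# Same hyperparameters and seed; the only difference is whether y is passed.
with_y = IsolationForest(n_estimators=100, random_state=0).fit(X, y)
without_y = IsolationForest(n_estimators=100, random_state=0).fit(X)

labels_with_y = with_y.predict(X)
labels_without_y = without_y.predict(X)

# The labels agree because fit() accepts y only for API consistency.
print(np.array_equal(labels_with_y, labels_without_y))
# predict() only ever returns 1 (inlier) or -1 (outlier), never a 'pages' value.
print(np.unique(labels_with_y))
```

So whatever is passed as `y`, the output is an inlier/outlier label per row, not a prediction of the target.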

That being said, the error occurs because of the following lines:

df['anomaly'] = model.fit_predict(df[['size2','size3','size4', 'size5','size6','size7','size8','size9','size10','size11']])

where you fit on a dataframe with 10 variables, and then:

df['anomaly']= model.predict(df[['pages']])

where you pass the original targets (only 1 column) to the predict function. I don't really understand the objective here, but this cannot work: predict expects the same features the model was fitted on.
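
A minimal sketch of the consistent version, assuming the ten feature columns are named `size2` through `size11` as in the question; the DataFrame here is synthetic, since the real data is not available. The model is fitted and queried on the same columns, and the single-column call is wrapped in `try`/`except` to show the mismatch:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for the asker's data: 10 size columns plus 'pages'.
size_cols = [f"size{i}" for i in range(2, 12)]
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(300, 10)), columns=size_cols)
df["pages"] = rng.integers(1, 500, size=300)

model = IsolationForest(n_estimators=100, max_samples="auto",
                        contamination="auto", random_state=0)

# Fit and predict on the same 10 feature columns.
model.fit(df[size_cols])
df["anomaly"] = model.predict(df[size_cols])   # 1 = inlier, -1 = outlier

# Predicting on a single column reproduces the feature-count mismatch.
try:
    model.predict(df[["pages"]])
    mismatch_raised = False
except ValueError as exc:
    mismatch_raised = True
    print(exc)
```

The exact wording of the error message depends on the scikit-learn version, but the cause is the same: the input to `predict` does not have the features seen during `fit`.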

afsharov
Right, I am using an isolation forest to detect the anomalies in a given data set, and this data set is also used to build a random forest classifier that predicts 'pages' from size. I am trying to make a model I can feed data into and have it return the anomalies. Sorry for the confusion! – Jun 12 '20 at 15:26
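
One way the workflow described in this comment could look, as a rough sketch only: the helper names `fit_detector` and `find_anomalies` are hypothetical, the column names are assumed from the question, and the data is synthetic. The detector is fitted once on the feature columns and then returns the rows of any new data that it labels as outliers:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

size_cols = [f"size{i}" for i in range(2, 12)]   # assumed column names

def fit_detector(train_df, feature_cols, random_state=0):
    """Fit an IsolationForest on the given feature columns (hypothetical helper)."""
    model = IsolationForest(n_estimators=100, contamination="auto",
                            random_state=random_state)
    model.fit(train_df[feature_cols])
    return model

def find_anomalies(model, new_df, feature_cols):
    """Return the rows of new_df that the model labels as outliers (-1)."""
    labels = model.predict(new_df[feature_cols])
    return new_df.loc[labels == -1]

# Synthetic training data and a small batch of new data.
rng = np.random.default_rng(0)
train_df = pd.DataFrame(rng.normal(size=(300, 10)), columns=size_cols)
new_df = pd.DataFrame(rng.normal(size=(20, 10)), columns=size_cols)
new_df.iloc[0] = 25.0   # one deliberately extreme row, far outside the training range

detector = fit_detector(train_df, size_cols)
anomalies = find_anomalies(detector, new_df, size_cols)
print(anomalies.index.tolist())
```

The random forest that predicts 'pages' would be a separate model trained on the same feature columns; the isolation forest never sees or needs the 'pages' column.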