I want to split my dataset into training and test datasets based on years. The idea is to put the rows with years ranging form 2009-2017 in train dataset and the 2018 data in test dataset. Splitting the datasets was easy for the most part but my models are throwing a lot of indexing issues.
X = ((df[df['Year'] < 2018]))
X_train = np.array(X.drop(['Usage'], 1))
X_test = np.array(X['Usage'])
y =((df[df['Year'] > 2017]))
y_train = np.array(y.drop(['Usage'], 1))
y_test = np.array(y['Usage'])
This is how I plan on splitting the data. The usage column is my forecast column and contains continuous values. Applying a simple RandomForestRegressor() gave me this error in return
ValueError: Number of labels=14495 does not match number of samples=382772
aditya my regressor model was pretty basic but i'm attaching the code any way. the columns being passed in X are as follows: X= [Cust_Id', 'Usage', 'Plan_Group', 'Contract_Type', 'Cust_Status','Premise_Zip', 'Year', 'Month']
model = RandomForestRegressor()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# evaluate predictions
print(model.score(X_test, y_test))
# accuracy = accuracy_score(y_test, (y_pred < 0.5).astype(int))