Splitting dataset for training and testing row wise

Question

I want to split my dataset into training and test sets based on year. The idea is to put the rows with years from 2009 to 2017 in the training set and the 2018 rows in the test set. Splitting the data was easy for the most part, but my models are throwing a lot of indexing errors.

X = ((df[df['Year'] < 2018]))
X_train = np.array(X.drop(['Usage'], 1))
X_test = np.array(X['Usage'])
y =((df[df['Year'] > 2017]))
y_train = np.array(y.drop(['Usage'], 1))
y_test = np.array(y['Usage'])

This is how I plan on splitting the data. The Usage column is my forecast column and contains continuous values. Applying a simple RandomForestRegressor() gave me this error:

ValueError: Number of labels=14495 does not match number of samples=382772

@Aditya my regressor model is pretty basic, but I'm attaching the code anyway. The columns in the original DataFrame are as follows: ['Cust_Id', 'Usage', 'Plan_Group', 'Contract_Type', 'Cust_Status', 'Premise_Zip', 'Year', 'Month']

model = RandomForestRegressor()

model.fit(X_train, y_train)

y_pred = model.predict(X_test)
# evaluate predictions
print(model.score(X_test, y_test))
# accuracy = accuracy_score(y_test, (y_pred < 0.5).astype(int))

I want to split the dataset for training and testing by rows. Usually you define a column as your forecast column and pass it in y_test and y_train. In this case, though, I want to split by rows. My goal is to predict the 'Usage' values for 2018 by training the model on 2009-2017 data. – Fareen Walani, Oct 10 '18 at 08:45
@FareenWalani please add a little more code, maybe the lines where you call RandomForestRegressor(). I found the mistake you're making here, but let's do this the proper Stack Overflow way. I'm interested in seeing how you are using the parameters, and in getting a gist of what kind of data X contains from the start. Your understanding of X and y is a little off, and that's what is causing the problem. – Aditya, Oct 10 '18 at 09:16

score 0 · Accepted Answer · answered Oct 10 '18 at 09:42

Most algorithms in the sklearn stack follow a standard notation. X, capital letter, is a 2-D array (even if there is only one feature) in which each row is one data point in vector form. y, small letter, is a 1-D vector with one entry per data point, holding the labels: a class label for classification, or the target value for regression. X and y must have the same number of rows.
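To make the shape convention concrete, here is a minimal sketch on made-up toy data (the variable `feature` and its values are purely illustrative, not from the question). It also shows that fitting with a different number of rows in X and y raises a ValueError, which is the same kind of mismatch reported in the question.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical toy data: 5 samples with a single feature
feature = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# X must be 2-D with shape (n_samples, n_features), even for one feature
X = feature.reshape(-1, 1)
# y is 1-D with shape (n_samples,)
y = 2.0 * feature

print(X.shape, y.shape)

# Passing fewer labels than samples is rejected by fit()
try:
    RandomForestRegressor(n_estimators=5, random_state=0).fit(X, y[:3])
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc

print(type(mismatch_error).__name__)
```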

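You built both X and y as DataFrames filtered by Year, so X ended up holding the 2009-2017 rows and y the 2018 rows. That is why the number of labels did not match the number of samples. Instead, separate the features (X) from the target (y) first, then split each of them by year into a train part and a test part.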

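# Features: every column except the target
X = df.drop(columns=['Usage'])
# Target: the column to predict
y = df['Usage']
# Row masks: 2009-2017 for training, 2018 for testing
train_mask = df['Year'] < 2018
test_mask = df['Year'] > 2017
X_train, X_test = X[train_mask], X[test_mask]
y_train, y_test = y[train_mask], y[test_mask]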

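Then you train on X_train and y_train, and predict on X_test:

from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()
model.fit(X_train, y_train)      # features and target from 2009-2017
y_pred = model.predict(X_test)   # predicted Usage for the 2018 rows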

This is not the best way though. Will come back to edit the answer but this should be enough to get you going for now.
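For anyone who wants to try the year-based split end to end, here is a self-contained sketch. The DataFrame is synthetic and only borrows a few column names from the question; the values, the `n_estimators` setting, and the way Usage is generated are assumptions for illustration, not the asker's real data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical synthetic data standing in for the real dataset
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "Year": rng.integers(2009, 2019, size=n),    # years 2009..2018
    "Month": rng.integers(1, 13, size=n),        # months 1..12
    "Premise_Zip": rng.integers(75000, 75100, size=n),
})
df["Usage"] = 100 + 5 * df["Month"] + rng.normal(0, 1, size=n)

# Separate features from target first
X = df.drop(columns=["Usage"])
y = df["Usage"]

# Then split rows by year
train_mask = df["Year"] < 2018
test_mask = df["Year"] > 2017
X_train, y_train = X[train_mask], y[train_mask]
X_test, y_test = X[test_mask], y[test_mask]

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
```

Because the same mask is applied to X and y, the train and test pieces always have matching row counts, and the target column never appears among the features.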

Aditya

Incredible! This did wonders for me. Can you explain what I was doing wrong here, though, and what you did right? I kind of get it, but I kind of don't. – Fareen Walani Oct 10 '18 at 09:56
I understand. You are probably following a tutorial which is not well written. Remember: X is a matrix, y is a vector. X holds the data features, y holds the labels. X is the input, y is the output. And any model.fit() in sklearn takes two parameters, the input and the output. clf.fit(X, y) is the standard way. – Aditya Oct 10 '18 at 10:00
Another Stack Overflow thing: if you like the answer, remember to upvote it. If the answer is the best one in the thread, accept it (which you already did). Both are important, as they help maintain the quality of content on Stack Overflow. Upvoting and accepting are the best ways to say thanks and keep us motivated to spend time out of our workday helping peers like you. – Aditya Oct 10 '18 at 10:02
I understand and will do just that. One more thing, though: I applied both model.score and accuracy_score here. model.score is giving me fairly decent results, but accuracy_score is 0.0. Am I doing something wrong, or does the model not support accuracy_score? – Fareen Walani Oct 10 '18 at 10:07
For regression you need a regression metric such as the R^2 score, and model.score(X_test, y_test) returns exactly that. I don't see how accuracy_score() fits here. How are you calling it? – Aditya Oct 10 '18 at 10:12
Yeah, I just read the documentation, my bad. Thank you for the help! – Fareen Walani Oct 10 '18 at 10:16
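As a follow-up to the scoring question above, here is a small sketch on synthetic data (the data and the `n_estimators` setting are assumptions for illustration). It shows that a regressor's score() is the R^2 value that r2_score computes from the predictions, and that mean_absolute_error is another metric suited to continuous targets. accuracy_score is intended for classification labels.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score

# Hypothetical synthetic regression data
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.5, size=200)

X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

score = model.score(X_test, y_test)        # R^2 on the test set
r2 = r2_score(y_test, y_pred)              # same quantity, from predictions
mae = mean_absolute_error(y_test, y_pred)  # average absolute error

print(score, r2, mae)
```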
