How can I do a train_test_split in sklearn but limit/specify the output according to a certain member of a column? Closed

Question

I am training a model to do weather data prediction. I found a method on GitHub that does it pretty well with models like SVM and SVC.

It uses a dataset that looks roughly like this (Dhaka is a city/station name):

            StationIndex  Station  Year  Month  Day  Rainfall  dayofyear
1970-01-01             1    Dhaka  1970      1    1         0          1
1970-01-02             1    Dhaka  1970      1    2         0          2
1970-01-03             1    Dhaka  1970      1    3         0          3
1970-01-04             1    Dhaka  1970      1    4         0          4
1970-01-05             1    Dhaka  1970      1    5         0          5

There are about 3 million rows in the whole dataset, covering about 35 different values of 'Station'. The author of the code uses the snippet below to build the train and test sets. All of the train and test data comes only from the rows where Station is Dhaka, with Year <= 2015 for train and Year == 2016 for test.

# Split by year: everything up to 2015 is train, 2016 is test
train = df.loc[df['Year'] <= 2015]
test = df.loc[df['Year'] == 2016]

# Keep only the rows for the Dhaka station
train = train[train['Station'] == 'Dhaka']
test = test[test['Station'] == 'Dhaka']

X_train = train.drop(['Station', 'StationIndex', 'dayofyear'], axis=1)
Y_train = train['Rainfall']

X_test = test.drop(['Station', 'StationIndex', 'dayofyear'], axis=1)
Y_test = test['Rainfall']

Is there a way to do the same using `from sklearn.model_selection import train_test_split`, so that I can limit the entries to only those with a specific station name or year? Did I explain that clearly enough? Sorry for my bad English, and thanks in advance.
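For reference, the pandas split described in the question can be wrapped in a small helper. This is only a sketch under assumptions: the function name `split_by_station_and_year`, the toy DataFrame, and the second station name "Sylhet" are made up for illustration. Unlike the quoted snippet, it also drops `Rainfall` from the feature columns, on the assumption that the target should not appear among the features.

```python
import pandas as pd


def split_by_station_and_year(df, station, last_train_year, test_year,
                              target="Rainfall",
                              drop_cols=("Station", "StationIndex", "dayofyear")):
    """Deterministic split for one station.

    Train rows have Year <= last_train_year, test rows have Year == test_year.
    Hypothetical helper, not part of pandas or scikit-learn.
    """
    subset = df[df["Station"] == station]
    train = subset[subset["Year"] <= last_train_year]
    test = subset[subset["Year"] == test_year]

    # Assumption: the target is removed from the features as well,
    # which the snippet in the question does not do.
    cols = list(drop_cols) + [target]
    return (train.drop(columns=cols), test.drop(columns=cols),
            train[target], test[target])


# Toy data with the same column names as the question (values are invented).
toy = pd.DataFrame({
    "StationIndex": [1, 1, 1, 2, 2, 2],
    "Station": ["Dhaka", "Dhaka", "Dhaka", "Sylhet", "Sylhet", "Sylhet"],
    "Year": [2014, 2015, 2016, 2014, 2015, 2016],
    "Month": [1] * 6,
    "Day": [1] * 6,
    "Rainfall": [0, 3, 5, 1, 2, 4],
    "dayofyear": [1] * 6,
})

X_train, X_test, Y_train, Y_test = split_by_station_and_year(
    toy, station="Dhaka", last_train_year=2015, test_year=2016
)
```

Changing `station`, `last_train_year`, or `test_year` selects a different slice without touching the rest of the training code.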

`train_test_split` is meant for splitting data into random training and test subsets. In the code you showed, the split is not random but based on the year. What are you trying to accomplish: random subsets, or a split by year? — Bossipo, Aug 15 '20 at 12:21
I want to do it according to the year, station, or month. For example, if I specify month '3', then all the rows in the split should have month == '3'. I was wondering if I can do something similar with train_test_split. — CatVI, Aug 15 '20 at 12:31
Or should I just stick with the code above because it is easier? I wanted to convert that code to train_test_split since I have been seeing it used everywhere. — CatVI, Aug 15 '20 at 12:41
If you want to split according to some variable, I'm afraid the answer is no, you cannot do it with `train_test_split`. As [the documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) says, `train_test_split` is for splitting into random train and test subsets. If you want to split according to a variable, pandas is the best option. — Bossipo, Aug 15 '20 at 12:43
I see. Thank you for the answer, that cleared up a lot of confusion. I am new to Python and machine learning in general and am learning as I do the exercises. It's just that some books in my country could have been more ... precise. In any case, thanks. — CatVI, Aug 15 '20 at 12:46
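A note on combining the two approaches: `train_test_split` has no parameter for filtering rows, but nothing prevents filtering with pandas first and then passing the filtered subset to it. The sketch below assumes a random split within one station and one month is acceptable; the toy DataFrame and the station name "Sylhet" are invented for illustration. It also shows `shuffle=False`, which keeps the row order and puts the last rows into the test set, so it only approximates a split by date when the rows are already sorted.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (invented values); column names follow the question.
toy = pd.DataFrame({
    "Station": ["Dhaka"] * 8 + ["Sylhet"] * 4,
    "Month": [3, 3, 3, 3, 3, 3, 4, 4, 3, 3, 4, 4],
    "Day": [1, 2, 3, 4, 5, 6, 1, 2, 1, 2, 1, 2],
    "Rainfall": [0, 1, 0, 2, 5, 0, 3, 1, 4, 0, 2, 6],
})

# Step 1: restrict the rows with a pandas boolean mask.
subset = toy[(toy["Station"] == "Dhaka") & (toy["Month"] == 3)]
X = subset.drop(columns=["Station", "Rainfall"])
y = subset["Rainfall"]

# Step 2: random split, but only within the filtered rows.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.5, random_state=0
)

# Alternative: no shuffling, so the last two rows (in their current order)
# become the test set.
X_tr_seq, X_te_seq, y_tr_seq, y_te_seq = train_test_split(
    X, y, test_size=2, shuffle=False
)
```

For a split defined by an exact condition such as Year == 2016, plain pandas masks as in the question remain the more direct tool.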
