I am training a model to predict weather data. I found a method on GitHub that does this fairly well using models such as SVM and SVC.
It uses a dataset that looks roughly like this (Dhaka is a city/station name):
            StationIndex  Station  Year  Month  Day  Rainfall  dayofyear
1970-01-01             1    Dhaka  1970      1    1         0          1
1970-01-02             1    Dhaka  1970      1    2         0          2
1970-01-03             1    Dhaka  1970      1    3         0          3
1970-01-04             1    Dhaka  1970      1    4         0          4
1970-01-05             1    Dhaka  1970      1    5         0          5
The whole dataset has about 3 million rows and about 35 distinct 'Station' values. The author of the code uses the following to define the train and test data. Both sets only contain rows where Station is Dhaka, with Year <= 2015 for training and Year == 2016 for testing.
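# Split by year: everything up to 2015 is training data, 2016 is test data.
train = df.loc[df['Year'] <= 2015]
test = df.loc[df['Year'] == 2016]

# Keep only the rows for the Dhaka station.
train = train[train['Station'] == 'Dhaka']
test = test[test['Station'] == 'Dhaka']

# Build the feature frames and the Rainfall target.
X_train = train.drop(['Station', 'StationIndex', 'dayofyear'], axis=1)
Y_train = train['Rainfall']
X_test = test.drop(['Station', 'StationIndex', 'dayofyear'], axis=1)
Y_test = test['Rainfall']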
Is there a way to do the same thing using train_test_split from sklearn.model_selection, so that I can limit the entries to only those with a specific station name or year? I hope I explained that clearly enough. Sorry for my bad English, and thanks in advance.
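To make the question concrete, here is a rough sketch of what I imagine it could look like. As far as I can tell, train_test_split has no argument for selecting rows by column value, so the sketch filters the DataFrame first and then splits the filtered rows. The small DataFrame, the 'Bogra' station name, the test_size of 0.25 and the random_state are made up for illustration. Dropping 'Rainfall' from the features is also my own assumption, not something the original code does.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny made-up frame with the same columns as the dataset (values are illustrative only).
df = pd.DataFrame({
    "StationIndex": [1, 1, 1, 1, 2, 2],
    "Station": ["Dhaka", "Dhaka", "Dhaka", "Dhaka", "Bogra", "Bogra"],
    "Year": [2014, 2015, 2016, 2016, 2015, 2016],
    "Month": [1, 1, 1, 1, 1, 1],
    "Day": [1, 2, 3, 4, 1, 2],
    "Rainfall": [0, 3, 0, 5, 1, 0],
    "dayofyear": [1, 2, 3, 4, 1, 2],
})

# Filter first: keep only Dhaka rows up to and including 2016.
subset = df[(df["Station"] == "Dhaka") & (df["Year"] <= 2016)]

# Same drops as the original code, plus Rainfall (my assumption),
# so that the target is not also one of the features.
X = subset.drop(["Station", "StationIndex", "dayofyear", "Rainfall"], axis=1)
y = subset["Rainfall"]

# Random split of the filtered rows. This is NOT a split by year.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
```

Note that this splits the filtered rows at random, so it is not equivalent to the year-based split above. If the test set must be exactly the 2016 rows, the boolean masks on 'Year' still seem to be needed.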