I'm relatively new to sklearn and have a question about using train_test_split
from sklearn.model_selection. I have a large dataframe with shape (96350, 156). It has a column named CountryName
that contains 160 countries, with about 600 instances per country.
Input:
df['CountryName'].unique()
Output:
array(['Aruba', 'Afghanistan', 'Angola', 'Albania', 'Andorra',
'United Arab Emirates', 'Argentina', 'Australia', 'Austria',
'Azerbaijan', 'Belgium', 'Benin', 'Burkina Faso', 'Bangladesh',
'Bulgaria', 'Bahrain', 'Bahamas', 'Bosnia and Herzegovina',
...
'Slovenia', 'Sweden', 'Eswatini', 'Seychelles', 'Chad', 'Togo',
'Thailand', 'Trinidad and Tobago', 'Tunisia', 'Turkey', 'Taiwan',
'Tanzania', 'Uganda', 'Ukraine', 'Uruguay', 'United States',
'Uzbekistan', 'Venezuela', 'Vietnam', 'South Africa', 'Zambia',
'Zimbabwe'], dtype=object)
How can I apply train_test_split
at the level of countries rather than at the level of individual instances? To illustrate the question, I made a quick table representing my dataframe. How can I perform train_test_split
for each country, for example Aruba (so that 70% of the Aruba rows become training data and 30% become test data), do this for all countries, and at the end combine the per-country training and test sets (X_train, X_test, y_train and y_test) into one dataframe each?
To visualize:
(________part of X dataset________)   (y dataset)
CountryName   value1   value2   ...   valueN
Aruba         1        3        ...   3
Aruba         2        4        ...   6
Aruba         3        4        ...   1
...           ...      ...      ...   ...
Sweden        5        3        ...   2
Sweden        4        7        ...   2
...           ...      ...      ...   ...
Zimbabwe      2        3        ...   9
Zimbabwe      1        2        ...   8
Zimbabwe      5        1        ...   1
Zimbabwe      5        3        ...   3
...           ...      ...      ...   ...
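The per-country 70/30 split described above could be sketched roughly as follows. This is only a minimal illustration under stated assumptions, not a definitive solution: the toy dataframe (3 countries, 10 rows each) is a hypothetical stand-in for the real one, valueN is assumed to be the target column, and names such as train_df and test_df are made up for the example. It shows two possible routes: passing the country column to the stratify parameter of train_test_split, or splitting each country group separately and concatenating the pieces.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real dataframe: 3 countries, 10 rows each.
rng = np.random.default_rng(0)
countries = ["Aruba", "Sweden", "Zimbabwe"]
df = pd.DataFrame({
    "CountryName": np.repeat(countries, 10),
    "value1": rng.integers(1, 6, size=30),
    "value2": rng.integers(1, 8, size=30),
    "valueN": rng.integers(1, 10, size=30),  # assumed to be the target (y)
})

X = df.drop(columns="valueN")
y = df["valueN"]

# Option 1: a single call, stratified on the country column, so each country
# contributes about 70% of its rows to train and about 30% to test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=df["CountryName"], random_state=42
)

# Option 2: split each country on its own, then concatenate the pieces.
train_parts, test_parts = [], []
for _, group in df.groupby("CountryName"):
    g_train, g_test = train_test_split(group, test_size=0.3, random_state=42)
    train_parts.append(g_train)
    test_parts.append(g_test)

train_df = pd.concat(train_parts)  # all per-country training rows
test_df = pd.concat(test_parts)    # all per-country test rows

# Rows per country in each test set.
print(X_test["CountryName"].value_counts())
print(test_df["CountryName"].value_counts())
```

Option 1 is shorter and returns X_train, X_test, y_train and y_test directly. Option 2 follows the "split each country, then combine" idea literally and leaves features and target together in train_df and test_df, so they would still need to be separated afterwards.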