import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=87)

plt.scatter(x_train[:, 0], x_train[:,1], c=y_train)

Can someone explain this code to me? What is the difference between train and test, and what do [:, 0] and [:, 1] mean?

1 Answer


train_test_split() divides the data X (independent variables) and Y (dependent variable) into an 80/20 split. You only specify test_size=0.2, so train_size defaults to the complement, 0.8. For example, if your dataset consists of 100 rows, then x_train and y_train contain 80 randomly chosen rows from the original 100, and the remaining 20 rows go to x_test and y_test. The model is fitted on the training set and evaluated on the test set, which it has never seen; keeping the two sets apart is what prevents data leakage during model training. Setting random_state makes the shuffle reproducible: the same number always yields the same split. More info can be found in the scikit-learn documentation for train_test_split.
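To make the 80/20 split and the effect of random_state concrete, here is a minimal sketch. The toy X and Y arrays are made up for illustration (they are not your data), and names such as x_train_again, same_split and overlap are only there for the demonstration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 2 feature columns, alternating 0/1 labels.
X = np.arange(200).reshape(100, 2)
Y = np.arange(100) % 2

x_train, x_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.2, random_state=87
)
print(x_train.shape, x_test.shape)  # (80, 2) (20, 2)

# Calling it again with the same random_state gives the identical split.
x_train_again, x_test_again, _, _ = train_test_split(
    X, Y, test_size=0.2, random_state=87
)
same_split = np.array_equal(x_train, x_train_again)

# No row ends up in both sets (every row of the toy X is unique).
train_rows = {tuple(row) for row in x_train}
test_rows = {tuple(row) for row in x_test}
overlap = train_rows & test_rows
print(same_split, len(overlap))
```

If you change random_state to another number, you still get 80 and 20 rows, but most likely a different selection of rows in each set.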

plt.scatter() then creates a scatter plot. x_train[:, 0] selects all rows of the first column of x_train and supplies the x-axis coordinates; x_train[:, 1] selects all rows of the second column and supplies the y-axis coordinates. The argument c=y_train colours each data point according to its y_train value.
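The slicing itself is plain NumPy indexing: the part before the comma picks rows (":" means all of them) and the part after picks the column. A small sketch with a made-up three-row array standing in for x_train:

```python
import numpy as np

# Hypothetical stand-in for x_train: 3 rows, 2 columns.
x_train = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 30.0],
])

first_col = x_train[:, 0]   # all rows, column 0 -> x-axis values
second_col = x_train[:, 1]  # all rows, column 1 -> y-axis values

print(first_col)   # [1. 2. 3.]
print(second_col)  # [10. 20. 30.]
```

Passing these two one-dimensional arrays to plt.scatter() plots the points (1, 10), (2, 20) and (3, 30).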

JonnDough