The title may not be clear, but I will try to explain my problem as clearly as possible. I have the following dummy data:
import pandas as pd

data = {
    'month': ['2022-01-01', '2022-02-01', '2022-03-01', '2022-01-01', '2022-02-01', '2022-03-01', '2022-01-01', '2022-02-01', '2022-03-01', '2022-01-01'],
    'Name': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'D'],
    'Age': [23, 24, 34, 45, 56, 46, 40, 30, 20, 50],
    'Experience': [1, 2, 4, 6, 7, 7, 5, 10, 9, 8],
    'salary': [50, 60, 70, 80, 80, 90, 55, 75, 100, 95],
}
df = pd.DataFrame(data)
df
In machine learning, we usually split the data set into training and test sets in a ratio such as 60-40, 70-30, or 80-20. Before splitting, we drop the columns that are not needed for training and separate the inputs from the output, like below:
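from sklearn.model_selection import train_test_split

labels = df[['month', 'Name']]                       # identifying columns, not used for training
y = df[['salary']]                                   # target
X = df.drop(['month', 'Name', 'salary'], axis=1)     # features: Age, Experience
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.40, random_state=42)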
After splitting, X_train and y_train are our training data and X_test and y_test are our test data.
So I was wondering: how can I add the month, Name, and salary columns back to the train data and the test data, so that I can tell which month and Name each row belongs to?
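
To make the goal concrete, here is a minimal sketch of the result I am after. It relies on one assumption: that train_test_split keeps the original row index of a pandas DataFrame in its outputs, so the dropped columns can be looked up again by index. The names train_full and test_full are just illustrative, and I am not sure this is the recommended way to do it.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

data = {
    'month': ['2022-01-01', '2022-02-01', '2022-03-01', '2022-01-01', '2022-02-01',
              '2022-03-01', '2022-01-01', '2022-02-01', '2022-03-01', '2022-01-01'],
    'Name': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'D'],
    'Age': [23, 24, 34, 45, 56, 46, 40, 30, 20, 50],
    'Experience': [1, 2, 4, 6, 7, 7, 5, 10, 9, 8],
    'salary': [50, 60, 70, 80, 80, 90, 55, 75, 100, 95],
}
df = pd.DataFrame(data)

labels = df[['month', 'Name']]
y = df[['salary']]
X = df.drop(['month', 'Name', 'salary'], axis=1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.40, random_state=42
)

# Assumption: X_train / X_test still carry the row index of df,
# so the matching label rows can be selected with .loc and
# concatenated column-wise (concat aligns on the index).
train_full = pd.concat([labels.loc[X_train.index], X_train, y_train], axis=1)
test_full = pd.concat([labels.loc[X_test.index], X_test, y_test], axis=1)

print(train_full)
print(test_full)
```

With this, each row of train_full and test_full would again show its month and Name next to Age, Experience, and salary.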