
The title may not be clear, so I will try to explain my problem as clearly as possible. I have the following dummy data:

import pandas as pd

data = {
    'month': ['2022-01-01', '2022-02-01', '2022-03-01', '2022-01-01', '2022-02-01', '2022-03-01', '2022-01-01', '2022-02-01', '2022-03-01', '2022-01-01'],
    'Name': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'D'],
    'Age': [23, 24, 34, 45, 56, 46, 40, 30, 20, 50],
    'Experience': [1, 2, 4, 6, 7, 7, 5, 10, 9, 8],
    'salary': [50, 60, 70, 80, 80, 90, 55, 75, 100, 95],
}

df = pd.DataFrame(data)
df


In machine learning, we usually split a dataset into training and test sets, e.g. 60-40, 70-30, or 80-20. Before splitting, we drop the columns that are not needed for training and separate the inputs from the output, like this:

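from sklearn.model_selection import train_test_split

labels = df[['month', 'Name']]  # identifying columns, not used for training
y = df[['salary']]              # target
X = df.drop(['month', 'Name', 'salary'], axis=1)  # inputs: Age, Experience
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.40, random_state=42)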

After splitting, we have the training data (X_train, y_train) and the test data (X_test, y_test), which contain only the Age and Experience inputs and the salary output.

So I was wondering: how can we add the month, Name, and salary columns back to the training and test data, so that we can tell which month and Name each row belongs to?
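To make the goal concrete, here is a minimal sketch of one possible way to do it. It relies on `train_test_split` keeping the original row index when it is given pandas objects, so the dropped columns can be looked up by that index. The names `train_full` and `test_full` are only illustrative.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

data = {
    'month': ['2022-01-01', '2022-02-01', '2022-03-01', '2022-01-01', '2022-02-01',
              '2022-03-01', '2022-01-01', '2022-02-01', '2022-03-01', '2022-01-01'],
    'Name': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'D'],
    'Age': [23, 24, 34, 45, 56, 46, 40, 30, 20, 50],
    'Experience': [1, 2, 4, 6, 7, 7, 5, 10, 9, 8],
    'salary': [50, 60, 70, 80, 80, 90, 55, 75, 100, 95],
}
df = pd.DataFrame(data)

labels = df[['month', 'Name']]
y = df[['salary']]
X = df.drop(['month', 'Name', 'salary'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.40, random_state=42
)

# X_train/X_test still carry the row index of df, so the matching
# label rows can be selected with .loc and joined on that index.
train_full = labels.loc[X_train.index].join(X_train).join(y_train)
test_full = labels.loc[X_test.index].join(X_test).join(y_test)

print(test_full)
```

This only works as long as the index of `df` is unique and is not reset between the split and the join.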

The expected result for the test data is a table that contains the month, Name, and salary columns alongside Age and Experience.

Bad Coder
  • Why not _first_ split into train and val, and _then_ separate the columns? – ShlomiF Sep 21 '22 at 17:58
  • Yeah, we can do that as well, I guess, but the steps would be repeated, so I was looking at it this way. If we can match the rows exactly, it will be more efficient. – Bad Coder Sep 21 '22 at 18:22
  • You need to explain why you drop the columns then split. Because as @ShlomiF says, the obvious thing to do is to split X without dropping month and name. (Don't add Salary back, that's the target.) – Matt Hall Sep 21 '22 at 18:26
  • So, in the validation set, I added salary back just to compare the true and predicted results. I dropped the columns and then split because I thought that was the standard practice. – Bad Coder Sep 21 '22 at 18:38
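
For reference, the approach suggested in the comments (split the full frame first, then separate the columns) might look roughly like the sketch below. The `LinearRegression` model and the `predicted_salary` column name are assumptions made purely for illustration; they are not part of the question.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

data = {
    'month': ['2022-01-01', '2022-02-01', '2022-03-01', '2022-01-01', '2022-02-01',
              '2022-03-01', '2022-01-01', '2022-02-01', '2022-03-01', '2022-01-01'],
    'Name': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'D'],
    'Age': [23, 24, 34, 45, 56, 46, 40, 30, 20, 50],
    'Experience': [1, 2, 4, 6, 7, 7, 5, 10, 9, 8],
    'salary': [50, 60, 70, 80, 80, 90, 55, 75, 100, 95],
}
df = pd.DataFrame(data)

# Split the whole frame first, so month and Name stay attached to each row.
train_df, test_df = train_test_split(df, test_size=0.40, random_state=42)

# Separate inputs and target only when fitting and predicting.
feature_cols = ['Age', 'Experience']
model = LinearRegression()  # placeholder model, chosen only for the example
model.fit(train_df[feature_cols], train_df['salary'])

# Attach predictions next to the true salary for comparison.
results = test_df.assign(predicted_salary=model.predict(test_df[feature_cols]))
print(results)
```

With this ordering nothing has to be joined back, because the identifying columns never leave `train_df` and `test_df`.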

0 Answers