Overfitting in a data frame where some rows are repeated

Question

I have a machine learning problem that I am solving with logistic regression. My data frame has some rows and features repeated, as in the table below:

feature 1	feature 2	feature 3	...	feature n-1	feature n	Target
a1	a2	a3	..	an	1	1
b1	b2	b3	..	bn	1	0
c1	c2	c3	..	cn	1	1
..	..	..	..	..	1	..
a1	a2	a3	..	an	2	..
b1	b2	b3	..	bn	2	..
c1	c2	c3	..	cn	2	..
..	..	..	..	..	2	..
a1	a2	a3	..	an	3	..
b1	b2	b3	..	bn	3	..
c1	c2	c3	..	cn	3	..
..	..	..	..	..	..	..

Can overfitting or underfitting occur with this data frame?
And what about a data frame that has 6 to 8 features and about 500 rows?
Note that the rows that repeat in features 1 to n-1 differ in feature n.

They are identical across samples except for the last feature which is probably the label (target) so they are not informative at all. Why don't you [drop](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html) the duplicates before model fitting? — seralouk, Sep 14 '22 at 12:22
@seralouk The last feature is not target (feature n is not target). — Poorya Alishah Kamandi, Sep 14 '22 at 12:36
This is not a *programming* question, please see the intro and NOTE in https://stackoverflow.com/tags/machine-learning/info. — desertnaut, Sep 14 '22 at 12:49
And if `feature-n` is not a label, then the rows are not identical... — desertnaut, Sep 14 '22 at 12:50
@PooryaAlishahKamandi ok you edited your post after my answer. Regardless, for the "almost identical" rows, only the feature n brings some information. This is not a problem per se. — seralouk, Sep 15 '22 at 09:15

Jeffrey · Answer 1 · 2022-09-14T16:35:08.490

Whether you overfit or not depends on:

- the complexity of the model
- the available data.

But what matters is the information in the actual data. If you double the data set by repeating every row, you add no new information to it. In fact, many algorithms sample randomly from the data set anyway. So having duplicates changes nothing (unless the duplicated rows follow a different distribution than the non-duplicated ones).

As such, removing the duplicates will not affect whether you overfit or not.
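To make the point about exact duplication concrete, here is a minimal sketch. It assumes an unregularized logistic regression fitted by plain gradient descent on the mean log-loss. The `fit_logistic` helper and the toy data are made up for illustration and are not from the question.

```python
import math

def fit_logistic(X, y, lr=0.5, steps=2000):
    """Gradient descent on the mean log-loss, with no regularization."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(steps):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability
            err = p - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        # Dividing by n makes this the gradient of the *mean* loss.
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# Toy data, invented for this example (not linearly separable).
X = [[0.5, 1.0], [1.5, 0.2], [2.0, 1.8], [0.1, 0.4],
     [1.2, 1.1], [2.5, 0.3], [1.0, 0.5], [1.6, 1.0]]
y = [0, 1, 1, 0, 0, 1, 1, 0]

w1, b1 = fit_logistic(X, y)
w2, b2 = fit_logistic(X + X, y + y)  # every row appears twice

print(w1, b1)
print(w2, b2)  # same coefficients, up to floating-point rounding
```

This equivalence is for the unpenalized case only. If the model has a penalty term that is not scaled with the number of rows (as far as I know, scikit-learn's `C` multiplies the summed loss), duplicating every row would act like weaker regularization.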

Edit: Now, if the data is not duplicated but rather modified, it is a different story. You wrote:

> where some rows and features are repeated

If whole rows are repeated exactly, there is no effect.

But if the rows are modified, as the table shows, then you need to explain where the difference comes from: are these actual noisy measurements? Or is this a random transcription or data collection error?

However, if these are not errors in the data set but actual data, then it is important to keep them. This is not about overfitting; it is about training on the actual data.
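A quick way to check which situation you are in is to count duplicates over all features and then over features 1 to n-1 only. The sketch below uses pandas. The column names (`f1`, `f2`, `f3`, `f_n`, `target`) and the target values are hypothetical stand-ins for the table in the question.

```python
import pandas as pd

# Hypothetical stand-in for the question's table: f1..f3 repeat,
# while the last feature "f_n" takes the values 1, 2, 3.
# The target values are invented for illustration.
df = pd.DataFrame({
    "f1": ["a1", "b1", "c1"] * 3,
    "f2": ["a2", "b2", "c2"] * 3,
    "f3": ["a3", "b3", "c3"] * 3,
    "f_n": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "target": [1, 0, 1, 1, 1, 0, 0, 0, 1],
})

feature_cols = ["f1", "f2", "f3", "f_n"]

# Rows that repeat an earlier row in *all* features.
full_dupes = int(df.duplicated(subset=feature_cols).sum())

# Rows that repeat an earlier row in features 1..n-1 only.
partial_dupes = int(df.duplicated(subset=feature_cols[:-1]).sum())

# Dropping exact duplicates removes nothing here, because feature n
# distinguishes the rows.
deduped = df.drop_duplicates(subset=feature_cols)

print(full_dupes, partial_dupes, len(df), len(deduped))
```

If the first count is zero and only the second is large, the rows are not duplicates in the sense that matters, and each one carries information through feature n.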

So you mean changing the last feature has no effect on my model accuracy? @jeffrey — Poorya Alishah Kamandi, Sep 14 '22 at 12:42
See edit. I missed that what you stated and what the data shows were different. — Jeffrey, Sep 14 '22 at 12:52
I should add that the rows that repeat in features 1 to n-1 differ in feature n. @Jeffrey — Poorya Alishah Kamandi, Sep 14 '22 at 13:00
