Why is .loc[] producing duplicate rows in my DataFrame?

I'm trying to select a few columns from m3, a DataFrame with 47 columns, to create a new DataFrame called output.

The problem: after accessing m3's columns with .loc[], output has way more duplicates than m3 started with. Where could these duplicates have come from? I haven't found anything online about .loc[] duplicating rows. The output DataFrame is declared on the line that reads output = m3.loc[...], by the way.

The Code:

print("ARE THERE DUPLICATES in m3? ")
print(m3.duplicated().loc[lambda x: x==True])

output = m3.loc[:,["PLC_name", "line", "track", "notes", "final_source", 
"s_name", "s_line", "s_track", "loc", "alt_loc", "suffix", "alt_match_name"]]

print("ARE THERE DUPLICATES in output? ")
print(output.duplicated().loc[lambda x: x==True].size, "duplicates")

The Terminal Output:

ARE THERE DUPLICATES in m3? 
5241    True
5242    True
5243    True
5355    True
5356    True
5357    True
dtype: bool
ARE THERE DUPLICATES in output? 
1838 duplicates

Of course, I could easily fix the problem by calling .drop_duplicates(keep="first"), but I'm more interested in learning why .loc[] displays this behavior.

David

1 Answer

.loc[] is not adding any rows: output has the same rows as m3, but only the selected columns. When you call duplicated on m3, all columns of the original DataFrame are compared. When you call duplicated on output, only the selected subset of columns is compared, so rows that differ only in the dropped columns now count as duplicates.

Therefore, you can have duplicates in output even when there are no duplicates in m3.

Here's a minimal and reproducible example of what you're seeing:

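import pandas as pd

# Two rows that differ only in column 0
df = pd.DataFrame([[3, 8, 9], [4, 8, 9]])
print(df.duplicated().sum(), 'duplicates')
# 0 duplicates

# Dropping column 0 leaves two identical rows
df_filtered = df.loc[:, [1, 2]]
print(df_filtered.duplicated().sum(), 'duplicates')
# 1 duplicates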
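
To compare like with like, one option is to pass the same column list to duplicated via its subset parameter, so the check on m3 looks at exactly the columns that end up in output. The sketch below is only an illustration: the small m3 frame, its "extra" column, and the cols list are hypothetical stand-ins, not your real data.

```python
import pandas as pd

# Hypothetical stand-in for m3: the first two rows differ only in "extra".
m3 = pd.DataFrame({
    "PLC_name": ["a", "a", "b"],
    "line": [1, 1, 2],
    "extra": [10, 20, 30],
})
cols = ["PLC_name", "line"]  # hypothetical subset of columns to keep

# Compares every column, so no row repeats an earlier one.
full_dupes = m3.duplicated().sum()

# Compares only the columns in cols, without building a new DataFrame.
subset_dupes = m3.duplicated(subset=cols).sum()

# Same count as selecting the columns first and then checking.
output_dupes = m3.loc[:, cols].duplicated().sum()

print(full_dupes, subset_dupes, output_dupes)

# keep=False flags every member of a duplicate group, which helps when
# inspecting which dropped columns made the rows distinct in m3.
print(m3[m3.duplicated(subset=cols, keep=False)])
```

If the two subset-based counts agree, as they do here, the extra duplicates come entirely from the columns that were left out rather than from .loc[] itself.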
jpp
    Thanks @jpp! I was looking at this for a solid hour and now I'm having a real "duh" moment. Like why didn't I see it sooner! Anyhow, I upvoted your answer too, I suppose it'll show when I have more reputation. – David Nov 16 '18 at 23:23