How to merge pivot tables in Pandas without duplicates in columns

Question

I have a DataFrame imported from a csv file called test_data that's structured like this:

School Name	Test Score	Students in Grade 11	Students in Grade 12
PS 1	Advanced	170	106
PS 1	Basic	87	53
PS 2	Advanced	141	120
PS 2	Basic	168	167

I wanted the data for each school to be in one row only, so I used .pivot_table() to do this:

test_data_by_grade_11 = test_data.pivot_table("Students in Grade 11", ["School Name"], "Test Score")


test_data_by_grade_12 = test_data.pivot_table(("Students in Grade 12", ["School Name"], "Test Score")

test_data_by_grade_11.rename(columns = {"Advanced":"Grade 11 Advanced", "Basic":"Grade 11 Basic"}, inplace = True)

test_data_by_grade_12.rename(columns = {"Advanced":"Grade 12 Advanced", "Basic":"Grade 12 Basic"}, inplace = True)

And I successfully got two tables like this with the word "Test Score" above the headers of the table:

School Name	Grade 11 Advanced	Grade 11 Basic
PS 1	170	87
PS 2	141	168

School Name	Grade 12 Advanced	Grade 12 Basic
PS 1	106	53
PS 2	120	167

I wanted to merge these pivot tables back together to create a table like this:

School Name	Grade 11 Advanced	Grade 11 Basic	Grade 12 Advanced	Grade 12 Basic
PS 1	170	87	106	53
PS 2	141	168	120	167

However, when I try to merge them, I get really skewed data and duplicates that don't match the .csv file, like this:

School Name	Grade 11 Advanced	Grade 11 Basic	Grade 12 Advanced	Grade 12 Basic
PS 1	170	87	87	105
PS 2	170	87	87	136
PS 3	170	87	87	85
PS 4	170	87	87	96

(The Grade 12 Basic values DO exist in the file, but definitely don't match up with the correct rows).

As a result of these duplicates, the output returns almost 5 million rows, whereas the original .csv file has around 6500. Not sure what's going on here or what I'm missing - I suspect it's something to do with the left_on or right_on properties but I've played around with those and haven't gotten the desired table yet.

There is a typo in your code. The last two line produce syntax errors. — Marcelo Paco, Mar 25 '23 at 17:30

score 2 · Answer 1 · answered Mar 25 '23 at 17:37

2

You can obtain the final dataframe with just one .pivot:

df = df.rename(columns={'Students in Grade 11': 'Grade 11', 'Students in Grade 12': 'Grade 12'})

out = df.pivot(index='School Name', columns='Test Score')
out.columns = [f'{a} {b}' for a, b in out.columns]
print(out.reset_index())

Prints:

  School Name  Grade 11 Advanced  Grade 11 Basic  Grade 12 Advanced  Grade 12 Basic
0        PS 1                170              87                106              53
1        PS 2                141             168                120             167

answered Mar 25 '23 at 17:37

Andrej Kesely

168,389
15
48
91

Thanks for the input. I've tried this but am getting the error "Index contains duplicate entries, cannot reshape." I've heard aggfunc can be used to solve this, but I don't think I need to take the sum or mean of any of the data so I'm not sure what to do here. – FJJ Mar 25 '23 at 19:38
1

@FJJ You can drop the duplicates of use `df.groupby()` and do sum/mean/... Just make sure in column `School Name` you have unique values. Other possibility is to use `.pivot_table` instead `.pivot` with correct `aggfunc=` – Andrej Kesely Mar 25 '23 at 19:51
1

`.pivot_table` worked. That's the only change I made - I didn't need to add an `aggfunc=`. Thanks again. – FJJ Mar 25 '23 at 20:12

Sajil Alakkalakath · Answer 2 · 2023-04-06T17:36:16.293

0

A slightly similar result:

df=df.pivot_table(index=['School Name'],columns='Test Score',\
          values=['Students in Grade 11','Students in Grade 12'])

#concat column level index name with level 0 index name using list comprehension:

df.columns=[(col[0] + '-' + col[1]) for col in df.columns]

Finally reset row level index:

df.reset_index(inplace=True)

edited Apr 06 '23 at 17:36

answered Mar 26 '23 at 09:38

Sajil Alakkalakath

171
1
6

How to merge pivot tables in Pandas without duplicates in columns

2 Answers2