1

I have a DataFrame imported from a csv file called test_data that's structured like this:

School Name Test Score Students in Grade 11 Students in Grade 12
PS 1 Advanced 170 106
PS 1 Basic 87 53
PS 2 Advanced 141 120
PS 2 Basic 168 167

I wanted the data for each school to be in one row only, so I used .pivot_table() to do this:

test_data_by_grade_11 = test_data.pivot_table("Students in Grade 11", ["School Name"], "Test Score")


test_data_by_grade_12 = test_data.pivot_table(("Students in Grade 12", ["School Name"], "Test Score")

test_data_by_grade_11.rename(columns = {"Advanced":"Grade 11 Advanced", "Basic":"Grade 11 Basic"}, inplace = True)

test_data_by_grade_12.rename(columns = {"Advanced":"Grade 12 Advanced", "Basic":"Grade 12 Basic"}, inplace = True)

And I successfully got two tables like this with the word "Test Score" above the headers of the table:

School Name Grade 11 Advanced Grade 11 Basic
PS 1 170 87
PS 2 141 168
School Name Grade 12 Advanced Grade 12 Basic
PS 1 106 53
PS 2 120 167

I wanted to merge these pivot tables back together to create a table like this:

School Name Grade 11 Advanced Grade 11 Basic Grade 12 Advanced Grade 12 Basic
PS 1 170 87 106 53
PS 2 141 168 120 167

However, when I try to merge them, I get really skewed data and duplicates that don't match the .csv file, like this:

School Name Grade 11 Advanced Grade 11 Basic Grade 12 Advanced Grade 12 Basic
PS 1 170 87 87 105
PS 2 170 87 87 136
PS 3 170 87 87 85
PS 4 170 87 87 96

(The Grade 12 Basic values DO exist in the file, but definitely don't match up with the correct rows).

As a result of these duplicates, the output returns almost 5 million rows, whereas the original .csv file has around 6500. Not sure what's going on here or what I'm missing - I suspect it's something to do with the left_on or right_on properties but I've played around with those and haven't gotten the desired table yet.

FJJ
  • 37
  • 4

2 Answers2

2

You can obtain the final dataframe with just one .pivot:

df = df.rename(columns={'Students in Grade 11': 'Grade 11', 'Students in Grade 12': 'Grade 12'})

out = df.pivot(index='School Name', columns='Test Score')
out.columns = [f'{a} {b}' for a, b in out.columns]
print(out.reset_index())

Prints:

  School Name  Grade 11 Advanced  Grade 11 Basic  Grade 12 Advanced  Grade 12 Basic
0        PS 1                170              87                106              53
1        PS 2                141             168                120             167
Andrej Kesely
  • 168,389
  • 15
  • 48
  • 91
  • Thanks for the input. I've tried this but am getting the error "Index contains duplicate entries, cannot reshape." I've heard aggfunc can be used to solve this, but I don't think I need to take the sum or mean of any of the data so I'm not sure what to do here. – FJJ Mar 25 '23 at 19:38
  • 1
    @FJJ You can drop the duplicates of use `df.groupby()` and do sum/mean/... Just make sure in column `School Name` you have unique values. Other possibility is to use `.pivot_table` instead `.pivot` with correct `aggfunc=` – Andrej Kesely Mar 25 '23 at 19:51
  • 1
    `.pivot_table` worked. That's the only change I made - I didn't need to add an `aggfunc=`. Thanks again. – FJJ Mar 25 '23 at 20:12
0

A slightly similar result:

df=df.pivot_table(index=['School Name'],columns='Test Score',\
          values=['Students in Grade 11','Students in Grade 12'])

#concat column level index name with level 0 index name using list comprehension:

df.columns=[(col[0] + '-' + col[1]) for col in df.columns]

Finally reset row level index:

df.reset_index(inplace=True)