Efficiently Reshape Pandas Dataframe with Equivalent Variable

Question

I'm struggling with reshaping a pandas data frame. Basically, I'm experimenting with sports statistics and I'm trying to tidy up box scores. I have a data frame which has visitor team and home team listed, along with their scores, all in unique columns like this:

Before Table

df = pd.DataFrame(data={'GameID': [1, 2], 'Date': ['9/10/2020', '9/13/2020'], 'Visitor': ['Houston Texans', 'Seattle Seahawks'], 'Score_V': [20,38], 'Home':['Kansas City Chiefs', 'Atlanta Falcons'], 'Score_H':[34,25]})

I'm reshaping the data frame so that the teams and scores all end up in their own column, so that I end up with something like this:

After Table

df2 = pd.DataFrame(data={'GameID': [1, 1, 2, 2], 'Date': ['9/10/2020', '9/10/2020', '9/13/2020', '9/13/2020'], 'Team': ['Houston Texans', 'Kansas City Chiefs', 'Seattle Seahawks', 'Atlanta Falcons'], 'Location':['Away', 'Home', 'Away', 'Home'], 'Score': [20,34,38,25]})

I came up with the following solution, where I melted the data frame and used .loc() logic to find and replace values in the various columns. But I feel like this solution is not very elegant and that I missed some obvious capability of pandas.melt().

df = df.melt(id_vars=['GameID','Date','Home','Visitor'])
df = df.rename(columns={"variable": "Location", "value": "Score"})
df.loc[df['Location'] == 'Score_V', 'Team'] = df['Visitor']
df.loc[df['Location'] == 'Score_V', 'Location'] = 'Away'
df.loc[df['Location'] == 'Score_H', 'Team'] = df['Home']
df.loc[df['Location'] == 'Score_H', 'Location'] = 'Home'
df = df.drop(columns=['Home', 'Visitor'])

Is there a simpler solution than this?

user6386471 · Answer 1 · 2020-11-22T20:07:46.447

It's not immediately clear how to use a direct melt approach that's simpler than your solution, however, you could have a look at reshaping using pandas lreshape:

df['location_V'] = 'Away'
df['location_H'] = 'Home'

cols = ['GameID','Date','Team','Location','Score']

pd.lreshape(df, {'Team':['Visitor','Home'],
                 'Score':['Score_V','Score_H'],
                 'Location':['location_V','location_H']})\
  .sort_values(['GameID','Location']).reset_index(drop=True)[cols]

>>> 

#   GameID   Date              Team         Location    Score
# 0   1    9/10/2020    Houston Texans        Away       20
# 1   1    9/10/2020    Kansas City Chiefs    Home       34
# 2   2    9/13/2020    Seattle Seahawks      Away       38
# 3   2    9/13/2020    Atlanta Falcons       Home       25

Here's another way to approach this which involves extracting row values using lambda functions, and then creating a new dataframe with these row values:

home_score = lambda x: (x['GameID'],x['Date'],x['Home'],'Home',x['Score_H'])
away_score = lambda x: (x['GameID'],x['Date'],x['Visitor'],'Away',x['Score_V'])

data = pd.concat([df.apply(home_score,axis=1), df.apply(away_score,axis=1)],axis=0).to_list()

pd.DataFrame(data, columns =['GameID','Date','Team','Location','Score']).sort_values(['GameID','Location']).reset_index(drop=True)

>>> 

#   GameID   Date              Team         Location    Score
# 0   1    9/10/2020    Houston Texans        Away       20
# 1   1    9/10/2020    Kansas City Chiefs    Home       34
# 2   2    9/13/2020    Seattle Seahawks      Away       38
# 3   2    9/13/2020    Atlanta Falcons       Home       25

sammywemmy · Answer 2 · 2021-01-08T21:02:13.123

To solve this, we need to find a way to rename the columns, so that we can pair home teams with 'away' and away teams with away:

    renamed = df.rename(
    columns=lambda column: "Team_Away"
    if column == "Visitor"
    else "Team_Home"
    if column == "Home"
    else f"{column[:-2]}_Away"
    if column.endswith("V")
    else f"{column[:-2]}_Home"
    if column.endswith("H")
    else column
)

    GameID  Date    Team_Away   Score_Away  Team_Home   Score_Home
0   1   9/10/2020   Houston Texans  20  Kansas City Chiefs  34
1   2   9/13/2020   Seattle Seahawks    38  Atlanta Falcons 25

You can then reshape using wide_to_long:

    pd.wide_to_long(
    renamed,
    stubnames=["Team", "Score"],
    i=["GameID", "Date"],
    j="Location",
    sep="_",
    suffix=".+",
)

                                Team          Score
GameID  Date    Location        
   1    9/10/2020   Away    Houston Texans      20
                    Home    Kansas City Chiefs  34
   2    9/13/2020   Away    Seattle Seahawks    38
                    Home    Atlanta Falcons     25

You can also use the pivot_longer function from pyjanitor; at the moment you have to install the latest development version from github:

 # install latest dev version
# pip install git+https://github.com/ericmjl/pyjanitor.git
 import janitor
renamed.pivot_longer(index=["GameID", "Date"], 
                     names_to=(".value", "Location"), 
                     names_sep="_",
                     sort_by_appearance=True)   

  GameID    Date      Location      Team          Score
0   1      9/10/2020    Away    Houston Texans      20
1   1      9/10/2020    Home    Kansas City Chiefs  34
2   2      9/13/2020    Away    Seattle Seahawks    38
3   2      9/13/2020    Home    Atlanta Falcons     25

Efficiently Reshape Pandas Dataframe with Equivalent Variable

2 Answers2