2

I put some data together for the 2015 FIFA Women's World Cup:

import pandas as pd

df = pd.DataFrame({
    'team':['Germany','USA','France','Japan','Sweden','England','Brazil','Canada','Australia','Norway','Netherlands','Spain',
       'China','New Zealand','South Korea','Switzerland','Mexico','Colombia','Thailand','Nigeria','Ecuador','Ivory Coast','Cameroon','Costa Rica'],
    'group':['B','D','F','C','D','F','E','A','D','B','A','E','A','A','E','C','F','F','B','D','C','B','C','E'],
    'fifascore':[2168,2158,2103,2066,2008,2001,1984,1969,1968,1933,1919,1867,1847,1832,1830,1813,1748,1692,1651,1633,1485,1373,1455,1589],
    'ftescore':[95.6,95.4,92.4,92.7,91.6,89.6,92.2,90.1,88.7,88.7,86.2,84.7,85.2,82.5,84.3,83.7,81.1,78.0,68.0,85.7,63.3,75.6,79.3,72.8]
    })

df.groupby(['group', 'team']).mean()

output

Now I would like to generate a new dataframe that contains the 6 possible pairings or matches within each group from df, in a format like:

group    team1        team2
A        Canada       China
A        Canada       Netherlands
A        Canada       New Zealand
A        China        Netherlands
A        China        New Zealand
A        Netherlands  New Zealand
B        Germany      Ivory Coast
B        Germany      Norway
...     

What is a concise and clean way to do this? I can do a bunch of loops through each group and team, but I feel like there should be a cleaner vectorized way to do this with pandas and the split-apply-combine paradigm.

EDIT: I also welcome any R answers, think it'd be interesting to compare between the R and Pandas ways here. Added the r tag.

Here's the data in R form, as requested in Comments:

team <- c('Germany','USA','France','Japan','Sweden','England','Brazil','Canada','Australia','Norway','Netherlands','Spain',
      'China','New Zealand','South Korea','Switzerland','Mexico','Colombia','Thailand','Nigeria','Ecuador','Ivory Coast','Cameroon','Costa Rica')
group <- c('B','D','F','C','D','F','E','A','D','B','A','E','A','A','E','C','F','F','B','D','C','B','C','E')
fifascore <- c(2168,2158,2103,2066,2008,2001,1984,1969,1968,1933,1919,1867,1847,1832,1830,1813,1748,1692,1651,1633,1485,1373,1455,1589)
ftescore <- c(95.6,95.4,92.4,92.7,91.6,89.6,92.2,90.1,88.7,88.7,86.2,84.7,85.2,82.5,84.3,83.7,81.1,78.0,68.0,85.7,63.3,75.6,79.3,72.8)

df <- data.frame(team, group, fifascore, ftescore)
selwyth
  • 2,417
  • 16
  • 19

2 Answers2

3

Here's the two-line solution:

import itertools

for grpname,grpteams in df.groupby('group')['team']:
    # No need to use grpteams.tolist() to convert from pandas Series to Python list
    print list(itertools.combinations(grpteams, 2))

[('Canada', 'Netherlands'), ('Canada', 'China'), ('Canada', 'New Zealand'), ('Netherlands', 'China'), ('Netherlands', 'New Zealand'), ('China', 'New Zealand')]
[('Germany', 'Norway'), ('Germany', 'Thailand'), ('Germany', 'Ivory Coast'), ('Norway', 'Thailand'), ('Norway', 'Ivory Coast'), ('Thailand', 'Ivory Coast')]
[('Japan', 'Switzerland'), ('Japan', 'Ecuador'), ('Japan', 'Cameroon'), ('Switzerland', 'Ecuador'), ('Switzerland', 'Cameroon'), ('Ecuador', 'Cameroon')]
[('USA', 'Sweden'), ('USA', 'Australia'), ('USA', 'Nigeria'), ('Sweden', 'Australia'), ('Sweden', 'Nigeria'), ('Australia', 'Nigeria')]
[('Brazil', 'Spain'), ('Brazil', 'South Korea'), ('Brazil', 'Costa Rica'), ('Spain', 'South Korea'), ('Spain', 'Costa Rica'), ('South Korea', 'Costa Rica')]
[('France', 'England'), ('France', 'Mexico'), ('France', 'Colombia'), ('England', 'Mexico'), ('England', 'Colombia'), ('Mexico', 'Colombia')]

Explanation:

First we get the teamlist of teams within each group using df.groupby('group') , iterate through that and accessing its 'team' series, to get a list of the 4 teams within each group:

for grpname,grpteams in df.groupby('group')['team']:
    teamlist = grpteams.tolist()
... 
['Canada', 'Netherlands', 'China', 'New Zealand']
['Germany', 'Norway', 'Thailand', 'Ivory Coast']
['Japan', 'Switzerland', 'Ecuador', 'Cameroon']
['USA', 'Sweden', 'Australia', 'Nigeria']
['Brazil', 'Spain', 'South Korea', 'Costa Rica']
['France', 'England', 'Mexico', 'Colombia']

Then we generate the all-play-all list of tuples of teams. David Arenburg's post reminded me to use itertools.combinations(..., 2). But we could have used a generator or nested for-loops:

def all_play_all(teams):
  for team1 in teams:
    for team2 in teams:
      if team1 < team2: # [Note] We don't need to generate indices then index into teamlist, just use direct string comparison
        yield (team1,team2)

>>> [match for match in all_play_all(grpteams)]
[('France', 'Mexico'), ('England', 'France'), ('England', 'Mexico'), ('Colombia', 'France'), ('Colombia', 'England'), ('Colombia', 'Mexico')]

Note we're taking a shortcut to first generating all possible tuples of indices, then using those to index into teamlist:

>>> T = len(teamlist) + 1
>>> [(i,j) for i in range(T) for j in range(T) if i<j]
[(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]

(Note: if we had used the approach of directly comparing team names, it would have the slight side-effect of resorting (alphabetically) the group names (they were originally sorted by seeding, not alphabetically), so e.g. 'China' < 'Netherlands', so their pairing will show up as ('Netherlands','China') not ('China',Netherlands'))

smci
  • 32,567
  • 20
  • 113
  • 146
  • You don't need to call tolist, `print(list(combinations(grp, 2)))` will do everything you need, the group is also the first item which would be needed to create the dataframe – Padraic Cunningham Jun 02 '15 at 22:10
  • 1
    @PadraicCunningham: Yeah thanks. Didn't know it ignored the pandas row-indices when applying a function to the Series. – smci Jun 02 '15 at 22:14
  • Thanks for introducing me to `itertools`! For those wanting to get to the finish line with a `DataFrame`, I constructed a `dict` to include the group (`_` in @smci's `for` loop, `grp` in @Padraic's approach in the question comments), used `DataFrame.from_dict` and `melt` to get it into the format I wanted, then `map` to pull teams out of the tuple. – selwyth Jun 03 '15 at 00:33
  • @selwyth: I called it `for _,grp in df.groupby...` to show we were throwing away the group-name. Now rewritten as `for grpname,grpteams in df.groupby...` – smci Jun 03 '15 at 22:46
3

Using R, here's a possible data.table solution using it's devel version on GitHub

#### To install development version
## library(devtools)
## install_github("Rdatatable/data.table", build_vignettes = FALSE)

library(data.table) ## v >= 1.9.5
setDT(df)[, transpose(combn(team, 2L, simplify = FALSE)), keyby = group]
#    group          V1          V2
# 1:     A      Canada Netherlands
# 2:     A      Canada       China
# 3:     A      Canada New Zealand
# 4:     A Netherlands       China
# 5:     A Netherlands New Zealand
# 6:     A       China New Zealand
# 7:     B     Germany      Norway
# 8:     B     Germany    Thailand
...
David Arenburg
  • 91,361
  • 17
  • 137
  • 196