Data manipulation, kind of downsampling

Question

I have a large csv file, example of the data below. I will use an example of eight teams to illustrate.

home_team    away_team      home_score       away_score         year
belgium      france         2                2                  1990
brazil       uruguay        3                1                  1990
italy        belgium        1                2                  1990
sweden       mexico         3                1                  1990

france       chile          3                1                  1991
brazil       england        2                1                  1991
italy        belgium        1                2                  1991
chile        switzerland    2                2                  1991

My data runs for many years. I would like to have total number of scores of each team every year, see example below,

team            total_scores          year
belgium         4                     1990
france          2                     1990
brazil          3                     1990
uruguay         1                     1990
italy           1                     1990
sweden          3                     1990
mexico          1                     1990

france          3                     1991
chile           5                     1991
brazil          2                     1991
england         1                     1991
italy           1                     1991
belgium         2                     1991
switzerland     2                     1991

Thoughts?

In addition to what you've tried, it would be helpful to know in what language—you've tagged 2 of them — camille, Feb 21 '20 at 15:47

score 2 · Answer 1 · answered Feb 21 '20 at 13:35

Here is a solution using the tidyverse (dplyr and tidyr), in particular the pivot functions from tidyr...

library(tidyverse)

df %>% pivot_longer(cols = -year,   #splits non-year columns into home/away and type columns
                    names_to = c("homeaway", "type"), 
                    names_sep = "_", 
                    values_to = "value", 
                    values_ptypes = list(value = character())) %>% 
  select(-homeaway) %>%             #remove home/away
  pivot_wider(names_from = "type",  #restore team and score columns (as list columns)
              values_from = "value") %>% 
  unnest(cols = c(team, score)) %>% #unnest the list columns to year, team, score
  group_by(year, team) %>% 
  summarise(total_goals = sum(as.numeric(score)))

# A tibble: 14 x 3
# Groups:   year [2]
    year team        total_goals
   <int> <chr>             <dbl>
 1  1990 belgium               4
 2  1990 brazil                3
 3  1990 france                2
 4  1990 italy                 1
 5  1990 mexico                1
 6  1990 sweden                3
 7  1990 uruguay               1
 8  1991 belgium               2
 9  1991 brazil                2
10  1991 chile                 3
11  1991 england               1
12  1991 france                3
13  1991 italy                 1
14  1991 switzerland           2

This looks overly complicated to me. Pivoting back and forth AND un-nesting. Basically everything that needs to be done after the `pivot_longer()` up until the `group_by()` is just necessary, because the first step was not an appropriate choice. The initial data isn't wide, as repeated values for the same case are not stored in separate columns, therefore pivoting is not necessary. — Till, Feb 21 '20 at 19:11

Allan Cameron · Answer 2 · 2020-02-21T13:32:25.270

You can try:

library(dplyr)

setNames(rbind(df[,c(1,3,5)], 
               setNames(df[,c(2,4,5)], names(df[,c(1,3,5)]))), 
         c("Country", "Goals", "Year")) %>%
  group_by(Year, Country) %>% 
  summarize(Total = sum(Goals))
#> # A tibble: 14 x 3
#> # Groups:   Year [2]
#>     Year Country     Total
#>    <int> <chr>       <int>
#>  1  1990 belgium         4
#>  2  1990 brazil          3
#>  3  1990 france          2
#>  4  1990 italy           1
#>  5  1990 mexico          1
#>  6  1990 sweden          3
#>  7  1990 uruguay         1
#>  8  1991 belgium         2
#>  9  1991 brazil          2
#> 10  1991 chile           3
#> 11  1991 england         1
#> 12  1991 france          3
#> 13  1991 italy           1
#> 14  1991 switzerland     2

^{Created on 2020-02-21 by the reprex package (v0.3.0)}

score 1 · Accepted Answer · answered Feb 21 '20 at 14:12

Here is yet another solution in R.

#Packages needed
library(dplyr)
library(magrittr)
library(tidyr)

#Your data
home_team <- c("belgium", "brazil", "italy", "sweden",
               "france", "brazil", "italy", "chile")
away_team <- c("france", "uruguay", "belgium", "mexico",
               "chile", "england", "belgium", "switzerland")
home_score <- c(2,3,1,3,
                3,2,1,2)
away_score <- c(2,1,2,1,
                1,1,2,2)
year <- c(1990, 1990, 1990, 1990,
          1991, 1991, 1991, 1991)

df <- data.frame(home_team, away_team, home_score, away_score, year, stringsAsFactors = FALSE)

df

#   home_team   away_team home_score away_score year
# 1   belgium      france          2          2 1990
# 2    brazil     uruguay          3          1 1990
# 3     italy     belgium          1          2 1990
# 4    sweden      mexico          3          1 1990
# 5    france       chile          3          1 1991
# 6    brazil     england          2          1 1991
# 7     italy     belgium          1          2 1991
# 8     chile switzerland          2          2 1991


#Column names for the new data.frames
my_colnames <- c("team", "score", "year")

#Using select() to create separate home and away datasets
df_home <- df %>% select(matches("home|year")) %>% setNames(my_colnames) %>% mutate(game_where = "home")
df_away <- df %>% select(matches("away|year")) %>% setNames(my_colnames) %>% mutate(game_where = "away")

#rbind()'ing both data.frames
#Grouping the rows together first by the team and then by the year
#Summing up the scores for the aforementioned groupings
#Sorting the newly produced data.frame by year
df_1 <- rbind(df_home, df_away) %>% group_by(team, year) %>% tally(score) %>% arrange(year)

df_1 

 #   team         year     n
 #   <chr>       <dbl> <dbl>
 # 1 belgium      1990     4
 # 2 brazil       1990     3
 # 3 france       1990     2
 # 4 italy        1990     1
 # 5 mexico       1990     1
 # 6 sweden       1990     3
 # 7 uruguay      1990     1
 # 8 belgium      1991     2
 # 9 brazil       1991     2
 #10 chile        1991     3
 #11 england      1991     1
 #12 france       1991     3
 #13 italy        1991     1
 #14 switzerland  1991     2

score 1 · Answer 4 · answered Feb 21 '20 at 19:02

Adding a solution that uses dplyr only.

 library(dplyr)

 bind_rows(
   select(df, team = home_team, score = home_score, year),
   select(df, team = away_team, score = away_score, year)
 ) %>% 
   group_by(team, year) %>% 
   summarise(total_scores = sum(score))

Data manipulation, kind of downsampling

4 Answers4

Linked