Converting counts of levels to separate variables, including 0s for missing counts

Question

I have a dataframe (table), that includes frequency counts (Freq) of 2 levels (F, I) of a categorical variable (Fert).

table[1:10]

    FemID Sperm  Week Fert Freq
1:   269  High    1    F    4
2:   269  High    1    I    5
3:   273  High    1    F    6
4:   274  High    1    I    1
5:   275  High    1    I    1
6:   276  High    1    I    1
7:   278   Low    1    I    1
8:   280   Low    1    I    1
9:   281   Low    1    I    1
10:   282   Low    1    I    5

I would like to convert this to a dataframe in which the two levels of Fert (I and F) are separate variables for each value of FemID, with 0 for missing counts of a level, like so:

    FemID Sperm  Week Fert Infert
1:   269  High    1    4    5
2:   273  High    1    6    0
3:   274  High    1    1    0
4:   275  High    1    1    0
5:   276  High    1    1    0

Thoughts or suggestions? I feel like a loop is required, but I'm not sure how to go about setting it up for this. Perhaps there are two parts, one that creates the two new variables and one that fills in the 0's?

using `library(data.table);dcast(dat, FemID+Sperm+Week~Fert,fill = 0,value.var = "Freq")` — Onyambu, Mar 21 '18 at 19:03
@Onyambu Sorry mate. I had already added answer before I realized your comments. — MKR, Mar 21 '18 at 19:10

mysteRious · Accepted Answer · 2018-03-21T19:24:23.980

You can use spread in tidyr for this:

> library(tidyr)
> df %>% spread(Fert,Freq)
  FemID Sperm Week  F  I
1   269  High    1  4  5
2   273  High    1  6 NA
3   274  High    1 NA  1
4   275  High    1 NA  1
5   276  High    1 NA  1
6   278   Low    1 NA  1
7   280   Low    1 NA  1
8   281   Low    1 NA  1
9   282   Low    1 NA  5

You can also adjust the variable names:

> df %>% spread(Fert,Freq) %>% 
      setNames(c("FemID","Sperm","Week","Fert","Infert"))
  FemID Sperm Week Fert Infert
1   269  High    1    4      5
2   273  High    1    6     NA
3   274  High    1   NA      1
4   275  High    1   NA      1
.... the rest is truncated

And can filter by NAs:

> df %>% spread(Fert,Freq) %>% 
    setNames(c("FemID","Sperm","Week","Fert","Infert")) %>% 
    filter(!is.na(Fert))
  FemID Sperm Week Fert Infert
1   269  High    1    4      5
2   273  High    1    6     NA

On my PC `spread` comes with `tidyr`. May be you have got a special version of `dplyr`:-) — MKR, Mar 21 '18 at 19:21

MKR · Answer 2 · 2018-03-21T19:18:31.900

Since your data is in data.table hence dcast is an good option:

  library(data.table)
  setDT(df)
  dcast(df, FemID+Sperm+Week~Fert, value.var = "Freq") 

  #OR A shorter way could be as

  dcast(df, ...~Fert, value.var = "Freq")


  #    FemID  Sperm  Week  F  I
  # 1:   269  High    1  4  5
  # 2:   273  High    1  6 NA
  # 3:   274  High    1 NA  1
  # 4:   275  High    1 NA  1
  # 5:   276  High    1 NA  1
  # 6:   278   Low    1 NA  1
  # 7:   280   Low    1 NA  1
  # 8:   281   Low    1 NA  1
  # 9:   282   Low    1 NA  5

Data

  df <- read.table(text = "FemID Sperm  Week Fert Freq
  1:   269  High    1    F    4
  2:   269  High    1    I    5
  3:   273  High    1    F    6
  4:   274  High    1    I    1
  5:   275  High    1    I    1
  6:   276  High    1    I    1
  7:   278   Low    1    I    1
  8:   280   Low    1    I    1
  9:   281   Low    1    I    1
  10:   282   Low    1    I    5", header = TRUE, stringsAsFactors = FALSE)

Converting counts of levels to separate variables, including 0s for missing counts

2 Answers2