Using dcast to cast a wide format when the variables per time period are not uniquely specified

Question

I would like to convert this csv file into a long format. It currently look as follows:

Every ID is listed 1 time for each years, so a total of 7 times.

What I would like to do is to for each ID have one row, where the variables are listed as logwks1 + logwks2 + .. + logwks6 + logwks7.

I have started by simply melting with wagem <- melt(wage, id = "ID")

I however don't understand how to cast it in order to have the desired format.

I tried to do wagec <- dcast(wagem, ID ~ variable), but then it counts the observations as the default (I think because it has no way to know how to cast them otherwise).

How do I fix this?

what should `logwks1` contain? the logwks of year 1? – Benjamin Schwetz Aug 09 '19 at 07:47 — Benjamin Schwetz, Aug 09 '19 at 07:47
Yes, exactly. My apologies for not making that clear. – Tom Aug 09 '19 at 07:48 — Tom, Aug 09 '19 at 07:48

Benjamin Schwetz · Accepted Answer · 2019-08-09T09:45:38.937

spread from tidyr can do this for you. Just change the values of the year column first to match the column names you want later.

library(tidyverse)
data <- tibble::tribble(
          ~Year,       ~LOGWKS, ~ID,
              1, "0,862124465",   1,
              2, "0,433704181",   1,
              3, "0,409959143",   1,
              4, "0,763847693",   1,
              5, "0,847479032",   1,
              6, "0,855926486",   1,
              7, "0,809774126",   1
          )
data %>% 
  mutate(
    Year = paste0("LOGWKS", Year)
  ) %>% 
  spread(
    Year, LOGWKS
  )
#> # A tibble: 1 x 8
#>      ID LOGWKS1   LOGWKS2   LOGWKS3   LOGWKS4   LOGWKS5   LOGWKS6  LOGWKS7 
#>   <dbl> <chr>     <chr>     <chr>     <chr>     <chr>     <chr>    <chr>   
#> 1     1 0,862124~ 0,433704~ 0,409959~ 0,763847~ 0,847479~ 0,85592~ 0,80977~

^{Created on 2019-08-09 by the reprex package (v0.3.0)}

edit: if you have multiple variables to spread you can use gather first and then cast it:

library(tidyverse)
data_semi_long <- tibble::tribble(
  ~Year,       ~LOGWKS,     ~whatever, ~ID,
      1, "0,402711636", "0,182708713",   1,
      2, "0,094020099", "0,776126975",   1,
      3, "0,948184845", "0,083343821",   1,
      4, "0,529592883", "0,462755147",   1,
      5, "0,612587798", "0,613195331",   1,
      6, "0,108845887", "0,032397081",   1,
      7, "0,585433903", "0,788149493",   1
  )
data_semi_long %>% 
  gather(key, value, -ID, - Year) %>% 
  mutate(
    Year = paste0(key, Year)
  ) %>% 
  reshape2::dcast(
    ID ~Year
  )
#>   ID     LOGWKS1     LOGWKS2     LOGWKS3     LOGWKS4     LOGWKS5
#> 1  1 0,402711636 0,094020099 0,948184845 0,529592883 0,612587798
#>       LOGWKS6     LOGWKS7   whatever1   whatever2   whatever3   whatever4
#> 1 0,108845887 0,585433903 0,182708713 0,776126975 0,083343821 0,462755147
#>     whatever5   whatever6   whatever7
#> 1 0,613195331 0,032397081 0,788149493

^{Created on 2019-08-09 by the reprex package (v0.3.0)}

Thank you for your answer Bernd! But if I understand your solution correctly, this only works for one variable, not for the dataset, right? logwks was just an example. I need the whole dataset converted. — Tom, Aug 09 '19 at 08:24
in that case, you should first get your data into proper long format, which is 1 observation per row. — Benjamin Schwetz, Aug 09 '19 at 08:27
Thanks! Is there any way to get these values in one row for each ID? — Tom, Aug 09 '19 at 09:36
my bad. I keep forgetting that `spread` doesn't really work the way **I** think it should :) You should use `reshape2::dcast()` there. — Benjamin Schwetz, Aug 09 '19 at 09:45

Using dcast to cast a wide format when the variables per time period are not uniquely specified

1 Answers1