1

I have a panel data set with a large number of time periods and units, and also a large number of variables which I observe for each period and unit.

Since I want to apply a univariate time series operation for each unit and variable, I have to bring the panel data to wide format (using data.table::dcast) so that each column shows now a variable for a given unit across time.

After applying my time series observation I want to go back to the "long" panel format (using data.table::melt), however, here, I lose information on the unit names and variables. Since the data.table is quite large, I am afraid of mixing up data here which is why I would like to find a melt operation that preserves the variables and value names.

Consider the following example panel data set:

require(data.table)

dates <- seq(from = as.Date("2007-02-01"), to = as.Date("2012-01-01"), by = "month")
id <- paste0(c("A", "B", "C"), 1:10)


DT <- data.table(
  time = rep(dates, 10),
  idx = rep(id, each = 60),
  String1 = runif(600),
  String2 = runif(600),
  String3 = runif(600)
)       


           time idx   String1    String2   String3
  1: 2007-02-01  A1 0.5412122 0.23502234 0.3858354
  2: 2007-03-01  A1 0.3248168 0.32884580 0.7183147
  3: 2007-04-01  A1 0.4183034 0.40781723 0.7438458
  4: 2007-05-01  A1 0.3597997 0.51745402 0.1660566
  5: 2007-06-01  A1 0.6405351 0.96121729 0.7786483
 ---                                              
596: 2011-09-01 A10 0.7896711 0.64740298 0.8285408
597: 2011-10-01 A10 0.6582652 0.83986453 0.1292342
598: 2011-11-01 A10 0.1110465 0.41741672 0.7076345
599: 2011-12-01 A10 0.5108850 0.02940229 0.9038370
600: 2012-01-01 A10 0.2605052 0.10136480 0.3881788

I am bringing this panel data set to wide format. After applying some time series operations with it (not shown here), I will have to delete some columns if there's not enough data for them. Then, I bring the data back to long format

variable_names <- names(DT[,-c("time", "idx")])

DT_long <- dcast(DT, time ~ idx, value.var = variable_names)
DT_long <- DT_long[,-(5:10)]


DT_wide <- melt(DT_long, measure = patterns("^String1", "^String2", "^String3"), value.name = variable_names, variable.name = "idx)

           time idx   String1   String2    String3
  1: 2007-02-01   1 0.9794707 0.5290352 0.68009050
  2: 2007-03-01   1 0.4016173 0.9229200 0.38652407
  3: 2007-04-01   1 0.9475505 0.5956701 0.24686007
  4: 2007-05-01   1 0.6465847 0.8233340 0.08008369
  5: 2007-06-01   1 0.5704834 0.8232598 0.85790038
 ---                                              
596: 2011-09-01  10        NA 0.5525413 0.79994190
597: 2011-10-01  10        NA 0.3895864 0.41347910
598: 2011-11-01  10        NA 0.3123646 0.44461146
599: 2011-12-01  10        NA 0.2148686 0.37609448
600: 2012-01-01  10        NA 0.7314114 0.47138012

DT_wide now looks like this which means I have lost the information on variable (here: idx) names. One fix I imagine is numbering all idx through with numbers and then doing this operation. However, if possible I would like to preserve string names since they help me a lot to distinguish and understand the values. Can someone help me how I can rewrite melt to include this information?

Florestan
  • 127
  • 1
  • 15
  • 1
    Could you edit your post to include *minimal* sample data and matching expected output (think if it's really necessary to have a 600 row `data.table`). Also `DT` looks like it's in wide format already (a long table would have values from the `String1`, `String2`, `String3` columns in a `key` & `value` column). So instead of *"I am bringing this panel data set to wide format"* did you mean to say "bringing this panel data set to *long* format"? – Maurits Evers Sep 09 '19 at 10:13
  • 1
    You are aware of the argument `value.name` in `melt()`? – s_baldur Sep 09 '19 at 10:14
  • Yes, value.name solves the "value" problem but not the idx problem. I am adjusting my question, thanks @sindri_baldur! – Florestan Sep 09 '19 at 10:22
  • @MauritsEvers For me the time - unit panel format is the right way to work with this data and I need to transform it as in my example to wide format. I see that it might not be the definition of long format most people use so I am replacing it by panel format. I do not think however that this really changes the way I framed my question. – Florestan Sep 09 '19 at 10:28
  • 1
    @Florestan The definition of long and wide data is pretty standardised now (but even [Hadley refers to the distinction as being imprecise](https://vita.had.co.nz/papers/tidy-data.pdf)). The main issue is with your sample data being quite unwieldy. I don't think it's necessary to have a 600 row `data.table` to illustrate your point. You're more likely to get a quick & good response if you make it easy for people to help you. After all, *you* are asking others to help you with *your* problem. – Maurits Evers Sep 09 '19 at 10:40

1 Answers1

1

Reading through ?melt and the Efficient reshaping vignette, I cant see how to do this directly with melt.data.table. However, you could use pivot_longer() from the development version of tidyr:

library(data.table)

dates <- seq(from = as.Date("2007-02-01"), to = as.Date("2007-04-01"), by = "month")
id <- c("A1", "B2")

DT <- data.table(
  time = rep(dates, 2),
  idx = rep(id, each = 3),
  String1 = runif(6),
  String2 = runif(6)
)

DT
#>          time idx   String1   String2
#> 1: 2007-02-01  A1 0.6453802 0.4641508
#> 2: 2007-03-01  A1 0.1106000 0.3750282
#> 3: 2007-04-01  A1 0.6356700 0.9601759
#> 4: 2007-02-01  B2 0.9821609 0.1782534
#> 5: 2007-03-01  B2 0.4786173 0.1557481
#> 6: 2007-04-01  B2 0.7720111 0.7982246

variable_names <- names(DT[, -c("time", "idx")])
DT_long <- dcast(DT, time ~ idx, value.var = variable_names)

DT_long
#>          time String1_A1 String1_B2 String2_A1 String2_B2
#> 1: 2007-02-01  0.6453802  0.9821609  0.4641508  0.1782534
#> 2: 2007-03-01  0.1106000  0.4786173  0.3750282  0.1557481
#> 3: 2007-04-01  0.6356700  0.7720111  0.9601759  0.7982246
library(tidyr) # devtools::install_github("tidyverse/tidyr")

pivot_longer(
  data = DT_long,
  cols = starts_with("String"),
  names_sep = "_",
  names_to = c(".value", "idx")
)
#> # A tibble: 6 x 4
#>   time       idx   String1 String2
#>   <date>     <chr>   <dbl>   <dbl>
#> 1 2007-02-01 A1      0.645   0.464
#> 2 2007-02-01 B2      0.982   0.178
#> 3 2007-03-01 A1      0.111   0.375
#> 4 2007-03-01 B2      0.479   0.156
#> 5 2007-04-01 A1      0.636   0.960
#> 6 2007-04-01 B2      0.772   0.798

Created on 2019-09-09 by the reprex package (v0.3.0)

Mikko Marttila
  • 10,972
  • 18
  • 31
  • Thanks, very helpful comment and solves the problem! (If someone has other ideas, I would still be interested in a non-development version since at my workplace, usage of development versions is discouraged) – Florestan Sep 09 '19 at 12:46