
I'm in the process of compiling data from different data sets into one data set for analysis. I'll be doing data exploration, trying different things to find out what regularities may be hidden in the data, so I don't currently have a specific method in mind. Now I'm wondering if I should compile my data into long or wide format.

Which format should I use, and why?

I understand that data can be reshaped from long to wide or vice versa, but the mere existence of this functionality implies that the need to reshape sometimes arises and this need in turn implies that a specific format might be better suited for a certain task. So when do I need which format, and why?

I'm not asking about performance. That has been covered in other questions.

  • The overwhelming majority of R functions are designed for long-format data.frame input. There are some special domains where the wide format (typically matrices or arrays) is more common. – Roland Jan 04 '16 at 11:55
  • I would suggest reading the tidyr "Tidy Data" vignette (https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html). Aside from describing the usage of tidyr, it provides some insight into why data is structured in particular ways. – NGaffney Jan 04 '16 at 12:30
  • I find that in almost all of my work, imported data is `melt`ed to long form and only ever `cast` to wide when the data is exported. – Joe Jan 04 '16 at 13:12
  • Thank you for the comments, especially for the link to the document on tidy data. @Roland Could you tell me in which special domains the wide format is more common? –  Jan 04 '16 at 15:59

3 Answers


Hadley Wickham's Tidy Data paper, together with the tidyr package that is his latest implementation of its principles, is a great place to start.

The rough answer to the question is that data should always be long during processing and widened only for display purposes. Be cautious with this, though: here "long" really means "tidy" rather than the pure long form.

Examples

Take, for example, the mtcars dataset. This is already in tidy form, in that each row represents a single observation. So "lengthening" it, to get something like this

        model type   value
1 AMC Javelin  mpg  15.200
2 AMC Javelin  cyl   8.000
3 AMC Javelin disp 304.000
4 AMC Javelin   hp 150.000
5 AMC Javelin drat   3.150
6 AMC Javelin   wt   3.435

is counterproductive; mpg and cyl are not comparable in any meaningful way.
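For reference, the long table above can be produced with something like the following sketch, which uses tidyr's older gather() interface (newer tidyr versions would use pivot_longer()) together with tibble and dplyr:

```r
library(tidyr)   # gather(); pivot_longer() in newer tidyr
library(tibble)  # rownames_to_column()
library(dplyr)   # arrange()

mtcars_long <- mtcars %>%
  rownames_to_column("model") %>%  # keep the car name as a regular column
  gather(type, value, -model) %>%  # every measurement column becomes a row
  arrange(model)

head(mtcars_long)  # the six "AMC Javelin" rows shown above
```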

Taking the ChickWeight dataset (which is in long form) and transforming it to wide by time

require(tidyr)
ChickWeight %>% spread(Time, weight)
   Chick Diet  0  2  4  6   8  10  12  14  16  18  20  21
1     18    1 39 35 NA NA  NA  NA  NA  NA  NA  NA  NA  NA
2     16    1 41 45 49 51  57  51  54  NA  NA  NA  NA  NA
3     15    1 41 49 56 64  68  68  67  68  NA  NA  NA  NA
4     13    1 41 48 53 60  65  67  71  70  71  81  91  96
5      9    1 42 51 59 68  85  96  90  92  93 100 100  98
6     20    1 41 47 54 58  65  73  77  89  98 107 115 117
7     10    1 41 44 52 63  74  81  89  96 101 112 120 124
8      8    1 42 50 61 71  84  93 110 116 126 134 125  NA
9     17    1 42 51 61 72  83  89  98 103 113 123 133 142
10    19    1 43 48 55 62  65  71  82  88 106 120 144 157
11     4    1 42 49 56 67  74  87 102 108 136 154 160 157
12     6    1 41 49 59 74  97 124 141 148 155 160 160 157
13    11    1 43 51 63 84 112 139 168 177 182 184 181 175
...

gives a visualization that may be useful, but for data analysis purposes it is very inconvenient, as computing things like growth rate becomes cumbersome.
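By contrast, with ChickWeight kept in long form, a per-chick growth rate is just a grouped difference. A sketch with dplyr (one of several ways to do this; the `.by_group` argument assumes a reasonably recent dplyr, >= 0.7):

```r
library(dplyr)

growth <- ChickWeight %>%
  group_by(Chick) %>%
  arrange(Time, .by_group = TRUE) %>%                # order within each chick
  mutate(rate = (weight - lag(weight)) /             # weight gained ...
                (Time - lag(Time))) %>%              # ... per unit of time
  ungroup()

head(growth)  # the first row of each chick has rate NA (no earlier observation)
```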

user295691
  • Great answer. Thank you, @user295691. The examples are really helpful. But here: "So widening it, to get something like this", didn't you mean that you transformed the originally wide data to long? –  Jan 06 '16 at 14:02
  • Whoops! Fixed; changed "widening" to "lengthening" – user295691 Jan 06 '16 at 14:18

As Roland mentioned in the comments, most R functions expect data in long format, and it is often easier to process data that way.

On the other hand, the wide format is easier for people to view and comprehend, especially while data is being entered and validated, where human comprehension is important for ensuring quality and accuracy.

So I find that data tends to start out life in wide format and then become long as it is used more for processing. Fortunately, converting back and forth is pretty easy nowadays, especially with the tidyr package.
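As a sketch of that round trip with tidyr: widen for human inspection, then lengthen again for processing (spread()/gather() shown here; pivot_wider()/pivot_longer() are the newer equivalents):

```r
library(tidyr)

# wide: one column per time point, easy for a person to scan
chick_wide <- ChickWeight %>% spread(Time, weight)

# back to long: one row per observation, ready for analysis;
# na.rm = TRUE drops the cells that were NA in the wide form
# (Time comes back as character; convert with as.numeric() if needed)
chick_long <- chick_wide %>%
  gather(Time, weight, -Chick, -Diet, na.rm = TRUE)
```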

Holger Brandl
Mike Wise
  • Thank you, great explanation. Together with the answer by user295691 this gave me the understanding I was seeking. Wish I could accept both answers ... I proposed an edit, because as a non-native speaker I was unfamiliar with the abbreviation "otoh". –  Jan 06 '16 at 14:04
  • Well, his answer was more comprehensive, so that is fine with me. I upvoted him too. – Mike Wise Jan 06 '16 at 14:06

The answer is, imho, quite straightforward. By default the long format takes up significantly more space, since the new "variable" column needs to be stored as well. However, the long format can also compress your data significantly: if you have a very sparse matrix, that is, if many columns are NA, you can specify na.rm = TRUE when reshaping to drop the missing entries.
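As an illustration of that NA removal, here is a sketch using tidyr's gather(), whose na.rm argument drops rows for missing cells (reshape2's melt() has the same option); the `sparse` data frame is a made-up example:

```r
library(tidyr)

# a small, deliberately sparse wide table
sparse <- data.frame(id = 1:3,
                     a  = c(1, NA, NA),
                     b  = c(NA, NA, 2))

gather(sparse, variable, value, -id)                 # 6 rows, NAs kept
gather(sparse, variable, value, -id, na.rm = TRUE)   # 2 rows, NAs dropped
```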

Furthermore, the long format allows more efficient calculations in many cases, but you defined that as out of scope.

CAFEBABE
  • Thank you, @CAFEBABE. I do not understand this sentence, however: "Or if many columns have a specific value you can define the drop argument." Could you elaborate? –  Jan 04 '16 at 15:55
  • I confused something. Sorry. I updated my answer accordingly. I'm actually puzzled that removal of elements is only for NA values and not for zeros. – CAFEBABE Jan 04 '16 at 16:32