I have a dataset that I have been cleaning for someone else, and they want one specific column split into several columns by type of observation. For example, this is a column of diagnoses, and she wants it expanded so that there is one column per diagnosis. Thus a column containing Depression, ADHD, Asthma, Cancer, etc. would be expanded into one column called depression, one called ADHD, and so on.

I'm pretty sure this violates the principles of tidy data, but the person I am doing this for is adamant this is the way they want it done. So I have tried looking at the tidyr and dplyr packages but so far I am having no luck and could use some advice.

Thanks for your help in advance

   Order Diagnosis

1   1   Synaesthesia
2   1   Synaesthesia
3   1   Synaesthesia
4   1   Synaesthesia
5   1   Synaesthesia
6   1   Synaesthesia
7   1   ADHD
8   1   ADHD
9   1   ADHD
10  1   ADHD
11  1   ADHD
12  1   ADHD
13  1   ADHD
14  1   ADHD
15  1   ADHD
16  1   ADHD
17  1   ADHD
18  1   ADHD
19  1   ADHD
20  1   ADHD
21  1   ADHD
22  1   ADHD
23  1   ADHD
24  1   ADHD
25  1   ADHD
26  1   ADHD
27  1   ADHD
28  1   ADHD
29  1   ADHD
30  1   ADHD
31  1   ADHD
32  1   ADHD
33  1   ADHD
34  1   ADHD
35  1   ADHD
36  1   ADHD
37  1   ADHD
googleplex101
  • You may want to look at `reshape2` package and convert the data from long to wide form. – Metrics Feb 14 '15 at 14:28
  • Thanks for your comment. Can you expand a little on which specific functions in reshape2 I should use please? – googleplex101 Feb 14 '15 at 14:49
  • See here: http://www.cookbook-r.com/Manipulating_data/Converting_data_between_wide_and_long_format/. Alternatively, you can use `tidyr` package: http://blog.rstudio.org/2014/07/22/introducing-tidyr/ – Metrics Feb 14 '15 at 14:57
  • Oh dear, I don't seem to be able to get the hang of this. Can you give me an example please? It just doesn't seem to be working for me sorry. – googleplex101 Feb 14 '15 at 15:21
  • So far what I have is this: `data3 <- dcast(data2, . ~ diagnosis)`, which returns a data frame containing numeric values for how often each diagnosis is present in the data. What I want is a series of columns by diagnosis with the diagnosis strings as variables, located on the rows where the subject has that diagnosis. – googleplex101 Feb 14 '15 at 15:27
  • Please post a sample data. – Metrics Feb 14 '15 at 15:31
  • Sorry, put example data in original post. – googleplex101 Feb 14 '15 at 15:45
  • can you label the column name properly? Also, do you have only two columns? – Metrics Feb 14 '15 at 15:55
  • Named the columns correctly. No I have 63 columns, but I am worried about revealing any more as it is not my data. – googleplex101 Feb 14 '15 at 16:00

2 Answers

It's not entirely clear what your expected results are, but one interpretation is that you are looking to recode your data, e.g. by using dummy coding.

A simple way to do this is to use `model.matrix()`. Try this:

model.matrix(~ Diagnosis - 1, dat)

   DiagnosisADHD DiagnosisSynaesthesia
1              0                     1
2              0                     1
3              0                     1
4              0                     1
5              0                     1
6              0                     1
7              1                     0
8              1                     0
9              1                     0
10             1                     0
...
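
The comments on the question suggest the asker wants the diagnosis string itself on the matching rows rather than 0/1 indicators. A minimal sketch of how the indicator matrix could be taken one step further, assuming a stand-in data frame `dat` built to resemble the posted sample (the names `mm`, `wide`, and `result` are made up for illustration and are not part of the answer above):

```r
# Hypothetical stand-in for the asker's data: 6 Synaesthesia rows, then 31 ADHD rows
dat <- data.frame(
  Order = 1,
  Diagnosis = c(rep("Synaesthesia", 6), rep("ADHD", 31)),
  stringsAsFactors = TRUE
)

# One 0/1 indicator column per diagnosis; "- 1" drops the intercept column
mm <- model.matrix(~ Diagnosis - 1, data = dat)

# Strip the "Diagnosis" prefix that model.matrix() adds to each column name
colnames(mm) <- sub("^Diagnosis", "", colnames(mm))

# Replace each 1 with the diagnosis label and each 0 with NA
wide <- as.data.frame(mm)
for (d in names(wide)) {
  wide[[d]] <- ifelse(wide[[d]] == 1, d, NA_character_)
}

# Bind the new columns back onto the original rows
result <- cbind(dat, wide)
head(result, 8)
```

Because the new columns are built row by row from the indicator matrix, each value stays on the row of the subject it belongs to, so the other columns of the real data set could be bound alongside in the same way.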
Andrie

You could split your "vector" (or column in your case), pad it with NAs and cbind it into a fully fledged data.frame or matrix.

x <- sample(LETTERS[1:5], size = 100, replace = TRUE)
sx <- split(x, x)

ml <- max(lengths(sx))

# pad the data with NAs
do.call("cbind", lapply(sx, FUN = function(m) c(m, rep(NA, ml - length(m)))))

      A   B   C   D   E  
 [1,] "A" "B" "C" "D" "E"
 [2,] "A" "B" "C" "D" "E"
 [3,] "A" "B" "C" "D" "E"
 [4,] "A" "B" "C" "D" "E"
 [5,] "A" "B" "C" "D" "E"
 [6,] "A" "B" "C" "D" "E"
 [7,] "A" "B" "C" "D" "E"
 [8,] "A" "B" "C" "D" "E"
 [9,] "A" "B" "C" "D" "E"
[10,] "A" "B" "C" "D" "E"
[11,] "A" "B" "C" "D" "E"
[12,] "A" "B" "C" "D" "E"
[13,] "A" "B" "C" "D" "E"
[14,] "A" "B" "C" "D" "E"
[15,] NA  "B" "C" "D" "E"
[16,] NA  "B" "C" "D" "E"
[17,] NA  "B" "C" "D" "E"
[18,] NA  "B" "C" "D" "E"
[19,] NA  "B" "C" "D" "E"
[20,] NA  "B" "C" "D" "E"
[21,] NA  "B" "C" "D" NA 
[22,] NA  NA  "C" "D" NA 
[23,] NA  NA  NA  "D" NA 
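
Note that the padded result stacks each group at the top of its column, so a value no longer sits on the row it came from. If row alignment matters (as it probably does when the other columns of the data set must stay attached), a variant that keeps one output row per input element might look like the sketch below. The seed and the names `vals` and `wide` are assumptions added for illustration:

```r
# Hypothetical example vector; the seed only makes the run reproducible
set.seed(1)
x <- sample(LETTERS[1:5], size = 100, replace = TRUE)

# For each distinct value, keep it on matching rows and put NA elsewhere
vals <- sort(unique(x))
wide <- sapply(vals, function(v) ifelse(x == v, v, NA_character_))

# One row per element of x, one column per distinct value
dim(wide)
head(wide)
```

Each row then has exactly one non-NA entry, in the column named after that row's value.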
Roman Luštrik