1

I'm trying to reshape my data using dcast. I'm working with samples where each sample has 10-30 sample units. I can't have my data aggregate.

My data is in this format:

ID  total
sample_1    1
sample_1    0
sample_1    2
sample_1    1
sample_1    0
sample_1    0
sample_1    2
sample_1    1
sample_1    0
sample_1    2
sample_1    1
sample_1    4
sample_2    2
sample_2    1
sample_2    2
sample_2    0
sample_2    0
sample_2    0
sample_2    1
sample_2    2
sample_2    1
sample_2    4
sample_2    5
sample_2    2
sample_2    1
sample_3    0
sample_3    0
sample_3    1
sample_3    2
sample_3    1
sample_3    0
sample_3    2
sample_3    1
sample_3    4
sample_3    5
sample_3    1
sample_3    1
sample_3    0
sample_3    0
sample_3    1

And I want it to looks like it:

sample_1    sample_2    sample_3
1           2           0
0           1           0
2           2           1
1           0           2
0           0           1
0           0           0
2           1           2
1           2           1
0           1           4
2           4           5
1           5           1
4           2           1
            1           0
                        0
                        1

Where my sample ID's turn into different columns.

I tried in several ways but R keep aggregating it.

Uwe
  • 41,420
  • 11
  • 90
  • 134
  • Try `chunked <- sapply(split(dff, dff$ID), function(x) x[,2]); sapply(chunked, function(x) { length(x) = max(sapply(chunked, length)) x })`. I am assuming here that your dataset is called `dff`. – Abdou Mar 21 '17 at 02:18

1 Answers1

3

You can do this with dcast() but you have to add row numbers for each ID.

The data.table package is another package besides reshape2 which implements dcast(). data.table has a handy rowid() function to generate unique row ids within each group. WIth that, we get:

library(data.table)
dcast(setDT(DF), rowid(ID) ~ ID, value.var = "total")
#    ID sample_1 sample_2 sample_3
# 1:  1        1        2        0
# 2:  2        0        1        0
# 3:  3        2        2        1
# 4:  4        1        0        2
# 5:  5        0        0        1
# 6:  6        0        0        0
# 7:  7        2        1        2
# 8:  8        1        2        1
# 9:  9        0        1        4
#10: 10        2        4        5
#11: 11        1        5        1
#12: 12        4        2        1
#13: 13       NA        1        0
#14: 14       NA       NA        0
#15: 15       NA       NA        1

However, I recommend to continue any data processing in long format and use grouping. That's much easier than to work on individual columns. For instance,

# count observations by group
DF[, .N, by = ID]
#         ID  N
#1: sample_1 12
#2: sample_2 13
#3: sample_3 15

# compute mean by group
DF[, mean(total), by = ID]
#         ID       V1
#1: sample_1 1.166667
#2: sample_2 1.615385
#3: sample_3 1.266667

# get min and max by group
DF[, .(min = min(total), max = max(total)), by = ID]
#         ID min max
#1: sample_1   0   4
#2: sample_2   0   5
#3: sample_3   0   5

# the same using range()
DF[, as.list(range(total)), by = ID]
#         ID V1 V2
#1: sample_1  0  4
#2: sample_2  0  5
#3: sample_3  0  5

Data

DF <- structure(list(ID = c("sample_1", "sample_1", "sample_1", "sample_1", 
"sample_1", "sample_1", "sample_1", "sample_1", "sample_1", "sample_1", 
"sample_1", "sample_1", "sample_2", "sample_2", "sample_2", "sample_2", 
"sample_2", "sample_2", "sample_2", "sample_2", "sample_2", "sample_2", 
"sample_2", "sample_2", "sample_2", "sample_3", "sample_3", "sample_3", 
"sample_3", "sample_3", "sample_3", "sample_3", "sample_3", "sample_3", 
"sample_3", "sample_3", "sample_3", "sample_3", "sample_3", "sample_3"
), total = c(1L, 0L, 2L, 1L, 0L, 0L, 2L, 1L, 0L, 2L, 1L, 4L, 
2L, 1L, 2L, 0L, 0L, 0L, 1L, 2L, 1L, 4L, 5L, 2L, 1L, 0L, 0L, 1L, 
2L, 1L, 0L, 2L, 1L, 4L, 5L, 1L, 1L, 0L, 0L, 1L)), .Names = c("ID", 
"total"), row.names = c(NA, -40L), class = "data.frame")
Uwe
  • 41,420
  • 11
  • 90
  • 134
  • "However, I recommend to continue any data processing in long format and use grouping. That's much easier than to work on individual columns." But working in wide format has the advantage that you can vectorise over rows, which may make your code a lot faster. You can't do that if your data items — the groups — occupy more than one row. – Phil van Kleur Mar 13 '20 at 11:30