1

I have a data frame that looks like this:

ID rd_test_2011 rd_score_2011 mt_test_2011 mt_score_2011 rd_test_2012 rd_score_2012 mt_test_2012 mt_score_2012
1  A            80            XX           100           NA           NA            BB           45 
2  XX           90            NA           NA            AA           80            XX           80

I want to write a script that would, for IDs that don't have NA's in the yy_test_20xx columns, create a new data frame with the subject taken from the column title, the test name, the test score and year taken from the column title. So, in this example ID 1 would have three entries. Expected output would look like this:

ID   Subject    Test        Score        Year
1    rd         A           80           2011
1    mt         XX          100          2012
1    mt         BB          45           2012
2    rd         XX          90           2011
2    rd         AA          80           2012
2    mt         XX          80           2012

I've tried both reshape and various forms of merged.stack which works in the sense that I get an output that is on the road to being right but I can't understand the inputs well enough to get there all the way:

library(splitstackshape)
merged.stack(x, id.vars='id', var.stubs=c("rd_test","mt_test"), sep="_")

I've had more success (gotten closer) with reshape:

y<- reshape(x, idvar="id", ids=1:nrow(x), times=grep("test", names(x), value=TRUE), 
      timevar="year", varying=list(grep("test", names(x), value=TRUE), grep("score",
      names(x), value=TRUE)), direction="long", v.names=c("test", "score"),
      new.row.names=NULL) 
Steven Beaupré
  • 21,343
  • 7
  • 57
  • 77
n8sty
  • 1,418
  • 1
  • 14
  • 26
  • 1. [What have you tried](http://mattgemmell.com/2008/12/08/what-have-you-tried/) so far? 2. Expected output would be useful. – zero323 Oct 17 '13 at 22:14
  • You're looking for `reshape` – Señor O Oct 17 '13 at 22:30
  • @SeñorO I'm having trouble identifying what the correct arguments for reshape are since the data frame I'm working with has about 100 columns and about 150000 rows, and I only want to reshape on some of the columns. – n8sty Oct 17 '13 at 22:50
  • Then post an example that has that additional feature! – IRTFM Oct 17 '13 at 23:26

3 Answers3

2

Using reshape:

 dat.long <- reshape(dat, direction="long",  varying=list(c(2, 4,6), c(3, 5,7)), 
                       times=2011:2013,timevar='Year',
                       sep="_", v.names=c("Test", "Score"))


dat.long[complete.cases(dat.long),]

      ID Year Test Score id
1.2011  1 2011    A    80  1
2.2011  2 2011   XX    90  2
4.2011  4 2011    A    50  4
5.2011  5 2011    C    50  5
1.2012  1 2012   XX   100  1
3.2012  3 2012    A    10  3
4.2012  4 2012   XX    60  4
5.2012  5 2012    A    75  5
2.2013  2 2013   AA    80  2
4.2013  4 2013   AA    99  4
agstudy
  • 119,832
  • 17
  • 199
  • 261
2

This will get your data into the right format:

df.long = reshape(df, idvar="ID", ids=1:nrow(df), times=grep("Test", names(df), value=TRUE),
 timevar="Year", varying=list(grep("Test", names(df), value=TRUE), 
grep("Score", names(df), value=TRUE)), direction="long", v.names=c("Test", "Score"),
new.row.names=NULL) 

Then omitting NA:

df.long = df.long[!is.na(df.long$Test),]

Then splitting Year to remove Test_:

df.long$Year = sapply(strsplit(df.long$Year, "_"), `[`, 2)

And ordering by ID:

df.long[order(df.long$ID),]

   ID Year Test Score
1   1 2011    A    80
5   1 2012   XX   100
2   2 2011   XX    90
9   2 2013   AA    80
6   3 2012    A    10
3   4 2011    A    50
7   4 2012   XX    60
10  4 2013   AA    99
4   5 2011    C    50
8   5 2012    A    75
Señor O
  • 17,049
  • 2
  • 45
  • 47
  • 1
    I suggest you replace times by this : `times=gsub('.*_','',grep("Test", names(dat), value=TRUE))` , you will get more general solution than mine and you omit the splitting year step. of course +1! – agstudy Oct 17 '13 at 23:11
  • Yeah I was under the impression `times` was doing column matching like varying. – Señor O Oct 17 '13 at 23:15
1

Considering your update, I've entirely rewritten this answer. View the history if you want to see the old version.

The main problem is that your data is "double wide" in a ways. Thus, you can actually solve your problem by reshaping in the "long" direction twice. Alternatively, use melt and *cast to melt your data in a very long format and convert it to a semi-wide format.

However, I would still suggest "splitstackshape" (and not just because I wrote it). It can handle this problem fine, but it needs you to rearrange your names of your data. The part of the name that will result in the names of the new columns should come first. In your example, that means "test" and "score" should be the first part of the variable name.

For this, we can use some gsub to rearrange the existing names.

library(splitstackshape)
setnames(mydf, gsub("(rd|mt)_(score|test)_(.*)", "\\2_\\1_\\3", names(mydf)))
names(mydf)
# [1] "ID"            "test_rd_2011"  "score_rd_2011" "test_mt_2011" 
# [5] "score_mt_2011" "test_rd_2012"  "score_rd_2012" "test_mt_2012" 
# [9] "score_mt_2012"
out <- merged.stack(mydf, "ID", var.stubs=c("test", "score"), sep="_")
setnames(out, c(".time_1", ".time_2"), c("Subject", "Year"))
out[complete.cases(out), ]
#    ID Subject Year test score
# 1:  1      mt 2011   XX   100
# 2:  1      mt 2012   BB    45
# 3:  1      rd 2011    A    80
# 4:  2      mt 2012   XX    80
# 5:  2      rd 2011   XX    90
# 6:  2      rd 2012   AA    80

For the benefit of others, "mydf" in this answer is defined as:

mydf <- structure(list(ID = 1:2, rd_test_2011 = c("A", "XX"), 
    rd_score_2011 = c(80L, 90L), mt_test_2011 = c("XX", NA), 
    mt_score_2011 = c(100L, NA), rd_test_2012 = c(NA, "AA"), 
    rd_score_2012 = c(NA, 80L), mt_test_2012 = c("BB", "XX"), 
    mt_score_2012 = c(45L, 80L)), 
    .Names = c("ID", "rd_test_2011", "rd_score_2011", "mt_test_2011", 
    "mt_score_2011", "rd_test_2012", "rd_score_2012", "mt_test_2012", 
    "mt_score_2012"), class = "data.frame", row.names = c(NA, -2L))
A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485