0

I have a data set (~200K rows) similar to this mock data:

mydf <- read.csv(url("http://pastebin.com/raw.php?i=YWTW98Pu"), header=T)

I am trying to widen it from its current form to what I fully understand is an untidy, bad schema (external requirements, etc).

Specifically, what I am looking to do is have one row per student with each duplicate field numbered--i.e.

| StudentID | Major     | University | Birthday | EnrollmentDate | CourseID.1 | CourseStartDate.1 | CourseEndDate.1 | CourseDescription.1 | Instructor.1 | Hours.1 | CourseID.2 | CourseStartDate.2 | CourseEndDate.2 | CourseDescription.2 | Instructor.2   | Hours.2 | CourseID.3... (etc) |
|-----------+-----------+------------+----------+----------------+------------+-------------------+-----------------+---------------------+--------------+---------+------------+-------------------+-----------------+---------------------+----------------+---------+---------------------|
|         1 | Economics | Oxford     | 4/9/1956 | 9/1/2001       |        100 | 8/15/2014         | 8/15/2014       | Stats With Cats     | Charlie Kufs |       3 |        101 | 8/16/2014         | 8/16/2014       | Fun with Cthulhu    | James Hatfield |       1 |                     |

Problems I've encountered are that I want the course variables to be numbered ordinally--i.e. 1, 2, 3, 4...n per student. That is, for each course they took, I want the column name to relate to the order they took the course in as opposed to the order it is labeled by specific dates or course ID.

The reshaping examples I have seen all want to name the widened columns by the actual value--e.g. EnrollmentDate9/1/2001.

  • @ alon bassok:your feedback on different answers will be appreciated. – rnso Aug 21 '14 at 04:31
  • How can same StudentID have different Birthdays and different EnrollmentDates (in different rows)? For example see first 4 rows (StudentID =1 but different birthdays in each row). – rnso Aug 21 '14 at 12:04
  • @rnso, birthdays and enrollment dates should not have differed. That was an error in the way the dummy data was created. – alon bassok Aug 21 '14 at 20:12

3 Answers3

2

Your sample is a bit odd because the Birthday and EnrollmentDate values vary along StudentID. So I ended up dropping them for this transformation because that didn't allow for collapsing.

So basically I just needed to add an ID to make Student/Class unique. Then I just used the base reshape function.

mydf <- read.csv(url("http://pastebin.com/raw.php?i=YWTW98Pu"), header=T)
reshape(
    transform(mydf, StClID = ave(1:nrow(mydf), StudentID, FUN=seq_along)),
    timevar = "StClID", 
    idvar = names(mydf)[1:3], 
    v.names=names(mydf)[6:11], 
    drop=names(mydf)[4:5], 
    direction = "wide"
)

The first few columns of this result is

   StudentID        Major     University CourseID.1 CourseStartDate.1
1          1    Economics         Oxford        100         8/15/2014
5          2 Anthropology Phoenix Online        100         8/15/2014
15         3        Music        Harvard        100         8/15/2014
22         4  Engineering          DeVry        100         8/15/2014
25         5          Art     Bob Ross U        100         8/15/2014
   CourseEndDate.1 Course.Description.1
1        8/15/2014      Stats With Cats
5        8/15/2014      Stats With Cats
15       8/15/2014      Stats With Cats
22       8/15/2014      Stats With Cats
25       8/15/2014      Stats With Cats
MrFlick
  • 195,160
  • 17
  • 277
  • 295
1

This is a little messy, but it should be fairly efficient with the size of your data. @MrFlicks answer is better, but I had already started working on this problem and it gives a different approach.

#Set object as data.table
require(data.table)
setDT(mydf)

#Convert vars to character
mydf <- mydf[ , lapply(.SD,as.character)]

#Group up vars that change from course to course
mydf2 <- mydf[ , list(list(.SD)), by=list(StudentID,Major,University)]

#Arrange variables that change over time
oth.vars <- rbindlist(lapply(mydf2$V1,function(x) as.data.table(t(unlist(x)))),fill=TRUE)
setcolorder(oth.vars,names(oth.vars)[order(names(oth.vars))])

#Recombine data
mydf2 <- cbind(mydf2,oth.vars)
mydf2[ , V1 := NULL]

And a bit of the output:

   StudentID        Major     University  Birthday1 Birthday10  Birthday2  Birthday3  Birthday4  Birthday5
1:         1    Economics         Oxford   4/9/1956         NA  4/10/1956  4/11/1956  4/12/1956         NA
2:         2 Anthropology Phoenix Online 12/15/1970 12/24/1970 12/16/1970 12/17/1970 12/18/1970 12/19/1970
3:         3        Music        Harvard  7/30/1967         NA  7/31/1967   8/1/1967   8/2/1967   8/3/1967
4:         4  Engineering          DeVry 12/11/1978         NA 12/12/1978 12/13/1978         NA         NA
5:         5          Art     Bob Ross U  7/25/1985         NA  7/26/1985  7/27/1985  7/28/1985  7/29/1985
Mike.Gahan
  • 4,565
  • 23
  • 39
1

Following may be useful, though all details are combined in one column (which can be separated using strsplit if needed):

firstpart = paste(mydf[,1],mydf[,2],mydf[,3],mydf[,4],mydf[,5],sep=",")
secondpart = paste(mydf[,6],mydf[,7],mydf[,8],mydf[,9],mydf[,10],mydf[,11],sep=",")

duplist = which(duplicated(mydf[,1]))
entrystr = ""
outdf = data.frame(ALLDETAILS=character(), stringsAsFactors=F)
for(i in 1:nrow(mydf)){
    if(i %in% duplist){
        entrystr=paste(entrystr, secondpart[i], sep=';')
    }
    else {
        if(i>1)outdf[nrow(outdf)+1,]=entrystr;
        entrystr=paste(firstpart[i], secondpart[i], sep=';')
    }
}
outdf[nrow(outdf)+1,]=entrystr;
outdf

                                                                                                                                           ALLDETAILS
1                                                                                                                                                                                                                                                                                                                                                                  1,Economics,Oxford,4/9/1956,9/1/2001;100,8/15/2014,8/15/2014,Stats With Cats,Charlie Kufs,3;101,8/16/2014,8/16/2014,Fun with Cthulhu,James Hatfield,1;102,8/17/2014,8/17/2014,The Spaghetti Monster and U,Bobby Henderson,3;103,8/18/2014,8/18/2014,Cake for Breakfast,Bill Cosby,3
2 2,Anthropology,Phoenix Online,12/15/1970,8/15/2003;100,8/15/2014,8/15/2014,Stats With Cats,Charlie Kufs,3;101,8/16/2014,8/16/2014,Fun with Cthulhu,James Hatfield,1;102,8/17/2014,8/17/2014,The Spaghetti Monster and U,Bobby Henderson,3;103,8/18/2014,8/18/2014,Cake for Breakfast,Bill Cosby,3;104,8/19/2014,8/19/2014,Flattening Ones Wang,Thomas Hendry,4;105,8/20/2014,8/20/2014,Lemon Party Home Economics,John Holmes,1;106,8/21/2014,8/21/2014,Paint By Numbers,Max Klein,1;107,8/22/2014,8/22/2014,Where IS Waldo?,Martin Handford,3;108,8/23/2014,8/23/2014,Drugs Not Hugs,Nancy Reagan,1;109,8/24/2014,8/24/2014,Whirled Peas,Bo (dog),3
3                                                                                                                                                                                3,Music,Harvard,7/30/1967,9/27/1999;100,8/15/2014,8/15/2014,Stats With Cats,Charlie Kufs,3;101,8/16/2014,8/16/2014,Fun with Cthulhu,James Hatfield,1;102,8/17/2014,8/17/2014,The Spaghetti Monster and U,Bobby Henderson,3;103,8/18/2014,8/18/2014,Cake for Breakfast,Bill Cosby,3;104,8/19/2014,8/19/2014,Flattening Ones Wang,Thomas Hendry,4;105,8/20/2014,8/20/2014,Lemon Party Home Economics,John Holmes,1;106,8/21/2014,8/21/2014,Paint By Numbers,Max Klein,1
4                                                                                                                                                                                                                                                                                                                                                                                                                      4,Engineering,DeVry,12/11/1978,1/16/1949;100,8/15/2014,8/15/2014,Stats With Cats,Charlie Kufs,3;101,8/16/2014,8/16/2014,Fun with Cthulhu,James Hatfield,1;102,8/17/2014,8/17/2014,The Spaghetti Monster and U,Bobby Henderson,3
5                                                                                                                                                                                                                                     5,Art,Bob Ross U,7/25/1985,6/5/2008;100,8/15/2014,8/15/2014,Stats With Cats,Charlie Kufs,3;101,8/16/2014,8/16/2014,Fun with Cthulhu,James Hatfield,1;102,8/17/2014,8/17/2014,The Spaghetti Monster and U,Bobby Henderson,3;103,8/18/2014,8/18/2014,Cake for Breakfast,Bill Cosby,3;104,8/19/2014,8/19/2014,Flattening Ones Wang,Thomas Hendry,4;105,8/20/2014,8/20/2014,Lemon Party Home Economics,John Holmes,1

The data is as follows:

mydf = structure(list(StudentID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 
5L, 5L, 5L, 5L, 5L, 5L), Major = structure(c(3L, 3L, 3L, 3L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 5L, 5L, 5L, 5L, 5L, 5L, 
5L, 4L, 4L, 4L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Anthropology", 
"Art", "Economics", "Engineering", "Music"), class = "factor"), 
    University = structure(c(4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 
    5L, 5L, 5L, 5L, 5L, 5L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 
    2L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Bob Ross U", "DeVry", 
    "Harvard", "Oxford", "Phoenix Online"), class = "factor"), 
    Birthday = structure(c(17L, 14L, 15L, 16L, 4L, 5L, 6L, 7L, 
    8L, 9L, 10L, 11L, 12L, 13L, 23L, 25L, 26L, 27L, 28L, 29L, 
    30L, 1L, 2L, 3L, 18L, 19L, 20L, 21L, 22L, 24L), .Label = c("12/11/1978", 
    "12/12/1978", "12/13/1978", "12/15/1970", "12/16/1970", "12/17/1970", 
    "12/18/1970", "12/19/1970", "12/20/1970", "12/21/1970", "12/22/1970", 
    "12/23/1970", "12/24/1970", "4/10/1956", "4/11/1956", "4/12/1956", 
    "4/9/1956", "7/25/1985", "7/26/1985", "7/27/1985", "7/28/1985", 
    "7/29/1985", "7/30/1967", "7/30/1985", "7/31/1967", "8/1/1967", 
    "8/2/1967", "8/3/1967", "8/4/1967", "8/5/1967"), class = "factor"), 
    EnrollmentDate = structure(c(23L, 24L, 29L, 30L, 13L, 14L, 
    15L, 16L, 17L, 18L, 19L, 20L, 21L, 22L, 25L, 26L, 27L, 28L, 
    1L, 2L, 3L, 4L, 5L, 6L, 8L, 9L, 10L, 11L, 12L, 7L), .Label = c("10/1/1999", 
    "10/2/1999", "10/3/1999", "1/16/1949", "1/17/1949", "1/18/1949", 
    "6/10/2008", "6/5/2008", "6/6/2008", "6/7/2008", "6/8/2008", 
    "6/9/2008", "8/15/2003", "8/16/2003", "8/17/2003", "8/18/2003", 
    "8/19/2003", "8/20/2003", "8/21/2003", "8/22/2003", "8/23/2003", 
    "8/24/2003", "9/1/2001", "9/2/2001", "9/27/1999", "9/28/1999", 
    "9/29/1999", "9/30/1999", "9/3/2001", "9/4/2001"), class = "factor"), 
    CourseID = c(100L, 101L, 102L, 103L, 100L, 101L, 102L, 103L, 
    104L, 105L, 106L, 107L, 108L, 109L, 100L, 101L, 102L, 103L, 
    104L, 105L, 106L, 100L, 101L, 102L, 100L, 101L, 102L, 103L, 
    104L, 105L), CourseStartDate = structure(c(1L, 2L, 3L, 4L, 
    1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 1L, 2L, 3L, 4L, 
    5L, 6L, 7L, 1L, 2L, 3L, 1L, 2L, 3L, 4L, 5L, 6L), .Label = c("8/15/2014", 
    "8/16/2014", "8/17/2014", "8/18/2014", "8/19/2014", "8/20/2014", 
    "8/21/2014", "8/22/2014", "8/23/2014", "8/24/2014"), class = "factor"), 
    CourseEndDate = structure(c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 
    5L, 6L, 7L, 8L, 9L, 10L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 1L, 
    2L, 3L, 1L, 2L, 3L, 4L, 5L, 6L), .Label = c("8/15/2014", 
    "8/16/2014", "8/17/2014", "8/18/2014", "8/19/2014", "8/20/2014", 
    "8/21/2014", "8/22/2014", "8/23/2014", "8/24/2014"), class = "factor"), 
    Course.Description = structure(c(7L, 4L, 8L, 1L, 7L, 4L, 
    8L, 1L, 3L, 5L, 6L, 9L, 2L, 10L, 7L, 4L, 8L, 1L, 3L, 5L, 
    6L, 7L, 4L, 8L, 7L, 4L, 8L, 1L, 3L, 5L), .Label = c("Cake for Breakfast", 
    "Drugs Not Hugs", "Flattening Ones Wang", "Fun with Cthulhu", 
    "Lemon Party Home Economics", "Paint By Numbers", "Stats With Cats", 
    "The Spaghetti Monster and U", "Where IS Waldo?", "Whirled Peas"
    ), class = "factor"), Instructor = structure(c(4L, 5L, 2L, 
    1L, 4L, 5L, 2L, 1L, 10L, 6L, 8L, 7L, 9L, 3L, 4L, 5L, 2L, 
    1L, 10L, 6L, 8L, 4L, 5L, 2L, 4L, 5L, 2L, 1L, 10L, 6L), .Label = c("Bill Cosby", 
    "Bobby Henderson", "Bo (dog)", "Charlie Kufs", "James Hatfield", 
    "John Holmes", "Martin Handford", "Max Klein", "Nancy Reagan", 
    "Thomas Hendry"), class = "factor"), Hours = c(3L, 1L, 3L, 
    3L, 3L, 1L, 3L, 3L, 4L, 1L, 1L, 3L, 1L, 3L, 3L, 1L, 3L, 3L, 
    4L, 1L, 1L, 3L, 1L, 3L, 3L, 1L, 3L, 3L, 4L, 1L)), .Names = c("StudentID", 
"Major", "University", "Birthday", "EnrollmentDate", "CourseID", 
"CourseStartDate", "CourseEndDate", "Course.Description", "Instructor", 
"Hours"), class = "data.frame", row.names = c(NA, -30L))
rnso
  • 23,686
  • 25
  • 112
  • 234
  • this approach worked very well too. Are there any pitfalls to combining all of the data and then separating them again? – alon bassok Aug 22 '14 at 04:27
  • There are no obvious pitfalls. It is always better to combine related data like person's details eg ID, birthday etc, so that one does not have to repeat them for every entry. If this approach worked, an upvote will tell other viewers about its usefulness! – rnso Aug 22 '14 at 06:00