After scraping some review data from a website, I am having difficulty organizing it into a useful structure for analysis. The difficulty is that the number of ratings varies: each reviewer rated anywhere between 0 and 3 subcategories (denoted "a", "b" and "c"). I would like to organize the reviews so that each row is a different reviewer and each column is a subcategory that was rated. Where a reviewer chose not to rate a subcategory, the value should be NA. Here is a simplified sample of the data:

vec <- c("a","b","c","stop", "a","b","stop", "stop", "c","stop")
ratings <- c(2,5,1, 1,3, 2) 

vec records which subcategories were scored, and each "stop" marks the end of one reviewer's ratings. ratings holds the scores in the same order, with no entries for the "stop" markers. I would like to organize the result into a data frame with this structure.

Expected output:

[Image: expected output table]

I would greatly appreciate any help on this, because I've been working on this issue for far longer than it should take me.

Dave Gruenewald

4 Answers

@alexis_laz provided what I believe is the best answer:

vec <- c("a","b","c","stop", "a","b","stop", "stop", "c","stop")
ratings <- c(2,5,1, 1,3, 2) 

stops <- vec == "stop"
i = cumsum(stops)[!stops] + 1L
j = vec[!stops]
tapply(ratings, list(factor(i, 1:max(i)), factor(j)), identity) # although mean/sum work  
#    a  b  c
# 1  2  5  1
# 2  1  3 NA
# 3 NA NA NA
# 4 NA NA  2
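
To make the indexing easier to follow, here is a sketch of the same idea with the intermediate values spelled out and the matrix converted to the data frame the question asks for. It assumes every reviewer's block ends with "stop", so `sum(stops)` is the number of reviewers; using that instead of `max(i)` is my own variation, intended to keep reviewers at the end of `vec` who rated nothing. The names `n`, `m` and `res` are illustrative only.

```r
vec <- c("a","b","c","stop", "a","b","stop", "stop", "c","stop")
ratings <- c(2,5,1, 1,3, 2)

stops <- vec == "stop"
n <- sum(stops)                    # assumption: one "stop" per reviewer, so n = 4

# cumsum(stops) counts the "stop"s seen so far; adding 1 gives the reviewer number
i <- cumsum(stops)[!stops] + 1L    # 1 1 1 2 2 4
j <- vec[!stops]                   # "a" "b" "c" "a" "b" "c"

# giving the factor all levels 1..n keeps reviewer 3, who rated nothing
m <- tapply(ratings, list(factor(i, levels = seq_len(n)), factor(j)), identity)

res <- as.data.frame(m)            # tapply returns a matrix; convert it
res
#    a  b  c
# 1  2  5  1
# 2  1  3 NA
# 3 NA NA NA
# 4 NA NA  2
```

Each (reviewer, subcategory) cell holds at most one rating here, which is why `identity` works as the summary function; cells with no rating are filled with NA.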
Evan Friedland
  • I believe a clearer version of this logic could be `stops = vec == "stop"; i = cumsum(stops)[!stops] + 1L; j = vec[!stops]; tapply(ratings, list(factor(i, 1:max(i)), factor(j)), identity)`? – alexis_laz Jul 26 '17 at 20:00
  • That is incredibly concise, and I am not familiar with using identity in tapply myself. I think your comment is a clear winner. – Evan Friedland Jul 26 '17 at 20:08

Base R, but using a for loop:

vec <- c("a","b","c","stop", "a","b","stop", "stop", "c","stop")
ratings <- c(2,5,1, 1,3, 2) 
categories <- unique(vec)[unique(vec)!="stop"]

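row <- 1      # index of the reviewer currently being filled
# start with a single all-NA row, one column per category
df <- data.frame(lapply(categories, function(x) NA_integer_))
colnames(df) <- categories
rating <- 1   # position of the next unused value in ratings

for (i in vec) {
  if (i == "stop") {
    row <- row + 1                    # "stop" closes the current reviewer
  } else {
    df[row, i] <- ratings[[rating]]   # assigning past the last row adds NA-filled rows
    rating <- rating + 1
  }
}
df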
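
One thing to watch with the loop: rows are only created when a rating is assigned, so a reviewer at the very end of vec who rated nothing would not get a row. A possible variation, sketched below, preallocates one row per "stop" and fills the cells with matrix indexing instead of a loop. This is a different technique from the loop above, not a rewrite of it, and it assumes every reviewer's block ends with "stop". The names `n`, `out`, `row_idx` and `col_idx` are illustrative only.

```r
vec <- c("a","b","c","stop", "a","b","stop", "stop", "c","stop")
ratings <- c(2,5,1, 1,3, 2)

stops <- vec == "stop"
n <- sum(stops)                                  # assumption: one "stop" per reviewer
categories <- setdiff(unique(vec), "stop")       # "a" "b" "c"

# preallocate an all-NA matrix: one row per reviewer, one column per category
out <- matrix(NA_real_, nrow = n, ncol = length(categories),
              dimnames = list(NULL, categories))

row_idx <- cumsum(stops)[!stops] + 1L            # reviewer number of each rating
col_idx <- match(vec[!stops], categories)        # column number of each rating

# a two-column index matrix addresses individual (row, column) cells
out[cbind(row_idx, col_idx)] <- ratings

df <- as.data.frame(out)
df
```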
Alex P

Here is one option

library(data.table)
library(reshape2)
d1 <- as.data.table(melt(split(vec, c(1, head(cumsum(vec == "stop")+1, 
        -1)))))[value != 'stop', ratings := ratings      
        ][value != 'stop'][, value := as.character(value)][, L1 := as.integer(L1)]


dcast( d1[CJ(value = value, L1 = seq_len(max(L1)), unique = TRUE), on = .(value, L1)], 
           L1 ~value, value.var = 'ratings')[, L1 := NULL][]
#    a  b  c
#1:  2  5  1
#2:  1  3 NA
#3: NA NA NA
#4: NA NA  2
akrun

Using base R functions and rbind.fill from plyr or rbindlist from data.table to produce the final object, we can do

# convert vec into a list, split by "stop", dropping final element
temp <- head(strsplit(readLines(textConnection(paste(gsub("stop", "\n", vec, fixed=TRUE),
                                                     collapse=" "))), split=" "), -1)
# remove empty strings, but maintain empty list elements
temp <- lapply(temp, function(x) x[nchar(x) > 0])
# match up appropriate names to the individual elements in the list with setNames
# convert vectors to single row data.frames
temp <- Map(function(x, y) setNames(as.data.frame.list(x), y),
            relist(ratings, skeleton = temp), temp)

# add silly data.frame (single row, single column) for any empty data.frames in list
temp <- lapply(temp, function(x) if(nrow(x) > 0) x else setNames(data.frame(NA), vec[1]))

Now, you can produce the single data.frame (data.table) with either plyr or data.table

# with plyr, returns data.frame
library(plyr)
do.call(rbind.fill, temp)
#    a  b  c
# 1  2  5  1
# 2  1  3 NA
# 3 NA NA NA
# 4 NA NA  2

# with data.table, returns data.table
library(data.table)
rbindlist(temp, fill=TRUE)
#     a  b  c
# 1:  2  5  1
# 2:  1  3 NA
# 3: NA NA NA
# 4: NA NA  2

Note that the line prior to the rbinding can be replaced with

temp[lengths(temp) == 0] <- replicate(sum(lengths(temp) == 0),
                                      setNames(data.frame(NA), vec[1]), simplify=FALSE)

where the list items that are empty data frames are replaced using subsetting instead of an lapply over the entire list.
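
As a side note, the list-building step at the top of this answer (the paste / readLines / strsplit chain) could probably also be sketched with split() and a grouping factor. This is an alternative I am suggesting, not part of the original answer; it assumes every reviewer's block ends with "stop", and the names `grp`, `cats` and `vals` are illustrative only.

```r
vec <- c("a","b","c","stop", "a","b","stop", "stop", "c","stop")
ratings <- c(2,5,1, 1,3, 2)

stops <- vec == "stop"

# reviewer number for each rating; listing all levels keeps reviewers with no ratings
grp <- factor(cumsum(stops)[!stops] + 1L, levels = seq_len(sum(stops)))

# split() keeps empty factor levels by default, so reviewer 3 becomes character(0)
cats <- split(vec[!stops], grp)
vals <- split(ratings, grp)

str(cats)
```

Here `cats` plays the role of temp after the empty strings are removed, and `vals` corresponds to the result of relist(ratings, skeleton = temp).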

lmo
  • The only issue here is that you miss the expected output for the 3rd reviewer, who didn't give any ratings. – Evan Friedland Jul 26 '17 at 19:29
  • @EvanFriedland Good call. I missed that. Added an extra line to fix that. – lmo Jul 26 '17 at 19:42
  • An interesting option with `relist` – akrun Jul 26 '17 at 19:52
  • I riffed off an answer you gave earlier on dropping accumulated duplicates. Quite a cool concept that you can use the skeleton argument to match the structure of another object, simultaneously filling it with new material. – lmo Jul 26 '17 at 19:58