How can I insert values into a data frame dynamically using R

Question

After scraping some review data from a website, I am having difficulty organizing the data into a useful structure for analysis. The problem is that the data is dynamic, in that each reviewer gave ratings on anywhere between 0 and 3 subcategories (denoted as subcategories "a", "b" and "c"). I would like to organize the reviews so that each row is a different reviewer, and each column is a subcategory that was rated. Where reviewers chose not to rate a subcategory, I would like that missing data to be 'NA'. Here is a simplified sample of the data:

vec <- c("a","b","c","stop", "a","b","stop", "stop", "c","stop")
ratings <- c(2,5,1, 1,3, 2)

`vec` lists the subcategories that were scored, and each "stop" marks the end of one reviewer's ratings; `ratings` holds the scores in the same order. I would like to organize the result into a data frame with this structure.

Expected output:

   a  b  c
1  2  5  1
2  1  3 NA
3 NA NA NA
4 NA NA  2

I would greatly appreciate any help on this, because I've been working on this issue for far longer than it should take me.

Evan Friedland · Accepted Answer · 2017-07-26T20:20:55.893

4

@alexis_laz provided what I believe is the best answer:

vec <- c("a","b","c","stop", "a","b","stop", "stop", "c","stop")
ratings <- c(2,5,1, 1,3, 2) 

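stops <- vec == "stop"            # TRUE at the end of each reviewer's block
i <- cumsum(stops)[!stops] + 1L   # reviewer (row) index of each rating
j <- vec[!stops]                  # subcategory (column) of each rating
# factor(i, 1:max(i)) keeps a level for reviewers who rated nothing, so they show up as NA rows
tapply(ratings, list(factor(i, 1:max(i)), factor(j)), identity) # although mean/sum work
#    a  b  c
# 1  2  5  1
# 2  1  3 NA
# 3 NA NA NA
# 4 NA NA  2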
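
To see what the index vectors are doing, and to end up with a data frame (which the question asked for) rather than the matrix that tapply() returns, here is a minimal sketch. It swaps tapply() for plain two-column matrix indexing, so it is a variation on the idea and not the code from the comment. The names `cats`, `m` and `df` are my own, and the alphabetical column order is an assumption.

```r
vec <- c("a","b","c","stop", "a","b","stop", "stop", "c","stop")
ratings <- c(2,5,1, 1,3, 2)

stops <- vec == "stop"
i <- cumsum(stops)[!stops] + 1L   # reviewer index of each rating: 1 1 1 2 2 4
j <- vec[!stops]                  # subcategory of each rating: a b c a b c

cats <- sort(unique(j))           # assumed column order: alphabetical

# one row per "stop", one column per subcategory, all NA to start with
m <- matrix(NA_real_, nrow = sum(stops), ncol = length(cats),
            dimnames = list(NULL, cats))

# a two-column index matrix addresses one (row, column) cell per rating
m[cbind(i, match(j, cats))] <- ratings

df <- as.data.frame(m)
```

Sizing the matrix with `sum(stops)` is a design choice: every "stop" gets a row whether or not that reviewer rated anything.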

edited Jul 26 '17 at 20:20

answered Jul 26 '17 at 18:51



I believe a clearer version of this logic could be `stops = vec == "stop"; i = cumsum(stops)[!stops] + 1L; j = vec[!stops]; tapply(ratings, list(factor(i, 1:max(i)), factor(j)), identity)`? – alexis_laz Jul 26 '17 at 20:00
That is incredibly concise, and I am not familiar with using identity in tapply myself. I think your comment is a clear winner. – Evan Friedland Jul 26 '17 at 20:08

Alex P · Answer 2 · 2017-07-26T19:45:02.443

3

base R, but I'm using a for loop...

vec <- c("a","b","c","stop", "a","b","stop", "stop", "c","stop")
ratings <- c(2,5,1, 1,3, 2) 
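# all distinct subcategories, excluding the "stop" marker
categories <- setdiff(unique(vec), "stop")

# start with a single all-NA row; further rows are created on assignment
df <- data.frame(lapply(categories, function(x) NA_integer_))
colnames(df) <- categories

row <- 1     # current reviewer
rating <- 1  # position of the next unused value in ratings

for (i in vec) {
  if (i == "stop") {
    row <- row + 1                   # move on to the next reviewer
  } else {
    df[row, i] <- ratings[[rating]]  # fill this reviewer's subcategory
    rating <- rating + 1
  }
}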
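
One caveat with a loop that grows the data frame on assignment: a reviewer with no ratings only gets a row if some later reviewer rated something, so trailing empty reviewers would be dropped. A hedged variation that preallocates one row per "stop" is sketched below. The function name `fill_reviews` and the variable `n_reviewers` are hypothetical, not part of the original answer.

```r
# fill_reviews() is a hypothetical helper name used only for this sketch
fill_reviews <- function(vec, ratings) {
  categories <- setdiff(unique(vec), "stop")
  n_reviewers <- sum(vec == "stop")          # one row per "stop"

  # preallocate an all-NA data frame with the final shape
  df <- as.data.frame(matrix(NA_real_, nrow = n_reviewers,
                             ncol = length(categories),
                             dimnames = list(NULL, categories)))
  row <- 1L
  rating <- 1L
  for (v in vec) {
    if (v == "stop") {
      row <- row + 1L                        # next reviewer
    } else {
      df[row, v] <- ratings[rating]          # fill one cell
      rating <- rating + 1L
    }
  }
  df
}

vec <- c("a","b","c","stop", "a","b","stop", "stop", "c","stop")
ratings <- c(2,5,1, 1,3, 2)
res <- fill_reviews(vec, ratings)
```

With an input such as `c("a", "stop", "stop")` this version returns two rows, the second one all NA.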

edited Jul 26 '17 at 19:45

answered Jul 26 '17 at 19:17


This is a really good option if you know the "a","b","c" beforehand. – Evan Friedland Jul 26 '17 at 19:31
I guess we could build and fill in `df` using `unique(vec)` – Alex P Jul 26 '17 at 19:35
OK @EvanFriedland, I amended the answer so you don't need to know or specify the categories explicitly. – Alex P Jul 26 '17 at 19:49

I'm not the OP but I am giving you +1 because I like it :) – Evan Friedland Jul 26 '17 at 19:52

score 2 · Answer 3 · answered Jul 26 '17 at 18:53

Here is one option using data.table and reshape2 (it assumes `vec` and `ratings` as defined in the question):

library(data.table)
library(reshape2)
d1 <- as.data.table(melt(split(vec, c(1, head(cumsum(vec == "stop")+1, 
        -1)))))[value != 'stop', ratings := ratings      
        ][value != 'stop'][, value := as.character(value)][, L1 := as.integer(L1)]


dcast( d1[CJ(value = value, L1 = seq_len(max(L1)), unique = TRUE), on = .(value, L1)], 
           L1 ~value, value.var = 'ratings')[, L1 := NULL][]
#    a  b  c
#1:  2  5  1
#2:  1  3 NA
#3: NA NA NA
#4: NA NA  2

lmo · Answer 4 · 2017-07-27T11:50:41.527

Using base R functions and rbind.fill from plyr or rbindlist from data.table to produce the final object, we can do

# convert vec into a list, split by "stop", dropping final element
temp <- head(strsplit(readLines(textConnection(paste(gsub("stop", "\n", vec, fixed=TRUE),
                                                     collapse=" "))), split=" "), -1)
# remove empty strings, but maintain empty list elements
temp <- lapply(temp, function(x) x[nchar(x) > 0])
# match up appropriate names to the individual elements in the list with setNames
# convert vectors to single row data.frames
temp <- Map(function(x, y) setNames(as.data.frame.list(x), y),
            relist(ratings, skeleton = temp), temp)

# add silly data.frame (single row, single column) for any empty data.frames in list
temp <- lapply(temp, function(x) if(nrow(x) > 0) x else setNames(data.frame(NA), vec[1]))

Now, you can produce the single data.frame (data.table) with either plyr or data.table

# with plyr, returns data.frame
library(plyr)
do.call(rbind.fill, temp)
#    a  b  c
# 1  2  5  1
# 2  1  3 NA
# 3 NA NA NA
# 4 NA NA  2

# with data.table, returns data.table
library(data.table)
rbindlist(temp, fill=TRUE)
#     a  b  c
# 1:  2  5  1
# 2:  1  3 NA
# 3: NA NA NA
# 4: NA NA  2

Note that the line prior to the rbinding can be replaced with

temp[lengths(temp) == 0] <- replicate(sum(lengths(temp) == 0),
                                      setNames(data.frame(NA), vec[1]), simplify=FALSE)

where the list items that are empty data frames are replaced using subsetting instead of an lapply over the entire list.
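
The skeleton idea can also be seen in isolation with base R only. In the sketch below the per-reviewer groups are built with split() on a factor instead of the readLines()/strsplit() route above, so it is a swapped-in way of getting the skeleton, not the code from this answer. The names `reviewer`, `groups`, `per_reviewer`, `cats`, `rows` and `m` are illustrative.

```r
vec <- c("a","b","c","stop", "a","b","stop", "stop", "c","stop")
ratings <- c(2,5,1, 1,3, 2)

stops <- vec == "stop"
# keep a factor level for every reviewer so empty reviewers survive split()
reviewer <- factor(cumsum(stops)[!stops] + 1L, levels = seq_len(sum(stops)))
groups <- split(vec[!stops], reviewer)   # element "3" is character(0)

# relist() cuts the flat ratings vector to match the shape of the skeleton
per_reviewer <- relist(ratings, skeleton = groups)

cats <- sort(unique(vec[!stops]))
# name each reviewer's ratings, then look up every category (missing ones give NA)
rows <- Map(function(vals, nms) unname(setNames(vals, nms)[cats]),
            per_reviewer, groups)
m <- do.call(rbind, rows)
colnames(m) <- cats
```

Here `m` is a numeric matrix with one row per reviewer; `as.data.frame(m)` would turn it into a data frame if that is what is needed.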

Only issue here is you miss the expected output data of 3rd reviewer who didn't give any ratings. — Evan Friedland, Jul 26 '17 at 19:29
@EvanFriedland Good call. I missed that. Added an extra line to fix that. — lmo, Jul 26 '17 at 19:42
I riffed off an answer you gave earlier on dropping accumulated duplicates. Quite a cool concept that you can use the skeleton argument to match the structure of another object, simultaneously filling it with new material. — lmo, Jul 26 '17 at 19:58
