R: Why do I lose data with spread()?

Question

I have a tibble that looks like this.

# A tibble: 1,000 x 3
   id    question                                   answer
   <chr> <chr>                                      <chr>
 1 aaa   What is your favorite color?               Green
 2 aaa   What is your favorite band?                Green Day
 3 aaabb What is your favorite color?               Blue
 4 aaabb What is your favorite band?                Blue
 5 ccc   What is your favorite color?               Blue
 6 ccc   What is the difference between you and me? Five bank accounts
# ... with more rows

I'd like to expand it into a wide data frame. I used this code.

aTibble %>% distinct() %>% spread(question, answer)

But I end up with a data frame whose rows contain nothing but NA.

  # A tibble: 1,000 x 3
       id                 V1              What is your favorite color?   What is your favorite band?   What is the difference between you and me?                                                 
     1 aaa                               NA                              NA                            NA                                                        
     2 aaa                               NA                              NA                            NA                         
     3 aaabb                             NA                              NA                            NA                                                
     4 aaabb                             NA                              NA                            NA
     5 ccc                               NA                              NA                            NA                                          
     6 ccc                               NA                              NA                            NA               
    # ... with more rows

In the original tibble, some rows have an ID but null for both question and answer. There are no duplicate questions for a single ID. That said, different IDs can answer different questions; they don't all have the same questions.

Additionally, I didn't create the V1 column, and it wasn't in my original tibble. It appeared after the spread().

The frustrating part is that when I run the code on a small dataset, it works just fine. When I run it on the full dataset (~150K records), I get NAs.

score 2 · Answer 1 · answered Feb 26 '19 at 05:18

It is hard to tell from the question why spread() fails here. dcast() from reshape2 is a good alternative that achieves the same reshaping.

library(reshape2)
aTibble %>% distinct() %>% dcast(id ~ question, value.var = "answer")
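
Since the question mentions rows that have an ID but a null question and answer, it may also be worth dropping those rows before reshaping. This is only a minimal sketch under that assumption, not a confirmed diagnosis: the toy data below is hypothetical (modelled on the question, with an invented id "ddd" for the null row), and whether the missing keys are what produces the V1 column in the real data is a guess.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data modelled on the question. The last row has an id
# but no question or answer, like the null rows described in the question.
aTibble <- tibble(
  id = c("aaa", "aaa", "aaabb", "aaabb", "ccc", "ccc", "ddd"),
  question = c(
    "What is your favorite color?",
    "What is your favorite band?",
    "What is your favorite color?",
    "What is your favorite band?",
    "What is your favorite color?",
    "What is the difference between you and me?",
    NA
  ),
  answer = c("Green", "Green Day", "Blue", "Blue", "Blue",
             "Five bank accounts", NA)
)

# How many rows have a missing or blank key?
n_missing <- sum(is.na(aTibble$question) | aTibble$question == "")

wide <- aTibble %>%
  distinct() %>%
  # Keep only rows with a usable key before spreading.
  filter(!is.na(question), question != "") %>%
  spread(question, answer)
```

Note that an ID whose only rows have a null question disappears from the result entirely, so this is only appropriate if those IDs are not needed in the wide table.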

answered Feb 26 '19 at 05:18

Croote
