Using the Nested List Column Approach and purrr Together with tidytext::unnest_tokens

Question

I have a dataframe that contains survey responses, with each row representing a different person. One column, "Text", holds the answers to an open-ended text question. I would like to use tidytext::unnest_tokens so that I can do text analysis for each row, including sentiment scores, word counts, etc.

Here is the simple dataframe for this example:

Satisfaction <- c("Satisfied", "Satisfied", "Dissatisfied", "Satisfied", "Dissatisfied")
Text <- c("I'm very satisfied with the services", "Your service providers are always late which causes me a lot of frustration", "You should improve your staff training, service providers have bad customer service", "Everything is great!", "Service is bad")
Gender <- c("M", "M", "F", "M", "F")
df <- data.frame(Satisfaction, Text, Gender)

I then turned the Text column into character...

df$Text <- as.character(df$Text)

Next I added an id column, grouped by it, tokenized the text, and nested the dataframe.

df <- df %>%
  mutate(id = row_number()) %>%
  group_by(id) %>%
  unnest_tokens(word, Text) %>%
  nest(-id)

Getting this far seems to have worked OK, but how do I now use purrr::map functions to work on the nested list column "word"? For example, how would I use dplyr::mutate to create a new column with the word count for each row?

Also, is there a better way to nest the dataframe so that only the "Text" column is a nested list?

It is not very clear what you want. You can do text analysis without having to use `tidyr::nest`; just stop after `unnest_tokens`. If you want to nest only the word column you can do `nest(word)`, but for it to work you have to `ungroup` the data frame first (or not group by id in the first place). – FlorianGD, Apr 03 '17 at 11:09

score 0 · Accepted Answer · answered Apr 07 '17 at 19:43

I love using purrr::map to do modeling for different groups, but for what you are talking about doing, I think you can stick with just straight dplyr.

You can set up your dataframe like this:

library(dplyr)
library(tidytext)

Satisfaction <- c("Satisfied",
                  "Satisfied",
                  "Dissatisfied",
                  "Satisfied",
                  "Dissatisfied")

Text <- c("I'm very satisfied with the services",
          "Your service providers are always late which causes me a lot of frustration", 
          "You should improve your staff training, service providers have bad customer service",
          "Everything is great!",
          "Service is bad")

Gender <- c("M","M","F","M","F")

# tibble() replaces the deprecated data_frame(); dplyr re-exports it
df <- tibble(Satisfaction, Text, Gender)

tidy_df <- df %>% 
    mutate(id = row_number()) %>% 
    unnest_tokens(word, Text)

Then, to find the number of words in each response, for example, you can use group_by and mutate.

tidy_df %>%
    group_by(id) %>%
    mutate(num_words = n()) %>%
    ungroup()
#> # A tibble: 37 × 5
#>    Satisfaction Gender    id      word num_words
#>           <chr>  <chr> <int>     <chr>     <int>
#> 1     Satisfied      M     1       i'm         6
#> 2     Satisfied      M     1      very         6
#> 3     Satisfied      M     1 satisfied         6
#> 4     Satisfied      M     1      with         6
#> 5     Satisfied      M     1       the         6
#> 6     Satisfied      M     1  services         6
#> 7     Satisfied      M     2      your        13
#> 8     Satisfied      M     2   service        13
#> 9     Satisfied      M     2 providers        13
#> 10    Satisfied      M     2       are        13
#> # ... with 27 more rows
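
If you do still want the nested list-column approach from the question, here is a minimal sketch. It assumes tidyr 1.0 or later, where `nest()` takes `new_name = columns`. The list-column name `words` and the use of `map_int()` with `nrow` are my own choices for illustration, not something taken from the question.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(tidytext)

# Same example data as above, rebuilt so this block runs on its own
df <- tibble(
  Satisfaction = c("Satisfied", "Satisfied", "Dissatisfied",
                   "Satisfied", "Dissatisfied"),
  Text = c("I'm very satisfied with the services",
           "Your service providers are always late which causes me a lot of frustration",
           "You should improve your staff training, service providers have bad customer service",
           "Everything is great!",
           "Service is bad"),
  Gender = c("M", "M", "F", "M", "F")
)

nested_df <- df %>%
  mutate(id = row_number()) %>%
  unnest_tokens(word, Text) %>%
  # Nest only the word column; every other column stays as a regular column
  nest(words = word) %>%
  # Each element of `words` is a one-column tibble, so nrow() is the word count
  mutate(num_words = map_int(words, nrow))

nested_df
```

This gives one row per respondent instead of one row per word. Any function that takes a data frame can be mapped over `words` in the same way.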

You can do sentiment analysis by inner-joining the tokenized words with a sentiment lexicon; the tidytext documentation has worked examples.
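
As a rough sketch of what that inner join can look like, the block below uses the Bing lexicon returned by `get_sentiments("bing")`, which ships with tidytext. The choice of lexicon and the result name `sentiment_by_id` are assumptions for illustration. Words that are not in the lexicon are dropped by the join, so a respondent with no matching words will not appear in the result.

```r
library(dplyr)
library(tidytext)

# Same example data as above, rebuilt so this block runs on its own
df <- tibble(
  Satisfaction = c("Satisfied", "Satisfied", "Dissatisfied",
                   "Satisfied", "Dissatisfied"),
  Text = c("I'm very satisfied with the services",
           "Your service providers are always late which causes me a lot of frustration",
           "You should improve your staff training, service providers have bad customer service",
           "Everything is great!",
           "Service is bad"),
  Gender = c("M", "M", "F", "M", "F")
)

tidy_df <- df %>%
  mutate(id = row_number()) %>%
  unnest_tokens(word, Text)

sentiment_by_id <- tidy_df %>%
  # Keep only words that appear in the lexicon, attaching their sentiment label
  inner_join(get_sentiments("bing"), by = "word") %>%
  # Count positive and negative words for each respondent
  count(id, sentiment)

sentiment_by_id
```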

Thanks for the help and the examples! – Mike, Apr 08 '17 at 05:18
