Opposite of unnest_tokens after creating dummy variable

Question

library(NLP)
library(tm)
library(tidytext)
library(tidyverse)
library(topicmodels)
library(dplyr)
library(stringr)
library(purrr)
library(tidyr)
#sample dataset
tags <- c("product, productdesign, electronicdevice")
web <- c("hardware, sunglasses, eyeware")
tags2 <- data_frame(tags, web)
#tokenize the words
toke <- tags2 %>%
  unnest_tokens(word, tags)
toke
#create a dummy variable
toke2 <- toke %>% mutate(
  product = ifelse(str_detect(word, "^product$"), "1", "0"))
#nest the tokens again and paste them back into one string
nested_toke <- toke2 %>%
  nest(word) %>%
  mutate(text = map(data, unlist), 
         text = map_chr(text, paste, collapse = " "))

nested_toke %>%
  select(text)

When I nest the column of tokenized words after creating the dummy variable based on the string "product", the result puts "product" in a new row below the row it originally belonged to.

The underlined "product" should be in the row above.

I would start a new session and try the code again. This runs fine for me. — Jake Kaupp, Feb 20 '18 at 21:24
Hey @JakeKaupp sorry I had accidentally included the words as separate values when they should be all included in the same row. I edited the syntax to replicate the issue. Sorry I should have checked this before. Thanks for the response! — Kreitz Gigs, Feb 21 '18 at 03:02
You don't need to `nest` then `mutate` with `purrr`; you could `group_by` and `summarize(text = paste(word, collapse = " "))` instead. This may get rid of your extra groups. — Jake Kaupp, Feb 21 '18 at 03:18
Hey @JakeKaupp thanks for the reply again. I tried the following syntax: `nested_toke <- toke2 %>% group_by(word) %>% summarize(text = paste(word, collapse = " "))`. But it seems to have dropped the dummy variable column "product". — Kreitz Gigs, Feb 21 '18 at 03:52
You have to include product in the group if you want to keep it with `summarize` — Jake Kaupp, Feb 21 '18 at 11:10
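The suggestion in these comments can be sketched roughly as follows. This is only an illustration, not code posted by either commenter: the name `by_flag` is made up here, `tibble()` and `if_else()` stand in for the question's `data_frame()` and `ifelse()`, and grouping by `web` as well as `product` is an assumption about which columns should survive.

```r
library(dplyr)
library(tidytext)

# Same sample data as in the question
tags2 <- tibble(
  tags = "product, productdesign, electronicdevice",
  web  = "hardware, sunglasses, eyeware"
)

# Tokenize, then add the dummy variable
toke2 <- tags2 %>%
  unnest_tokens(word, tags) %>%
  mutate(product = if_else(word == "product", "1", "0"))

# Group by every column that should be kept, then collapse the tokens.
# Because `product` is a grouping column, each of its values gets its own row.
by_flag <- toke2 %>%
  group_by(web, product) %>%
  summarize(text = paste(word, collapse = " ")) %>%
  ungroup()

by_flag
```

Keeping `product` in the grouping preserves the column, but it also means the tokens are split across one row per value of `product` rather than pasted back into a single string.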

score 1 · Answer 1 · answered Feb 23 '18 at 05:05

When you add a new column after unnesting, you have to think about what to do with it if you want to nest again. Let's work through it and see what we're talking about.

library(tidyverse)
tags <- c("product, productdesign, electronicdevice")
web <- c("hardware, sunglasses, eyeware")
tags2 <- data_frame(tags, web)

library(tidytext)
tidy_tags <- tags2 %>%
    unnest_tokens(word, tags)
tidy_tags
#> # A tibble: 3 x 2
#>   web                           word            
#>   <chr>                         <chr>           
#> 1 hardware, sunglasses, eyeware product         
#> 2 hardware, sunglasses, eyeware productdesign   
#> 3 hardware, sunglasses, eyeware electronicdevice

So that is your data set unnested, converted to a tidy form. Next, let's add the new column that detects whether the word "product" is in the word column.

tidy_product <- tidy_tags %>% 
    mutate(product = str_detect(word, "^product$"))
tidy_product
#> # A tibble: 3 x 3
#>   web                           word             product
#>   <chr>                         <chr>            <lgl>  
#> 1 hardware, sunglasses, eyeware product          T      
#> 2 hardware, sunglasses, eyeware productdesign    F      
#> 3 hardware, sunglasses, eyeware electronicdevice F

Now think about what your options are for nesting again. If you nest again without taking into account the new column (`nest(word)`), the structure has a NEW COLUMN and will have to make a NEW ROW to account for the two different values that column can take. You could instead do something like `nest(word, product)`, but then the TRUE/FALSE values will end up in your text string. If you want to get back to the original text format, you need to remove the new column you created, because having it there changes the relationships between rows and columns.

nested_product <- tidy_product %>%
    select(-product) %>%
    nest(word) %>%
    mutate(text = map(data, unlist), 
           text = map_chr(text, paste, collapse = ", "))

nested_product
#> # A tibble: 1 x 3
#>   web                           data             text                     
#>   <chr>                         <list>           <chr>                    
#> 1 hardware, sunglasses, eyeware <tibble [3 × 1]> product, productdesign, …
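To make the row-splitting described above concrete, here is a minimal sketch. It is not part of the original answer: it assumes tidyr 1.0 or later (for the `nest(data = word)` spelling), the names `split_rows` and `one_row` are made up for illustration, and collapsing the flag with `any()` is just one possible choice for keeping a single flag per row.

```r
library(dplyr)
library(tidyr)

# Rebuild the tidy data with the logical flag (hypothetical stand-in
# for `tidy_product` so this block runs on its own)
tidy_product <- tibble(
  web  = "hardware, sunglasses, eyeware",
  word = c("product", "productdesign", "electronicdevice")
) %>%
  mutate(product = word == "product")

# Nesting only `word` leaves `product` outside the nested column,
# so each distinct value of `product` gets its own row
split_rows <- tidy_product %>%
  nest(data = word)
nrow(split_rows)

# Alternative (an assumption about intent): reduce the flag to one
# value per `web` while pasting the words back together
one_row <- tidy_product %>%
  group_by(web) %>%
  summarize(
    text    = paste(word, collapse = ", "),
    product = any(product)
  )
one_row
```

Here `split_rows` ends up with two rows, one for each value of `product`, while `one_row` keeps a single row whose flag records whether any token matched.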

Created on 2018-02-22 by the reprex package (v0.2.0).
