Tokenizing a .txt file with tidytext

Question

I'm trying to use tidytext on a .txt file that I loaded as texto_revision. It has the following structure:

# A tibble: 254 x 230
   X1     X2     X3     X4    X5    X6    X7    X8    X9    X10   X11   X12   X13   X14   X15   X16  
   <chr>  <chr>  <chr>  <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
 1 la     expro~ de     la    tier~ ocur~ con   frec~ dura~ el    proc~ rapi~ de    la    urba~ en   
 2 como   las    difer~ en    el    moti~ del   cons~ cons~ en    esta~ unid~ y     china afec~ la   
 3 las    desig~ etnic~ en    los   patr~ de    cons~ (pre~ de    vest~ joye~ auto~ han   sido  obje~
 4 este   artic~ exami~ el    impa~ de    vari~ dife~ indi~ en    la    prop~ de    los   cons~ a    
 5 este   artic~ inves~ la    infl~ de    los   regi~ poli~ sobre la    impo~ 
 #   ...

When I try to use unnest_tokens with the following code:

library(tidytext)

texto_revision %>%
    unnest_tokens(word, text)

I get the following error:

Error: Error in check_input(x) : Input must be a character vector of any length or a list of character vectors, each of which has a length of 1.

To correct the error and continue with the tokenization, I tried converting the text into a data frame with the following code:

text_df <- as.data.frame(texto_revision)

but I still get the same error:

Error in check_input(x) : Input must be a character vector of any length or a list of character vectors, each of which has a length of 1.

Mikael Poul Johannesson · Answer 1 · 2018-05-11T03:50:31.530

1

It looks like your text is already tokenized, so you just need to melt the data frame to get the data structure you want. For instance,

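library(tidyverse)

# gather() stacks all columns into two: `document` holds the original
# column name (X1, X2, ...) and `word` holds the token from that cell
texto_revision %>%
  gather(document, word)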

See the docs for tidyr::gather().
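
A minimal sketch of the same idea on a toy tibble, in case it helps to see the result. The object toy and its words are made up for illustration and are not taken from texto_revision. The extra mutate() and filter() steps are an assumption about what you might want: they keep track of which row each word came from and drop the NA padding left by shorter rows.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical stand-in for texto_revision: one row per document,
# one already-split word per column, NA where a row runs out of words.
toy <- tibble(
  X1 = c("uno", "cuatro"),
  X2 = c("dos", "cinco"),
  X3 = c("tres", NA)
)

long <- toy %>%
  mutate(document = row_number()) %>%    # remember which row each word came from
  gather(position, word, -document) %>%  # stack X1..X3 into position/word pairs
  filter(!is.na(word)) %>%               # drop the padding left by short rows
  arrange(document)

long
```

The result has one word per row, which is the shape most tidytext functions expect. tidyr::pivot_longer() is the newer equivalent of gather() and could be swapped in.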

edited May 11 '18 at 03:50

answered May 11 '18 at 03:45

Mikael Poul Johannesson


score 0 · Accepted Answer · answered May 11 '18 at 01:11


Note that the syntax for unnest_tokens is unnest_tokens([new column name], [reference column]). There appears to be no "text" column in your tibble/data frame. Below is a toy example to illustrate:

library(dplyr)     # provides %>%
library(tidytext)  # provides unnest_tokens()

# Toy data: one sentence per row in a single character column
State <- c("SC is in the South", "NC is in the south",
           "NY is in  the north")
DF <- data.frame(State, stringsAsFactors = FALSE)

> DF
               State
1 SC is in the South
2 NC is in the south
.....

> DF %>% unnest_tokens(word, State)
    word
1     sc
1.1   is
1.2   in
1.3  the
....
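
If you are starting from a .txt file rather than a hand-made vector, something along these lines may work. This is only a sketch: the temporary file and its two lines are made up so the example runs on its own, and path would be the location of your own file. It assumes each line of the file is one document, and that the X1, X2, ... columns in your tibble came from a reader that split the lines on spaces, which is a guess.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical file, written to a temporary path so the sketch is self-contained.
# In practice `path` would point at your own .txt file.
path <- tempfile(fileext = ".txt")
writeLines(c("primera linea de texto", "segunda linea"), path)

# Read each line as one whole string, so the words stay in a single column
txt_lines <- readLines(path, encoding = "UTF-8")
texto <- tibble(document = seq_along(txt_lines), text = txt_lines)

# Now there is a `text` column for unnest_tokens() to split
tokens <- texto %>%
  unnest_tokens(word, text)

tokens
```

Reading with readLines() keeps every line intact, so unnest_tokens() has a character column to work on and the check_input() error should not come up.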

answered May 11 '18 at 01:11

Peter_Evan


Do I have to create the object with c (...., or can I import a data set and start working with unnest_tokens right away? – Samir Ricardo Neme Chaves May 11 '18 at 03:28

"c (" in R is used to create vectors (i.e. combine observations). I only used it to make a toy example for illustrating the syntax. As Mikael points out, your data frame appears to already be tokenized, with one word in each field. You may need to "tidy" your data frame further by forcing your words into one column, which is why you were pointed towards the gather function. – Peter_Evan May 11 '18 at 13:29
