Questions tagged [tidytext]

The tidytext package provides tools for text mining using tidy data principles in R.

The R tidytext package, developed by Julia Silge and David Robinson, provides functions and supporting data sets to allow conversion of text to and from tidy formats, and to switch seamlessly between tidy tools and existing text mining packages. When text is in a tidy data structure, tools from the R tidyverse ecosystem such as dplyr, tidyr, and ggplot2 can be used for effective data handling and analysis.
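
As a minimal sketch of what "tidy format" means here, the example below turns a small, made-up data frame into a one-token-per-row table with unnest_tokens() and then counts words with dplyr. The data and the column names `line`, `text`, and `word` are illustrative assumptions, not part of the tag description.

```r
library(dplyr)
library(tidytext)

# Hypothetical two-line corpus; `line` and `text` are illustrative column names.
docs <- tibble(
  line = 1:2,
  text = c("Tidy data makes text mining simpler.",
           "Each row holds one token per document.")
)

# One token (word) per row; by default tokens are lowercased and
# punctuation is dropped. Other columns such as `line` are carried along.
tidy_words <- docs %>%
  unnest_tokens(word, text)

# Once the text is tidy, ordinary dplyr verbs apply.
word_counts <- tidy_words %>%
  count(line, word, sort = TRUE)
```

Because the result is an ordinary tibble, it can be passed on to joins (for example against a stop-word or sentiment lexicon) or to ggplot2 without further conversion.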

294 questions
1 vote, 2 answers

Find characters before and after dollar amount in vector of text data in R

I have a vector of text data (news data). I am trying to scan the text for any money amount and the text surrounding this amount. I managed this with the first element of my vector but struggle with using a loop and list to repeat the process for…
Marco
1 vote, 1 answer

unnest_tokens and keep original columns (tidytext)

The unnest_tokens function of the package tidytext is supposed to keep the other columns of the dataframe (tibble) you pass to it. In the example provided by the authors of the package ("tidy_books" on Austen's data) it works fine, but I get some…
Dario Lacan
1 vote, 1 answer

Error in R term frequency analysis (TF-IDF)

I tried to run the following code with the following data: library(dplyr) library(janeaustenr) library(tidytext) book_words <- austen_books() %>% unnest_tokens(word, text) %>% count(book, word, sort = TRUE) For this, I get this error…
Renée
1 vote, 1 answer

Correlation and graph layout in widyr and ggraph when tidy text mining

I'm using a tutorial (https://www.tidytextmining.com/nasa.html?q=correlation%20ne#networks-of-keywords) to learn about tidy text mining. I am hoping someone might be able to help with two questions: in this tutorial, the correlation used to make…
Gabriella
1 vote, 2 answers

How to extract key phrases following specific characters using regex in R?

I have a dataframe that looks like so: ID | Tweet_ID | Tweet 1 12345 @sprintcare I did. 2 SPRINT @12345 Please send us a Private Message. 3 45678 @apple My information is incorrect. 4 APPLE @45678 What information is…
Dinho
1 vote, 3 answers

R: Text Mining, create list of words per document

I am reading in the text from a number of PDFs in a directory. Then, I split these texts into single words (tokens) using the tidytext::unnest_tokens() function. Can someone please tell me how I can add an additional column to the test-tibble with…
D. Studer
1 vote, 1 answer

bind_tf_idf() error: in tapply(n, documents, sum) : arguments must have same length

I am trying to do bind_tf_idf() for the following df. My df has two documents/classes: Y or N. > test_2 # A tibble: 3,295 x 2 Class word 1 Y nature 2 Y great 3 Y are 4 Y present 5 N in 6…
1 vote, 1 answer

R tidytext sentiment analysis- how to use the drop parameter

I recently asked a question about entries that are omitted after a sentiment analysis. The tweets that I analyse don't always contain words that are in the lexicon. I would like to know which ones can't be translated. So I would like to keep these…
Iarwain
1 vote, 1 answer

How to efficiently handle big data in R for text mining

With the help of the tidytext package, I'm trying to count all bigrams and trigrams for a personal example. However, this personal dataset has +1 million lines (paragraphs really) and lots of words in each one. This is a memory-intensive process…
caproki
1 vote, 1 answer

Add detected topics to input data

library(dplyr) library(ggplot2) library(stm) library(janeaustenr) library(tidytext) library(quanteda) testDfm <- gadarian$open.ended.response %>% tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>% dfm() out…
rek
1 vote, 1 answer

usage of bind tf_df in R

library(janeaustenr) library(tidytext) library(tidyverse) library(tm) library(corpus) text <- removeNumbers(sensesensibility) text <- data.frame(text) tidy_text <- text %>%…
Vikram
1 vote, 2 answers

Mapping the topic of the review in R

I have two data sets, Review Data & Topic Data Dput code of my Review Data structure(list(Review = structure(2:1, .Label = c("Canteen Food could be improved", "Sports and physical exercise need to be given importance"), class = "factor")), class =…
Suhas U
1 vote, 0 answers

How to reorder facet_grid() columns based in R using ggplot2?

Don't think this is a duplicate of others, but happy to delete if it is. Dataset contains 3 columns: 'Recipient' (x-axis), 'Amount' (y-axis), and 'Department'(grid-column/fill). How can I re-order facet grids more intuitively in descending order by…
owlstone
1 vote, 1 answer

Tokenizing word using tidytext - preserving punctuation

I've been trying to preserve punctuation like "-" "(" "/" "'" when tokenizing words. data = tibble(title = "Computer-aided detection (1 / 2)") data %>% unnest_tokens(input = title, output = słowo, token =…
Pawliczek
1 vote, 1 answer

R unnest_tokens elements from list

I have this: library(tidytext) list_chars <- list("you and I", "he or she", "we and they") list_chars_as_tibble <- lapply(list_chars, tibble) list_chars_by_word <- lapply(list_chars_as_tibble, unnest_tokens) got this: Error in check_input(x) : …
nasifffors