Questions tagged [tidytext]

The tidytext package provides tools for text mining using tidy data principles in R.

The R tidytext package, developed by Julia Silge and David Robinson, provides functions and supporting data sets to allow conversion of text to and from tidy formats, and to switch seamlessly between tidy tools and existing text mining packages. When text is in a tidy data structure, tools from the R tidyverse ecosystem can be used for effective data handling and analysis.
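
As a rough illustration of the tidy format described above, here is a minimal sketch that assumes the dplyr and tidytext packages are installed. The `docs` tibble and its contents are made up for this example. It tokenizes text into one word per row with `unnest_tokens()`, then drops common stop words using the `stop_words` data set shipped with tidytext.

```r
library(dplyr)
library(tidytext)

# Hypothetical two-document corpus, invented for illustration
docs <- tibble(
  id   = 1:2,
  text = c("Tidy text is tidy data.", "One token per row.")
)

# One row per word; unnest_tokens() lowercases and strips punctuation by default
tidy_docs <- docs %>%
  unnest_tokens(word, text)

# Remove stop words, then count the remaining words
word_counts <- tidy_docs %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)

print(word_counts)
```

Once the text is in this one-token-per-row shape, ordinary dplyr verbs such as `count()`, `filter()`, and `group_by()` apply directly.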

294 questions
3
votes
1 answer

Unicode characters not showing after using 'str_extract_all' function (stringr) in Rstudio

I am trying to extract a series of words from a series of .txt documents with the 'str_extract_all' stringr function. Everything works well except that the results I get do not show Unicode characters (which are fine in the UTF-8 texts where the…
3
votes
1 answer

Tidyverse unnest_tokens does not work inside function

I have an unnest_tokens function that works in the code, but once I put it into a function I cannot get it to work. I don't understand why this happens when I put it inside a function. data: id words 1 why is this function not…
Dennis Loos
  • 113
  • 2
  • 9
3
votes
3 answers

R: Error in UseMethod("tbl_vars")

So I'm running the code below in R Studio and getting this error: Error in UseMethod("tbl_vars") : no applicable method for 'tbl_vars' applied to an object of class "character" I don't know how to fix it cause there is no tbl_vars function! Can…
carmem
  • 33
  • 1
  • 1
  • 5
3
votes
2 answers

R: Opposite to aggregate using tidytext::unnest_tokens. Multiple variables and upper case

Following up on this question, I want to perform a task opposite to aggregate (or the data.table equivalent as in the MWE below), so that I obtain df1 again, starting from df2. The task here then is to reproduce df1 from df2. For this, I tried…
DaniCee
  • 2,397
  • 6
  • 36
  • 59
3
votes
4 answers

Does tidytext::unnest_tokens work with Spanish characters?

I am trying to use unnest_tokens with Spanish text. It works fine with unigrams, but breaks the special characters with bigrams. The code works fine on Linux. I added some info on the locale. library(tidytext) library(dplyr) df <- data_frame( …
rlabuonora
  • 31
  • 2
3
votes
1 answer

R tidytext stop_words are not filtering consistently from gutenbergr downloads

This is a bizarre puzzle. I downloaded 2 texts from gutenbergr - Alice in Wonderland and Ulysses. The stop_words disappear from Alice but they are still in Ulysses. This issue persisted even when replacing anti_join with filter (!word %in%…
3
votes
1 answer

dplyr unnest_tokens not working

I am loading one of the 5-core datasets from http://jmcauley.ucsd.edu/data/amazon/ using library(sparklyr) library(dplyr) config <- spark_config() config$`sparklyr.shell.driver-memory` <- "2G" sc = spark_connect(master = "local",config =…
AngryR11
  • 93
  • 1
  • 6
3
votes
1 answer

replace string from tibble with part of that string

I have searched a lot of regex answers here, but can't find the solution to this kind of problem. My dataset is a tibble with wikipedia links: library(tidytext) library(stringr) text.raw <- "Berthold Speer was een [[Duitsland…
raoul
  • 197
  • 3
  • 14
3
votes
2 answers

Removing stop words with tidytext

Using tidytext, I have this code: data(stop_words) tidy_documents <- tidy_documents %>% anti_join(stop_words) I want it to use the stop words built into the package to write a dataframe called tidy_documents into a dataframe of the same name,…
Simon Lindgren
  • 2,011
  • 12
  • 32
  • 46
2
votes
2 answers

Extracting mixed date from string in R

I have a vector of characters that looks like the table below, I would like to extract the dates from them and convert them as.Date. For example, row one would be 09-11-2021. The last number in the string is the number of columns and not part of the…
2
votes
1 answer

Remove Numbers, Punctuations, White Spaces before Tokenization

I have the following data frame report <- data.frame(Text = c("unit 1 crosses the street", "driver 2 was speeding and saw driver# 1", "year 2019 was the year before the pandemic", "hey saw hei hei in the …
S Das
  • 3,291
  • 6
  • 26
  • 41
2
votes
1 answer

Most commonly mentioned countries in the corpus; extracting country names from abstracts R

I have a corpus of a couple of thousand documents and I'm trying to find the most commonly mentioned countries in the abstracts. The library countrycode seems to have a comprehensive list of country names I can match against: # country.name.alt…
QAsena
  • 603
  • 4
  • 9
2
votes
1 answer

Expand tibble of email dataset in R

I have a massive tibble of my email data which looks like the following: library(dplyr) emails <- tibble( from = c('employee.1@xtra.co','employee.5@xtra.co','employee.1@xtra.co', 'employee.3@xtra.co','employee.1@xtra.co'), to =…
M.Qasim
  • 1,827
  • 4
  • 33
  • 58
2
votes
2 answers

`str_replace_all()` on html output (from `huxtable()`)

My R code generates some html output which I'd like to make two very simple "find and replace" type adjustments to: instead of R2 in the html, I'd like to replace with R2 instead of [number] *** in the html, I'd like to replace with…
Jeremy K.
  • 1,710
  • 14
  • 35
2
votes
1 answer

tidytext error (Error in is_corpus_df(corpus) : ncol(corpus) >= 2 is not TRUE)

I am trying to do some basic text analysis. After installing the 'tidytext' package, I tried to unnest my data frame, but I keep getting an error. I assume there is some package I am missing, but I am not sure how to figure out which. Any…
Susan Ray
  • 37
  • 3