Questions tagged [tidytext]

The tidytext package provides tools for text mining using tidy data principles in R.

The R tidytext package, developed by Julia Silge and David Robinson, provides functions and supporting data sets to allow conversion of text to and from tidy formats, and to switch seamlessly between tidy tools and existing text mining packages. When text is in a tidy data structure, tools from the R tidyverse ecosystem like dplyr can be used for effective data handling and analysis.

Other resources

Text Mining with R: A Tidy Approach

Related tags

R's tm, quanteda, dplyr, tidyr, and broom packages

294 questions

1 answer

Unicode characters not showing after using 'str_extract_all' function (stringr) in RStudio

I am trying to extract a series of words from a series of .txt documents with the 'str_extract_all' stringr function. Everything works well except that the results I get do not show Unicode characters (which are fine in the UTF-8 texts where the…

r utf-8 tidyverse stringr tidytext

asked Jan 03 '19 at 15:33

Laura Linares

1 answer

Tidyverse unnest_tokens does not work inside function

I have an unnest_tokens call that works in my code, but once I put it into a function I cannot get it to work. I don't understand why this happens when I put it inside a function. data: id words 1 why is this function not…

r function tidyverse tidytext unnest

asked Jun 28 '18 at 03:47

Dennis Loos

3 answers

R: Error in UseMethod("tbl_vars")

So I'm running the code below in RStudio and getting this error: Error in UseMethod("tbl_vars") : no applicable method for 'tbl_vars' applied to an object of class "character" I don't know how to fix it because there is no tbl_vars function! Can…

r loops dplyr tidytext

asked May 23 '18 at 13:19

carmem

2 answers

R: Opposite to aggregate using tidytext::unnest_tokens. Multiple variables and upper case

Following up on this question, I want to perform a task opposite to aggregate (or the data.table equivalent as in the MWE below), so that I obtain df1 again, starting from df2. The task here then is to reproduce df1 from df2. For this, I tried…

r reshape tidytext

asked Jan 05 '18 at 08:59

DaniCee

4 answers

Does tidytext::unnest_tokens work with Spanish characters?

I am trying to use unnest_tokens with Spanish text. It works fine with unigrams, but breaks the special characters with bigrams. The code works fine on Linux. I added some info on the locale. library(tidytext) library(dplyr) df <- data_frame( …

r tidytext

asked Dec 08 '17 at 13:55

rlabuonora

1 answer

R tidytext stop_words are not filtering consistently from gutenbergr downloads

This is a bizarre puzzle. I downloaded 2 texts from gutenbergr - Alice in Wonderland and Ulysses. The stop_words disappear from Alice but they are still in Ulysses. This issue persisted even when replacing anti_join with filter (!word %in%…

r stop-words tidytext anti-join

asked Nov 09 '17 at 19:14

Regis Maria O'Connor

1 answer

dplyr unnest_tokens not working

I am loading one of the 5-core datasets from http://jmcauley.ucsd.edu/data/amazon/ using library(sparklyr) library(dplyr) config <- spark_config() config$`sparklyr.shell.driver-memory` <- "2G" sc = spark_connect(master = "local",config =…

r dplyr sparklyr tidytext

asked Aug 23 '17 at 22:20

AngryR11

1 answer

replace string from tibble with part of that string

I have searched a lot of regex answers here, but can't find the solution to this kind of problem. My dataset is a tibble with wikipedia links: library(tidytext) library(stringr) text.raw <- "Berthold Speer was een [[Duitsland…

r regex stringr tidytext

asked Jul 07 '17 at 13:20

raoul

2 answers

Removing stop words with tidytext

Using tidytext, I have this code: data(stop_words) tidy_documents <- tidy_documents %>% anti_join(stop_words) I want it to use the stop words built into the package to write a dataframe called tidy_documents into a dataframe of the same name,…

r dplyr tidyverse tidytext

asked Apr 16 '17 at 20:36

Simon Lindgren

2 answers

Extracting mixed date from string in R

I have a vector of characters that looks like the table below. I would like to extract the dates from them and convert them with as.Date. For example, row one would be 09-11-2021. The last number in the string is the number of columns and not part of the…

r gsub stringr tidytext

asked Dec 27 '22 at 15:59

I_like_insights

1 answer

Remove Numbers, Punctuation, and White Space before Tokenization

I have the following data frame report <- data.frame(Text = c("unit 1 crosses the street", "driver 2 was speeding and saw driver# 1", "year 2019 was the year before the pandemic", "hey saw hei hei in the …

r text-mining tm stop-words tidytext

asked Apr 22 '22 at 15:20

S Das

1 answer

Most commonly mentioned countries in the corpus; extracting country names from abstracts R

I have a corpus of a couple of thousand documents and I'm trying to find the most commonly mentioned countries in the abstracts. The library countrycode seems to have a comprehensive list of country names I can match against: # country.name.alt…

r regex string dplyr tidytext

asked Oct 06 '21 at 23:01

QAsena

1 answer

Expand tibble of email dataset in R

I have a massive tibble of my email data which looks like the following: library(dplyr) emails <- tibble( from = c('employee.1@xtra.co','employee.5@xtra.co','employee.1@xtra.co', 'employee.3@xtra.co','employee.1@xtra.co'), to =…

r dplyr tidyr tibble tidytext

asked Mar 01 '21 at 04:42

M.Qasim

2 answers

`str_replace_all()` on html output (from `huxtable()`)

My R code generates some HTML output which I'd like to make two very simple "find and replace" type adjustments to: instead of R2 in the HTML, I'd like to replace with R² instead of [number] *** in the HTML, I'd like to replace with…

r regex tidyverse stringr tidytext

asked Jun 09 '20 at 03:41

Jeremy K.

1 answer

tidytext error (Error in is_corpus_df(corpus) : ncol(corpus) >= 2 is not TRUE)

I am trying to do some basic text analysis. After installing the 'tidytext' package, I tried to unnest my data frame, but I keep getting an error. I assume there is some package I am missing, but I am not sure how to figure out which. Any…

r tidytext

asked May 12 '20 at 19:10

Susan Ray
