Questions tagged [tidytext]

The tidytext package provides tools for text mining using tidy data principles in R.

The R tidytext package, developed by Julia Silge and David Robinson, provides functions and supporting data sets to allow conversion of text to and from tidy formats, and to switch seamlessly between tidy tools and existing text mining packages. When text is in a tidy data structure, tools from the R tidyverse ecosystem can be used for effective data handling and analysis.
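As a minimal sketch of what a tidy text format looks like, the snippet below tokenizes a small made-up data frame into one word per row with `unnest_tokens()` and then counts words with dplyr. The sample text and the column names `line`, `text`, and `word` are illustrative assumptions.

```r
library(dplyr)
library(tidytext)

# Hypothetical input: one row per line of text
text_df <- tibble(
  line = 1:2,
  text = c("The quick brown fox", "jumps over the lazy dog")
)

# One token (word) per row; unnest_tokens() lowercases by default
tidy_words <- text_df %>%
  unnest_tokens(word, text)

# Ordinary dplyr verbs now work on the tokens
word_counts <- tidy_words %>%
  count(word, sort = TRUE)
```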

294 questions
0 votes, 1 answer

How to correctly remove stop words using tidytext package in R?

I am using the stop_words dataset in the tidytext package in R to remove stop words. I am using the following code: library(tidyverse) library(tidytext) library(dplyr) data(stop_words) example_words <- c("the", "quick", "brown", "fox", "jumps", "over", "the",…
student_R123
  • 962
  • 11
  • 30
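For readers landing on this question, a common pattern is to put the words in a data frame column and drop stop words with `anti_join()`. This is only a sketch under the assumption that the tokens sit in a column named `word`; the word vector below is made up and is not the asker's data.

```r
library(dplyr)
library(tidytext)

data(stop_words)

# Hypothetical tokens, one per row, in a column called `word`
words_df <- tibble(word = c("the", "quick", "brown", "fox", "over"))

# Keep only rows whose word does not appear in stop_words
cleaned <- words_df %>%
  anti_join(stop_words, by = "word")
```

Because `stop_words` stores lowercase words, tokens usually need to be lowercased first for the join to match.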
0 votes, 1 answer

Extract different hashtags "#" from a text stored in a Dataframe with the R language

I have a data frame with some tweets and I want to extract the hashtags from the tweets using the unnest_tokens() function of the tidytext package, creating a tokenized data frame with one row per hashtag. My data only has 3 columns: Fecha: that is a…
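One way to get one row per hashtag, sketched here with stringr and tidyr rather than `unnest_tokens()`, is to extract all matches of a hashtag pattern into a list column and unnest it. The data frame, its column names, and the simple pattern `#\\w+` are assumptions for illustration.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical tweets
tweets <- tibble(
  id = 1:2,
  text = c("Loving #rstats and #tidytext", "No tags here")
)

# Extract every hashtag into a list column, then expand to one row per hashtag.
# Tweets without hashtags are dropped by unnest() by default.
hashtags <- tweets %>%
  mutate(hashtag = str_extract_all(text, "#\\w+")) %>%
  unnest(hashtag)
```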
0 votes, 0 answers

I'm trying to count the number of words in a text but the count function is throwing an error message. I will be grateful for any help. Thanks

library(tidytext) library (dplyr) anfarm %>% unnest_tokens(output = "word", input = "text_column", token = "words") %>% count(word, sort = TRUE) #> Error in UseMethod("count") : #> no applicable method…
Nana
  • 1
  • 1
0 votes, 3 answers

Removing specific text R

I have a character vector in a data frame in R which contains inbound email text. Most of the rows contain 'Dear x,' where x is any intended recipient and x can vary. There could also be typos such as the incorrect use of lowercase. Either way, the…
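A possible approach to stripping a leading greeting, assuming the goal is to remove "Dear x," at the start of each message regardless of case, is a case-insensitive regular expression with stringr. The sample emails are invented.

```r
library(stringr)

# Hypothetical inbound email text
emails <- c("Dear John, please call me.", "dear ms. smith,  thanks for the update.")

# Remove "dear", the recipient (anything up to the first comma), and the comma
greeting <- regex("^\\s*dear\\s+[^,]*,\\s*", ignore_case = TRUE)
bodies <- str_remove(emails, greeting)
```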
0 votes, 1 answer

ggplot sort descending points within group

I want to arrange the plot below so that 'group' is arranged in descending order by 'Distance' within Community (Out, In). I've tried using dplyr::arrange and tidytext::reorder_within(group, -value, MPA_type), but neither of these work - ggplot…
Joshua Smith
  • 125
  • 6
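A sketch of how `reorder_within()` is usually paired with `scale_x_reordered()` and free facet scales follows. The data and the column names `community`, `group`, and `distance` are hypothetical stand-ins for the asker's variables.

```r
library(dplyr)
library(ggplot2)
library(tidytext)

# Hypothetical data: three groups measured in two communities
df <- tibble(
  community = rep(c("In", "Out"), each = 3),
  group     = rep(c("a", "b", "c"), times = 2),
  distance  = c(3, 1, 2, 2, 3, 1)
)

# Order groups by descending distance separately within each community
reordered <- df %>%
  mutate(group_ord = reorder_within(group, -distance, community))

# scale_x_reordered() strips the suffix that reorder_within() appends;
# the facets need free x scales for the per-facet order to show
p <- ggplot(reordered, aes(group_ord, distance)) +
  geom_point() +
  scale_x_reordered() +
  facet_wrap(~ community, scales = "free_x")
```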
0 votes, 2 answers

Passing a vector of characters into another string in R

I would like to know how to pass a vector of text into a string within R. I have a list of emails stored as a character vector: all.emails…
0 votes, 0 answers

Rstudio tokenizing multiple documents messy

I am trying to tokenize different documents in Rstudio, but because the documents are really big it gets messy when tokenizing it with 1 word in a row. Is there a solution to keep the tokenized words in 1 row? I first made a corpus and then…
0 votes, 2 answers

Is there a way in R to find a combination of words (or sentences) within a certain range in a string

I'm trying to find all strings with a combination of words/sentences with other words separating them but with a fixed limit. Example : I want the combination of "bought" and "watch" but with, at maximum, 2 words separating them. I bought a…
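One way to express "bought, then at most two intervening words, then watch" is a regular expression with a bounded repetition. The sketch below uses base R `grepl()` with `perl = TRUE`; the sentences are invented, and the pattern only matches the two words in this order.

```r
# Hypothetical sentences
sentences <- c(
  "I bought a watch",
  "I bought a nice watch",
  "I bought a very nice gold watch"
)

# "bought", then zero to two whole words, then "watch"
pattern <- "\\bbought\\W+(?:\\w+\\W+){0,2}watch\\b"
matches <- grepl(pattern, sentences, perl = TRUE)
```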
0 votes, 2 answers

Extract a 100-Character Window around Keywords in Text Data with R (Quanteda or Tidytext Packages)

This is my first time asking a question on here so I hope I don't miss any crucial parts. I want to perform sentiment analysis on windows of speeches around certain keywords. My dataset is a large csv file containing a number of speeches, but I'm…
kornpat
  • 27
  • 3
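A character-based window can be sketched with stringr by locating each keyword occurrence and slicing around it. The helper `extract_window()` below is hypothetical, and it assumes the window means `width` characters on each side of the keyword; the question may intend a different definition.

```r
library(stringr)

# Hypothetical helper: return the text surrounding every occurrence of `keyword`,
# with up to `width` characters on each side
extract_window <- function(text, keyword, width = 100) {
  hits <- str_locate_all(text, fixed(keyword))[[1]]
  starts <- pmax(hits[, "start"] - width, 1)
  ends <- pmin(hits[, "end"] + width, str_length(text))
  str_sub(text, starts, ends)
}

window <- extract_window("abcdefghij", "ef", width = 2)
```

The resulting windows can then be placed in a data frame column and tokenized for sentiment scoring.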
0 votes, 1 answer

How do I load large (25k and + words) .txt documents to then structure it as one token per row?

How could I load a big folder (more than 100 .txt files) of files for text mining (analysing the most frequent words, their evolution, word clustering and topic, POS, and so on) with the tidytext package? I am currently using Silge and Robinson's "text…
IvanLdF
  • 1
  • 1
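A minimal sketch of reading a folder of .txt files into one row per document and then one token per row is shown below. To stay self-contained it writes two small files to a temporary directory first; in practice `corpus_dir` would point at the real folder. All file names and contents are made up.

```r
library(dplyr)
library(tidytext)

# Create a small demo corpus in a temporary directory (stand-in for a real folder)
corpus_dir <- tempfile("corpus_demo")
dir.create(corpus_dir)
writeLines(c("First document text.", "More text here."), file.path(corpus_dir, "a.txt"))
writeLines("Second document.", file.path(corpus_dir, "b.txt"))

# List the files and read each one into a single string
files <- list.files(corpus_dir, pattern = "\\.txt$", full.names = TRUE)
corpus <- tibble(
  document = basename(files),
  text = vapply(
    files,
    function(f) paste(readLines(f, warn = FALSE), collapse = " "),
    character(1),
    USE.NAMES = FALSE
  )
)

# One token per row, keeping the document name alongside each word
tidy_corpus <- corpus %>%
  unnest_tokens(word, text)
```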
0 votes, 1 answer

how to unlist a `tknlist`?

step_tokenize returns a vector of type tknlist. How can I get a rectangular form of it? I mean something like unnesting the tokens and adding them as columns of the tibble. library(textrecipes) library(modeldata) data(tate_text) tate_rec <- recipe(~., data…
Nip
  • 387
  • 4
  • 11
0 votes, 0 answers

Get zero tf_idf from dfm with quanteda r

I want to create a Document-feature matrix with tf_idf as weights. If I calculate the tf_idf like in https://quanteda.io/reference/dfm_tfidf.html, I get only zeros. The same if I try to get tf_idf with tidytext from the same token dataset. Looks to…
padul
  • 134
  • 11
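One possible cause of all-zero tf-idf values, which the excerpt does not confirm, is that every term occurs in every document (or that there is only one document), so the inverse document frequency is ln(1) = 0. The sketch below illustrates this with `bind_tf_idf()` on made-up counts.

```r
library(dplyr)
library(tidytext)

# Hypothetical term counts: "shared" occurs in both documents, "unique" in one
counts <- tibble(
  document = c("d1", "d1", "d2"),
  word     = c("shared", "unique", "shared"),
  n        = c(2, 1, 3)
)

tfidf <- counts %>%
  bind_tf_idf(word, document, n)

# idf for "shared" is log(2 / 2) = 0, so its tf-idf is 0 in every document
shared_scores <- tfidf$tf_idf[tfidf$word == "shared"]
unique_score  <- tfidf$tf_idf[tfidf$word == "unique"]
```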
0 votes, 0 answers

Restore original data from document term matrix in R

I want to know if there is a way to go back to my original database (df) after I have made it a document term matrix. Here is an example of what I want to do. df <- data.frame(group=c("A","A","B","B","C"), comment = c("hello…
0 votes, 1 answer

Errors in counting + combining bing sentiment score variables in Tidytext?

I'm doing sentiment analysis on a large corpus of text. I'm using the bing lexicon in tidytext to get simple binary pos/neg classifications, but want to calculate the ratios of positive to total (positive & negative) words within a document. I'm…
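A sketch of computing the share of positive words among all sentiment-bearing words per document follows. The tokens are invented, and the column names `document` and `word` are assumptions. Filling missing sentiment counts with 0 when widening avoids `NA` ratios for documents that lack one of the two classes.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Hypothetical tokens, one word per row
words <- tibble(
  document = c(1, 1, 1, 2, 2),
  word     = c("good", "bad", "happy", "terrible", "table")
)

ratios <- words %>%
  inner_join(get_sentiments("bing"), by = "word") %>%   # keep only lexicon words
  count(document, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(positive_share = positive / (positive + negative))
```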
0 votes, 2 answers

Tidytext R - find and replace

I have the results from a survey, in which a bunch of answers have errors, such as misspellings, UppercAseS/lower cases, ... Therefore, I need something like a find and replace kind of solution (I've found some possible functions but none of them…
Tiago
  • 21
  • 4
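A simple find-and-replace sketch with stringr: normalize case first, then apply a named vector of corrections with `str_replace_all()`. The survey answers and the correction table are made up for illustration.

```r
library(stringr)

# Hypothetical survey answers with inconsistent case and a misspelling
answers <- c("Portugal", "portugal", "Protugal", "SPAIN")

# Names are the patterns to find, values are the replacements
corrections <- c("protugal" = "portugal")

cleaned <- str_replace_all(str_to_lower(answers), corrections)
```

The names in `corrections` are treated as regular expressions, so special characters in a pattern would need escaping or wrapping with `fixed()`.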