Questions tagged [tidytext]

The tidytext package provides tools for text mining using tidy data principles in R.

The R tidytext package, developed by Julia Silge and David Robinson, provides functions and supporting data sets to allow conversion of text to and from tidy formats, and to switch seamlessly between tidy tools and existing text mining packages. When text is in a tidy data structure, tools from the R tidyverse ecosystem can be used for effective data handling and analysis.
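As a minimal sketch of the conversion described above, assuming the tidytext and dplyr packages are installed (the two-row data frame `docs` is made up for illustration), `unnest_tokens()` turns a text column into one token per row:

```r
library(dplyr)
library(tidytext)

# Hypothetical input: one row per document
docs <- tibble(
  line = 1:2,
  text = c("The quick brown fox", "I love books")
)

# One token per row; unnest_tokens() lowercases and strips punctuation by default
tokens <- docs %>%
  unnest_tokens(output = word, input = text)

# Tidy tools such as dplyr::count() then work directly on the result
word_counts <- tokens %>%
  count(word, sort = TRUE)
```

The `line` column is carried along with every token, so each word can still be traced back to its source document.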

294 questions
1 vote, 1 answer

r: unnest_tokens() not working with particular file

I am trying to run unnest_tokens() on the essay4 column of this dataset: https://github.com/rudeboybert/JSE_OkCupid/blob/master/profiles.csv.zip I have tried both unnest_tokens() and unnest_tokens_(), as well as running dput(as_tibble()) on…
qwerty
1 vote, 2 answers

Some help getting started with tidytext

I have a project I'm working on in tidytext, which I'm pretty new to. My input data is currently in the form of individual .txt files in a folder. I successfully used get_sentiments() to track the positive/negative sentiments of my data, but I'm…
1 vote, 1 answer

Error while using unnest_tokens() while passing a function to the token

Error in unnest_tokens.data.frame(., entity, text, token = tokenize_scispacy_entities, : Expected output of tokenizing function to be a list of length 100 The unnest_tokens() call works well for a sample of a few observations but fails on the entire…
Sagar K
1 vote, 1 answer

No applicable method for 'tidy' applied to an object of class "factor" in Tidytext

I'm starting to do text mining in R and I have some problems. I have a CSV with users' comments about a page. Each row is a different comment. It only has 1 column, the one that has the comments. I was trying to use Tidy in R so I import the file…
Pablo
1 vote, 1 answer

Count only alphanumeric characters in a string

Given the string "This has 4 words!" I would like to count only the letters and digits. I would like to exclude whitespace and punctuation. As such, the string above should return 13. I'm not sure why, but I cannot get this to work in R.
Adam_G
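One possible approach to the question above, sketched in base R: drop every character that is not a letter or digit, then count what remains. The helper name `count_alnum` is made up for this example.

```r
# Count only letters and digits by removing everything else first
count_alnum <- function(x) {
  nchar(gsub("[^[:alnum:]]", "", x))
}

count_alnum("This has 4 words!")  # 13
```

The function is vectorised, so it can be applied to a whole character column at once.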
1 vote, 1 answer

Join tokens back to sentence

I am doing text analysis on some free-text data with tidytext. Consider these sample sentences: "The quick brown fox jumps over the lazy dog" "I love books" My token approach using tidytext: unigrams = tweet_text %>% unnest_tokens(output =…
macworthy
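A hedged sketch of one way to rebuild sentences from tokens: group by a document identifier and paste the words back together. The excerpt above is truncated, so the `id` column and the output column name `word` are assumptions, and the original capitalisation and punctuation are not recovered because `unnest_tokens()` removes them by default.

```r
library(dplyr)
library(tidytext)

# Hypothetical input; the id column is an assumption needed for regrouping
tweet_text <- tibble(
  id = 1:2,
  text = c("The quick brown fox jumps over the lazy dog", "I love books")
)

unigrams <- tweet_text %>%
  unnest_tokens(output = word, input = text)

# Collapse the tokens of each document back into a single string
rebuilt <- unigrams %>%
  group_by(id) %>%
  summarise(text = paste(word, collapse = " "), .groups = "drop")
```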
1 vote, 1 answer

Why do I get a dependency error when trying to install the package "tidytext" in RStudio?

I tried to install the tidytext package and received the dependency error below. Please help. ERROR: dependency ‘ISOcodes’ is not available for package ‘stopwords’ ERROR: dependency ‘stopwords’ is not available for package ‘tidytext’
Curi0us
1
vote
1 answer

tidytext: Issue with unnest_tokens and token = 'ngrams'

I'm running the following code: library(rwhatsapp) library(tidytext) chat <- rwa_read(x = c( "31/1/15 04:10:59 - Menganito: Was it good?", "31/1/15 14:10:59 - Fulanito: Yes, it was" )) chat %>% as_tibble() %>% unnest_tokens(output = bigram,…
piblo95
1 vote, 1 answer

Loop over list in R, conduct analysis specific to element in list, save results in element dataframe?

I am trying to replicate an analysis using tidytext in R, except using a loop. The specific example comes from Julia Silge and David Robinson's Text Mining with R: A Tidy Approach. The context for it can be found here:…
1 vote, 1 answer

How to tokenize a PDF file by n-gram in R

I want to tokenize a PDF document by n-grams in R. I tried to follow the instructions here at https://www.tidytextmining.com/ngrams.html, but get stuck with the unnest_tokens()…
dss333
1 vote, 1 answer

Trying to extract a subset of pages from each pdf in a directory with 70 pdf files

I am using tidyverse, tidytext, and pdftools. I want to parse words in a directory of 70 pdf files. I am using these tools to do this successfully but the code below grabs all the pages instead of the subset I want. I need to skip the first two…
1 vote, 1 answer

Find documents that include one of a list of words in R

I have two dataframes: msnbc contains a column of news transcripts called text and dictionary contains a column of words called search. I want to return a new dataframe that includes all rows of msnbc where the text field contains one or more words…
James Martherus
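A minimal sketch of one way to approach the question above in base R: build a single regular expression from the dictionary words and keep the rows whose text matches it. The sample rows are invented, and the sketch assumes the search words contain no regex metacharacters.

```r
# Hypothetical stand-ins for the two data frames named in the question
msnbc <- data.frame(
  text = c("The senate voted today",
           "Weather stays mild",
           "A new budget vote is expected"),
  stringsAsFactors = FALSE
)
dictionary <- data.frame(search = c("senate", "budget"), stringsAsFactors = FALSE)

# Whole-word alternation, e.g. \b(senate|budget)\b
pattern <- paste0("\\b(", paste(dictionary$search, collapse = "|"), ")\\b")

# Keep rows whose text contains at least one dictionary word
matches <- msnbc[grepl(pattern, msnbc$text, ignore.case = TRUE, perl = TRUE), , drop = FALSE]
```

The `\\b` word boundaries keep "vote" from matching inside "voted"; drop them if partial matches are wanted.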
1 vote, 1 answer

How to add words manually to nrc sentiment lexicon?

I plan on using the nrc sentiment lexicon with Twitter data but I realize that there are many words missing. Can anybody guide me on how to add some words with their specific sentiment in R? (I have downloaded the nrc to my environment and also have…
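One way this might be done, sketched with dplyr: treat the lexicon as an ordinary data frame and append rows to it. The small `nrc` tibble below is a stand-in with the same `word` and `sentiment` columns that `get_sentiments("nrc")` returns, and the custom words are invented for illustration.

```r
library(dplyr)

# Stand-in for the downloaded lexicon (assumption: columns word, sentiment)
nrc <- tibble(
  word = c("abandon", "happy"),
  sentiment = c("fear", "joy")
)

# Hypothetical additions
custom_words <- tibble(
  word = c("lol", "ugh"),
  sentiment = c("joy", "disgust")
)

# Append the new rows and drop exact duplicates
nrc_extended <- bind_rows(nrc, custom_words) %>%
  distinct(word, sentiment)
```

The extended table can then be joined to tokenized text with `inner_join()` in the same way as the original lexicon.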
1 vote, 0 answers

Filter the top 20% of a tf_idf DTM by group

I have a text with different classes. My goal is to determine and keep only the features with the highest tf_idf value (top 20%) of each class. As an example, I use the book_of_mormon data set. text is the text and book_title is the class. An idea…
Banjo
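A hedged sketch of one possible approach: compute tf-idf with `bind_tf_idf()`, then filter each class at its own 80th percentile. The toy `docs` data frame replaces the book_of_mormon data set from the question, and using `quantile()` as the cut-off is an assumption about what "top 20%" should mean.

```r
library(dplyr)
library(tidytext)

# Toy stand-in: text is the text, book_title is the class
docs <- tibble(
  book_title = c("Book A", "Book A", "Book B"),
  text = c("the river runs deep", "the river bends", "the mountain stands tall")
)

# Term counts per class, then tf-idf per term and class
tf_idf_tbl <- docs %>%
  unnest_tokens(word, text) %>%
  count(book_title, word) %>%
  bind_tf_idf(word, book_title, n)

# Keep the terms at or above the 80th percentile of tf-idf within each class
top_terms <- tf_idf_tbl %>%
  group_by(book_title) %>%
  filter(tf_idf >= quantile(tf_idf, 0.8)) %>%
  ungroup()
```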
1 vote, 2 answers

How to Combine Multiple Rows Into One Using TidyText

I am looking at a novel and want to search for the appearance of characters' names throughout the book. Some characters go by different names. For example, the character "Sissy Jupe" goes by "Sissy" and "Jupe". I want to combine two rows of word…
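A minimal sketch of one way to merge such rows with dplyr: map each alias to a canonical name, then sum the counts. The counts, the `n` column, and the `aliases` lookup are all invented for illustration, since the excerpt is truncated.

```r
library(dplyr)

# Hypothetical word counts, as produced by count() after unnest_tokens()
word_counts <- tibble(
  word = c("sissy", "jupe", "gradgrind"),
  n = c(10L, 5L, 7L)
)

# Lookup from alias to canonical character name (assumption)
aliases <- c(sissy = "sissy jupe", jupe = "sissy jupe")

# Replace aliases where a mapping exists, then add up the counts per name
combined <- word_counts %>%
  mutate(word = coalesce(unname(aliases[word]), word)) %>%
  group_by(word) %>%
  summarise(n = sum(n), .groups = "drop")
```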