Questions tagged [tidytext]

The tidytext package provides tools for text mining using tidy data principles in R.

The R tidytext package, developed by Julia Silge and David Robinson, provides functions and supporting data sets to allow conversion of text to and from tidy formats, and to switch seamlessly between tidy tools and existing text mining packages. When text is in a tidy data structure, tools from the R tidyverse ecosystem can be used for effective data handling and analysis.

294 questions
2 votes, 2 answers

Graph with ordered bars and using facets

I am trying to make a bar graph with bars ordered by frequency, split into facets by a second variable. Words have to be ordered by the value given in the 'n' variable. So, my graph should look like this one which appears in…
Tito Sanz
2 votes, 1 answer

Passing `top_n` and `arrange` to ggplot (dplyr)

There is a lovely chunk of code in TidyText Mining Section 3.3 that I am trying to replicate in my own dataset. However, in my data, I cannot get ggplot to 'remember' that I want the data in descending order, and that I want a certain top_n. I can…
JMacKay
2 votes, 4 answers

Using tidytext and broom but not finding tidier for LDA_VEM

The tidytext book has examples with a tidier for topicmodels: library(tidyverse) library(tidytext) library(topicmodels) library(broom) year_word_counts <- tibble(year = c("2007", "2008", "2009"), word = c("dog", "cat",…
Isaiah
2 votes, 1 answer

unnest_tokens fails to handle vectors in R with tidytext package

I want to use the tidytext package to create a column with 'ngrams', using the following code: library(tidytext) unnest_tokens(tbl = president_tweets, output = bigrams, input = text, token = "ngrams", …
Tdebeus
2 votes, 1 answer

Using unnest_tokens() to split a column by a specific character?

I'm working with a column of vectors of urls formatted as a string with each url separated by a comma: column_with_urls ["url.a, url.b, url.c"] ["url.d, url.e, url.f"] I would like to use the tidytext::unnest_tokens() R function to separate these…
Josh
2 votes, 1 answer

Remove stop words from data frame

My data is already in a data frame, with one token per line. I'd like to filter out the rows that contain stop words. The dataframe looks like: docID <- c(1,2,2) token <- c('the', 'cat', 'sat') count <- c(10,20,30) df <- data.frame(docID, token,…
Adam_G
2 votes, 1 answer

Web scraping pdf files from HTML

How can I scrape the PDF documents from HTML? I am using R and can only extract the text from HTML. An example of the website that I am going to scrape is as…
SChatcha
2 votes, 1 answer

Adding word count size as a layer to the node size on a cooccurrence network chart using tidytext

I'm interested in using a co-occurrence network chart similar to the one shown in section 8.2.2 of David Robinson and Julia Silge's Tidy Text Mining book, such as this chart, except that I would like to have the sizes of the nodes change depending on…
Phil
2 votes, 1 answer

tf-idf document term matrix and LDA: Error messages in R

Can we input a tf-idf document-term matrix into Latent Dirichlet Allocation (LDA)? If yes, how? It does not work in my case, and the LDA function requires the 'term-frequency' document-term matrix. Thank you (I have made the question as concise as possible.…
2 votes, 2 answers

Topic Modelling: LDA , word frequency in each topic and Wordcloud

Question: How can I compute and code the frequency of words in each topic? My goal is to create 'Word Cloud' from each topic. P.S.> I have no problem with wordcloud. From the code, burnin <- 4000 #We do not collect this. iter <- 4000 thin…
1 vote, 2 answers

how can I unnest phrases between brackets

I have text that I am trying to organize for some text mining and am using the TidyText library. I have tried setting the token to a regex and setting a custom pattern, but it ends up returning just the bracket (or nothing) and not the content of…
maijuli
1 vote, 1 answer

Is there a convenient way to deal with "stop phrases" when text mining in R?

I am currently working on a large number of judicial documents. They contain a number of fixed phrases (e.g. Council directive) which due to their frequent occurrence have no meaning for my analysis. Therefore, I would like to remove them. Using a…
banannanas
1 vote, 1 answer

Wordcloud2 - separate words for counting

I am trying to extract the words so that I can create a wordcloud, but I am having some difficulties. This is the code: library(readxl) data <- read_excel("C:\\Users\\me\\OneDrive\\Desktop\\ToPandas.xlsx") data2…
crl6904
1 vote, 1 answer

reorder_within reordering facets in nestedfacet ggplot

Help with reordering facets. I am using reorder_within and scale_x_reordered from Julia Silge's blog (https://juliasilge.com/blog/reorder-within/). I am using nested facets here and reordering facets within a parent facet. In this use case the…
Keelin
1 vote, 1 answer

scale_x_reordered does not work in facet_grid

I am a newbie in R and would like to seek your advice regarding visualization using reorder_within and scale_x_reordered (library: tidytext). I want to show the data (ordered from max to min) by state for each year. This is sample data for…
Kob