
I tried the code from http://tidytextmining.com/tfidf.html. My result can be seen in this image.

My question is: How can I rewrite the code to produce the negative relationship between the term frequency and the rank?

My code is below, followed by the term-document matrix. Any comments are highly appreciated.

# Zipf's law

freq_rk <- DTM_words %>%
  group_by(document) %>%
  mutate(rank = row_number(),
         term_frequency = count / total)

freq_rk %>%
  ggplot(aes(rank, term_frequency, color = document)) +
  geom_line(size = 1.2, alpha = 0.8)


DTM_words
 # A tibble: 4,530 x 5
     document       term count     n total
        <chr>      <chr> <dbl> <int> <dbl>
 1        1      activ     1     1   109
 2        1 agencydebt     1     1   109
 3        1     assess     1     1   109
 4        1      avail     1     1   109
 5        1     balanc     2     1   109
 # ... with 4,520 more rows
  • What do you mean by the negative relationship between the term frequency and the rank? The linear regression of term_frequency predicting rank? Because your plot does not immediately suggest a negative relationship there... It might even be positive. That code would be: lm(rank ~ term_frequency, data = freq_rk) – Nicolás Velasquez Aug 05 '17 at 03:52
  • Thank you Nicolas ...... I mean that the graph I plotted did not look like the standard Zipf's law graph. So, I would like to find code that helps produce that graph. Thanks – SChatcha Aug 05 '17 at 07:57
  • Tom, Zipf's law suggests an inverse relationship. The book you point at even explains it as a power law, that is, a relationship that is linear on a log-log scale. As per your image, your case is described, at best, by a linear relationship. So, it looks like your case does not fall into Zipf's range. One of the reasons your case might fall outside a logarithmic regression is that you seem to have too few cases (i.e. tokens or words), and they are concentrated in a small window (your rank is 1 to 30). Look at how large J. Austen's ranks are, from 1 to ~2500. – Nicolás Velasquez Aug 05 '17 at 12:39
  • Thank you Nicolas ... You make the point clearer. – SChatcha Aug 05 '17 at 13:30

2 Answers


To use row_number() to get rank, you need to make sure that your data frame is ordered by n, the number of times a word is used in a document. Let's look at an example. It sounds like you are starting with a document-term matrix that you are tidying? (I'm going to use some example data that is similar to a DTM from quanteda.)

library(tidyverse)
library(tidytext)

data("data_corpus_inaugural", package = "quanteda")
inaug_dfm <- quanteda::dfm(data_corpus_inaugural, verbose = FALSE)

ap_td <- tidy(inaug_dfm)
ap_td
#> # A tibble: 44,725 x 3
#>           document   term count
#>              <chr>  <chr> <dbl>
#>  1 1789-Washington fellow     3
#>  2 1793-Washington fellow     1
#>  3      1797-Adams fellow     3
#>  4  1801-Jefferson fellow     7
#>  5  1805-Jefferson fellow     8
#>  6    1809-Madison fellow     1
#>  7    1813-Madison fellow     1
#>  8     1817-Monroe fellow     6
#>  9     1821-Monroe fellow    10
#> 10      1825-Adams fellow     3
#> # ... with 44,715 more rows

Notice that here, you have a tidy data frame with one word per row, but it is not ordered by count, the number of times that each word was used in each document. If we used row_number() here to try to assign rank, it isn't meaningful because the words are all jumbled up in order.
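For instance, if we numbered the rows as-is, "rank" would just reflect whatever order tidy() happened to return the rows in, not frequency. A quick illustrative check, reusing the objects from above:

tidy(inaug_dfm) %>%
  group_by(document) %>%
  # without arranging by count first, row_number() numbers the rows
  # in their current (arbitrary) order, so rank 1 is not the most
  # frequent word
  mutate(rank = row_number()) %>%
  filter(document == "1789-Washington") %>%
  head()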

Instead, we can arrange this by descending count.

ap_td <- tidy(inaug_dfm) %>%
  group_by(document) %>%
  arrange(desc(count)) 

ap_td
#> # A tibble: 44,725 x 3
#> # Groups:   document [58]
#>         document  term count
#>            <chr> <chr> <dbl>
#>  1 1841-Harrison   the   829
#>  2 1841-Harrison    of   604
#>  3     1909-Taft   the   486
#>  4 1841-Harrison     ,   407
#>  5     1845-Polk   the   397
#>  6   1821-Monroe   the   360
#>  7 1889-Harrison   the   360
#>  8 1897-McKinley   the   345
#>  9 1841-Harrison    to   318
#> 10 1881-Garfield   the   317
#> # ... with 44,715 more rows

Now we can use row_number() to get rank, because the data frame is actually ranked/arranged/ordered/sorted/however you want to say it.

ap_td <- tidy(inaug_dfm) %>%
  group_by(document) %>%
  arrange(desc(count)) %>%
  mutate(rank = row_number(),
         total = sum(count),
         `term frequency` = count / total)

ap_td
#> # A tibble: 44,725 x 6
#> # Groups:   document [58]
#>         document  term count  rank total `term frequency`
#>            <chr> <chr> <dbl> <int> <dbl>            <dbl>
#>  1 1841-Harrison   the   829     1  9178       0.09032469
#>  2 1841-Harrison    of   604     2  9178       0.06580954
#>  3     1909-Taft   the   486     1  5844       0.08316222
#>  4 1841-Harrison     ,   407     3  9178       0.04434517
#>  5     1845-Polk   the   397     1  5211       0.07618499
#>  6   1821-Monroe   the   360     1  4898       0.07349939
#>  7 1889-Harrison   the   360     1  4744       0.07588533
#>  8 1897-McKinley   the   345     1  4383       0.07871321
#>  9 1841-Harrison    to   318     4  9178       0.03464807
#> 10 1881-Garfield   the   317     1  3240       0.09783951
#> # ... with 44,715 more rows
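One caveat: row_number() breaks ties arbitrarily, so two words used the same number of times in a document end up with different ranks. If you would rather give tied words the same rank, dplyr's min_rank() is a drop-in alternative:

# min_rank() assigns tied counts the same rank (like base rank() with
# ties.method = "min"), instead of breaking ties by row order
ap_td %>%
  group_by(document) %>%
  mutate(rank = min_rank(desc(count)))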

ap_td %>%
  ggplot(aes(rank, `term frequency`, color = document)) +
  geom_line(alpha = 0.8, show.legend = FALSE) + 
  scale_x_log10() +
  scale_y_log10()
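To quantify the inverse relationship rather than just eyeball it, you can fit a linear model on the log-log scale, as the tf-idf chapter of the book does; the slope estimates the (negative) Zipf exponent. A rough sketch, where the rank cutoffs are arbitrary choices meant to avoid the deviations at the extremes of the rank range:

# on a log-log scale Zipf's law is approximately a straight line;
# the slope should come out close to -1 for classic Zipf behavior
rank_subset <- ap_td %>%
  filter(rank < 500, rank > 10)

lm(log10(`term frequency`) ~ log10(rank), data = rank_subset)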

  • Thank you very much Julia! Your elaboration is very clear. Yes, I started with the document-term matrix and used tidy() to convert it into the tidy format. Thanks – SChatcha Aug 05 '17 at 20:40
  • Dear Julia .... It works very well. I can reproduce Zipf's law now. :D – SChatcha Aug 05 '17 at 21:02

To visualize a linear regression (i.e. not Zipf's law), just add a smooth layer with a linear model (lm):

freq_rk %>%
  ggplot(aes(rank, term_frequency, color = document)) +
  geom_line(size = 1.2, alpha = 0.8) +
  geom_smooth(method = "lm")

To identify the differences between Austen's distributions and yours, run the following code:

Austen:

ggplot(freq_by_rank, aes(rank, fill = book)) +
  geom_density(alpha = 0.5) +
  labs(title = "Austen linear")

ggplot(freq_by_rank, aes(rank, fill = book)) +
  geom_density(alpha = 0.5) +
  scale_x_log10() +
  labs(title = "Austen logarithmic")

Tom's sample:

ggplot(freq_rk, aes(rank, fill = document)) +
  geom_density(alpha = 0.5) +
  labs(title = "Sample linear")

ggplot(freq_rk, aes(rank, fill = document)) +
  geom_density(alpha = 0.5) +
  scale_x_log10() +
  labs(title = "Sample logarithmic")
  • Thank you very much Nicolas. I will try replicating and will get back to the forum again. Thanks a lot :) – SChatcha Aug 05 '17 at 13:31
  • Thank you very much for your time and your meaningful comments Nicolas :) – SChatcha Aug 05 '17 at 21:02
  • I hope it helps, Tom. I am curious, what are your units of analysis? The documents you are analyzing? I am about to begin an analysis of several hundreds of thousands of Facebook posts, and I was wondering what literature on text analysis for social media posts is taught nowadays in communication and literature schools. – Nicolás Velasquez Aug 05 '17 at 22:06
  • Hi Nicolas, I am working on central bank statements. I download the statements in CSV format, and each row of the file is a statement for one meeting. For example, there are 8 meetings in one year, so my CSV file will have 8 rows. As for applications of text mining, I would highly recommend you visit Ted Kwartler: https://github.com/kwartler/ODSC_Workshop and this YouTube video: https://www.youtube.com/watch?v=GTrkTDCyO80&t=1539s The video is from an ODSC workshop; the example in the workshop is sentiment analysis. Hope this helps :) – SChatcha Aug 06 '17 at 00:15