
I'm experimenting with tidytext (following Text Mining with R) and I want to use the functions pairwise_count and pairwise_cor from the widyr package. My corpus comes from a pre-processed text file.

library(readr)
library(dplyr)
library(tidytext)
library(widyr)

set.seed(2017)

Korpus <- read_file("/home/knecht/korpus.res")
print(Korpus)

Korpus_DF <- data_frame(document = 1, text = Korpus)

spon_words <- Korpus_DF %>%
  unnest_tokens(word, text)
print(spon_words)

spon_words %>%
  count(word, sort = TRUE)

word_cors <- spon_words %>%
  group_by(word) %>%
  filter(n() >= 10) %>%
  pairwise_cor(word, document, sort = TRUE, upper = FALSE)
word_cors

pair_test <- spon_words %>%
  pairwise_count(word, document)
print(pair_test)

I don't think I'm getting a correct result, because the corpus contains phrases like "spiegel online" or "spiegel plus" multiple times, but these do not appear in the result table:

> library(readr)

> library(dplyr)

> library(tidytext)

> library(widyr)

> set.seed(2017)

> Korpus <- read_file("/home/knecht/korpus.res")

> print(Korpus)
[1] "29.12.2017 17:24:57 Results of ResultWriter 'Write as Text' [1]: \n29.12.2017 17:24:57 SimpleExampleSet:\n1 examples,\n0 regular attributes,\nspecial attributes = {\n    text = #0: text (text/single_value)/values=[SPIEGEL ONLINE Aktuelle Nachrichten Nachrichten SPIEGEL ONLINE Mein SPIEGEL 29. Dezember 2017 TV-Programm Wetter Schlagzeilen Themenwochen Wahl Verbraucher Service Unternehmen Märkte Staat Soziales LOTTO 6aus49 Spielerindex SPIX Champions League Formel Bundesliga präsentiert von Continental Uno-Klimakonferenz 2017 Diagnose Therapie Ernährung Fitness Sex Partnerschaft Schwangerschaft Kind Erster Weltkrieg Zweiter Weltkrieg Leben und Lernen Deals der Woche IAA 2017 Front Page SPIEGEL Plus SPIEGEL Plus Deutschland SPIEGEL Plus Wirtschaft SPIEGEL Plus Gesellschaft SPIEGEL Plus Ausland SPIEGEL Plus Sport SPIEGEL Plus Wissenschaft SPIEGEL Plus Kultur SPIEGEL AKADEMIE DER SPIEGEL live DER SPIEGEL DER SPIEGEL digitales Magazin Titelbilder Heftarchive SPIEGEL SPIEGEL Magazin SPIE... <truncated>

> Korpus_DF <-data_frame(document= 1, text=Korpus)

> spon_words <- Korpus_DF %>%
+   unnest_tokens(word, text)

> print(spon_words)
# A tibble: 3,267 x 2
   document         word
      <dbl>        <chr>
 1        1   29.12.2017
 2        1           17
 3        1           24
 4        1           57
 5        1      results
 6        1           of
 7        1 resultwriter
 8        1        write
 9        1           as
10        1         text
# ... with 3,257 more rows

> spon_words %>%
+   count(word, sort=TRUE)
# A tibble: 1,645 x 2
      word     n
     <chr> <int>
 1    mehr    84
 2     die    78
 3     und    75
 4     der    63
 5 spiegel    58
 6     von    35
 7     sie    32
 8     das    31
 9     ein    31
10     für    31
# ... with 1,635 more rows

> word_cors <- spon_words %>%
+   group_by(word) %>%
+  filter(n()>= 10) %>%
+   pairwise_cor(word, document, sort = TRUE, upper= FALSE)

> word_cors
# A tibble: 561 x 3
     item1  item2 correlation
     <chr>  <chr>       <dbl>
 1 spiegel online         NaN
 2 spiegel   2017         NaN
 3  online   2017         NaN
 4 spiegel    von         NaN
 5  online    von         NaN
 6    2017    von         NaN
 7 spiegel    und         NaN
 8  online    und         NaN
 9    2017    und         NaN
10     von    und         NaN
# ... with 551 more rows

> pair_test <- spon_words %>%
+   pairwise_count(word, document)

> print(pair_test)
# A tibble: 2,704,380 x 3
          item1      item2     n
          <chr>      <chr> <dbl>
 1           17 29.12.2017     1
 2           24 29.12.2017     1
 3           57 29.12.2017     1
 4      results 29.12.2017     1
 5           of 29.12.2017     1
 6 resultwriter 29.12.2017     1
 7        write 29.12.2017     1
 8           as 29.12.2017     1
 9         text 29.12.2017     1
10            1 29.12.2017     1
# ... with 2,704,370 more rows

Can someone give me a hint, please?

Regards, Tobias

1 Answer

I notice here that you have the same value for document for all your words, which makes counting up pairs of words or trying to calculate correlations not very meaningful.

Here's an example to show you what I mean. Let's take Jane Austen's Pride & Prejudice and set up a tidy data frame with two ID columns: one called document that always has the value of 1, the way yours does, and one called section that breaks the text up into chunks.

library(dplyr)
library(janeaustenr)
library(tidytext)
library(widyr)

austen_section_words <- austen_books() %>%
    filter(book == "Pride & Prejudice") %>%
    mutate(section = row_number() %/% 10,   # break the text into 10-line sections
           document = 1) %>%                # a document column that is always 1, like yours
    filter(section > 0) %>%
    unnest_tokens(word, text) %>%
    filter(!word %in% stop_words$word)      # remove stop words

austen_section_words
#> # A tibble: 37,240 x 4
#>    book              section document word        
#>    <fctr>              <dbl>    <dbl> <chr>       
#>  1 Pride & Prejudice    1.00     1.00 truth       
#>  2 Pride & Prejudice    1.00     1.00 universally 
#>  3 Pride & Prejudice    1.00     1.00 acknowledged
#>  4 Pride & Prejudice    1.00     1.00 single      
#>  5 Pride & Prejudice    1.00     1.00 possession  
#>  6 Pride & Prejudice    1.00     1.00 fortune     
#>  7 Pride & Prejudice    1.00     1.00 wife        
#>  8 Pride & Prejudice    1.00     1.00 feelings    
#>  9 Pride & Prejudice    1.00     1.00 views       
#> 10 Pride & Prejudice    1.00     1.00 entering    
#> # ... with 37,230 more rows

Both of these columns have a value of 1 at the beginning, but section goes on to take many other values while document stays 1 throughout. If we try to compare the sets of words using widyr::pairwise_count() or widyr::pairwise_cor(), we get very different results depending on which of these two columns we use. In the first case, we are asking, "How often are these words used together in the sections I defined?" In the second case, we are asking, "How often are these words used together in the whole document?" The answer to the latter is by definition 1, for every pair of words.

word_pairs <- austen_section_words %>%
    pairwise_count(word, section, sort = TRUE)

word_pairs
#> # A tibble: 796,008 x 3
#>    item1     item2         n
#>    <chr>     <chr>     <dbl>
#>  1 darcy     elizabeth 144  
#>  2 elizabeth darcy     144  
#>  3 miss      elizabeth 110  
#>  4 elizabeth miss      110  
#>  5 elizabeth jane      106  
#>  6 jane      elizabeth 106  
#>  7 miss      darcy      92.0
#>  8 darcy     miss       92.0
#>  9 elizabeth bingley    91.0
#> 10 bingley   elizabeth  91.0
#> # ... with 795,998 more rows

word_pairs <- austen_section_words %>%
    pairwise_count(word, document, sort = TRUE)

word_pairs
#> # A tibble: 36,078,042 x 3
#>    item1         item2     n
#>    <chr>         <chr> <dbl>
#>  1 universally   truth  1.00
#>  2 acknowledged  truth  1.00
#>  3 single        truth  1.00
#>  4 possession    truth  1.00
#>  5 fortune       truth  1.00
#>  6 wife          truth  1.00
#>  7 feelings      truth  1.00
#>  8 views         truth  1.00
#>  9 entering      truth  1.00
#> 10 neighbourhood truth  1.00
#> # ... with 36,078,032 more rows

So I think you need to step back and rethink what analytical question you are trying to answer. Do you want to identify bigrams? Are you trying to see which words are used more often near each other? You'll need to change your approach depending on where you are trying to end up.
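
If bigrams are what you want, here is a minimal sketch (reusing the Korpus_DF data frame from the question; I haven't run it on that corpus) that uses tidytext's n-gram tokenizer to count two-word phrases, so that "spiegel online" and "spiegel plus" show up directly:

library(dplyr)
library(tidytext)

# Tokenize into overlapping two-word sequences ("bigrams") instead of
# single words, then count how often each bigram occurs.
spon_bigrams <- Korpus_DF %>%
    unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
    count(bigram, sort = TRUE)

spon_bigrams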

Julia Silge
  • I have 2000 articles from a web page and I'm trying to find co-occurrences on each page and draw them as co-occurrence graphs. To start, I'd like to test on only one page. My input table has the columns document and text and one row, with 1 for the document number and the content in the text cell. So I thought I could use pairwise_count to count how often words are used together in each document. In the end my input table should have the 2 columns and 2000 rows. – Tobias Nehrig Jan 02 '18 at 19:40
  • Since what you are doing is counting how often words appear together in documents, I would test on only, say, 10 articles, not just 1 (before moving to all 2000 articles). It doesn't make sense to only do it on 1. – Julia Silge Jan 04 '18 at 22:43
  • Well, my task is to create a co-occurrence graph for each of the 2000 pages, and after that I have to compare them. – Tobias Nehrig Jan 07 '18 at 16:52
  • @JuliaSilge is there a way to get item1, item2, and the correlation grouped by another variable? – Rana Usman Dec 16 '19 at 10:23
  • @JuliaSilge, related problem in my retail price data (finding the number of instances of cereal SKUs being at the same price in different store locations in a chain). To translate to the Jane Austen framing: I need to determine pairwise occurrences of words (cereal SKUs) within book sections (price points in one store) across different Jane Austen books (stores). I am unable to figure out how to define the "feature" appropriately in the pairwise_count function, because words that occur together in section 1 of Pride and Prejudice and in section 101 of Mansfield Park count as co-occurrences. – user3088463 Nov 10 '20 at 21:37
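
A sketch for that last comment: one way to define the feature is to fuse book and section into a single grouping id before calling pairwise_count(), so that section 1 of Pride and Prejudice and section 101 of Mansfield Park are distinct features and never count as co-occurrences. This swaps in tidyr::unite() to build the id, and book_section is just an illustrative column name, not anything widyr requires:

library(dplyr)
library(tidyr)
library(tidytext)
library(janeaustenr)
library(widyr)

# Number 10-line sections within each book, then combine book and section
# into one feature id, e.g. "Pride & Prejudice_1", so that co-occurrence
# is counted within a book's section and never across books.
austen_words <- austen_books() %>%
    group_by(book) %>%
    mutate(section = row_number() %/% 10) %>%
    ungroup() %>%
    filter(section > 0) %>%
    unite(book_section, book, section, remove = FALSE) %>%
    unnest_tokens(word, text) %>%
    filter(!word %in% stop_words$word)

word_pairs <- austen_words %>%
    pairwise_count(word, book_section, sort = TRUE)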