
I am doing text analysis of bigrams. I want to preserve "complex" words made of many "simple" words linked by hyphens.

For example, if I have the following vector:

Example <- c("bovine retention-of-placenta sulpha-trimethoprim mineral-vitamin-liquid-mixture")

**** I edited this section to make the output I need clearer ****

I want my bigrams in a data.frame of dimensions 3x1 (which is the format you obtain when using unnest_tokens from tidytext):


1 bovine                   retention-of-placenta
2 retention-of-placenta    sulpha-trimethoprim
3 sulpha-trimethoprim      mineral-vitamin-liquid-mixture

**** end of the edit ****

My problem is that with tidytext, the token option takes either "ngrams" (which is the sort of analysis I am performing) or "regex" (which is what I could use to handle these hyphens), but not both.

This is the code I am using at the moment:

spdiag_bigrams <- diagnostics %>% unnest_tokens(bigram, text, token = "ngrams", n = 2)

How can I do both things at the same time?

thank you

JPV

2 Answers


It's true that unnest_tokens() strips most punctuation by default, but it does not strip underscores:

library(tidyverse) 
library(tidytext)

example <- c("bovine retention-of-placenta sulpha-trimethoprim mineral-vitamin-liquid-mixture")

tibble(text = example) %>% 
  mutate(text = str_replace_all(text, "-", "_")) %>%
  unnest_tokens(word, text)
#> # A tibble: 4 x 1
#>   word                          
#>   <chr>                         
#> 1 bovine                        
#> 2 retention_of_placenta         
#> 3 sulpha_trimethoprim           
#> 4 mineral_vitamin_liquid_mixture

Created on 2019-11-01 by the reprex package (v0.3.0)

I sometimes take this approach for multi-word tokens. If instead you want to analyze punctuation along with words, check out the strip_punct = FALSE option that is available.
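Since the question is specifically about bigrams, the same underscore trick works with token = "ngrams" too: swap hyphens for underscores before tokenizing, then swap them back afterward. A sketch (assuming the same example vector as above):

```r
library(tidyverse)
library(tidytext)

example <- c("bovine retention-of-placenta sulpha-trimethoprim mineral-vitamin-liquid-mixture")

tibble(text = example) %>%
  # protect the hyphens, since the ngram tokenizer keeps underscores
  mutate(text = str_replace_all(text, "-", "_")) %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  # restore the hyphens in the finished bigrams
  mutate(bigram = str_replace_all(bigram, "_", "-"))
```

This yields one bigram per row with the hyphenated compounds intact, which is the output format the question asks for.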

Julia Silge

unnest_tokens removes punctuation by default, which strips the hyphens out of the complex words.

You can use either quanteda or tm for this, as these packages do not remove punctuation by default. The examples below assume that you have a data.frame and are working with a corpus, but quanteda's tokens function can also work directly on text columns.

example <- c("bovine retention-of-placenta sulpha-trimethoprim mineral-vitamin-liquid-mixture")
diagnostics <- data.frame(text = example, stringsAsFactors = FALSE)

with quanteda:

library(quanteda)

qcorp <- corpus(diagnostics)

bigrams <- tokens_ngrams(tokens(qcorp), n = 2, concatenator = " ")
qdfm <- dfm(bigrams)
convert(qdfm, "data.frame")

  document bovine retention-of-placenta retention-of-placenta sulpha-trimethoprim sulpha-trimethoprim mineral-vitamin-liquid-mixture
1    text1                            1                                         1                                                  1

just quanteda's tokens_ngrams using the example vector:

tokens_ngrams(tokens(example), n = 2, concatenator = " ")
tokens from 1 document.
text1 :
[1] "bovine retention-of-placenta"                       "retention-of-placenta sulpha-trimethoprim"         
[3] "sulpha-trimethoprim mineral-vitamin-liquid-mixture"
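If what you want is one bigram per row (as unnest_tokens produces) rather than a document-feature matrix, the tokens object above can be flattened into a data.frame directly; a sketch:

```r
library(quanteda)

example <- c("bovine retention-of-placenta sulpha-trimethoprim mineral-vitamin-liquid-mixture")

toks <- tokens_ngrams(tokens(example), n = 2, concatenator = " ")

# unlist() flattens the per-document token lists into one character vector,
# giving one bigram per row
data.frame(bigram = unlist(toks, use.names = FALSE),
           stringsAsFactors = FALSE)
```

This skips the dfm step entirely, so repeated bigrams are not collapsed into counts.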

Edit:

To get a vector of your terms, you could use one of the other convert options and use the $vocab to get the terms.

convert(qdfm, "lda")$vocab
[1] "bovine retention-of-placenta"                       "retention-of-placenta sulpha-trimethoprim"         
[3] "sulpha-trimethoprim mineral-vitamin-liquid-mixture"

Tidy data.frame:

tidytext has a tidy function to transform data from diverse packages into a tidy form. Both quanteda and tm are included. So after getting the data into a dfm, you can use tidy to get the data into a tibble. After that remove all columns you are not interested in with the usual dplyr syntax.

tidy(qdfm)

# A tibble: 3 x 3
  document term                                               count
  <chr>    <chr>                                              <dbl>
1 text1    bovine retention-of-placenta                           1
2 text1    retention-of-placenta sulpha-trimethoprim              1
3 text1    sulpha-trimethoprim mineral-vitamin-liquid-mixture     1

end edit:

with tm:

library(tm)
NLPBigramTokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
}

corp <- VCorpus(VectorSource(example))

dtm <- DocumentTermMatrix(corp, control=list(tokenize = NLPBigramTokenizer))
inspect(dtm)

<<DocumentTermMatrix (documents: 1, terms: 3)>>
Non-/sparse entries: 3/0
Sparsity           : 0%
Maximal term length: 50
Weighting          : term frequency (tf)
Sample             :
    Terms
Docs bovine retention-of-placenta retention-of-placenta sulpha-trimethoprim sulpha-trimethoprim mineral-vitamin-liquid-mixture
   1                            1                                         1                                                  1
phiver
  • This is almost the response I need. I will edit my question to clarify the final output I need. Briefly, I would like to get the same output format you get when using tidytext, that is, a data.frame that contains ALL the bigrams (one per row). The ```quanteda``` function ```dfm``` groups all bigrams of the same type, which is not useful for my analysis. – JPV Oct 09 '19 at 07:28