
textstat_keyness in quanteda is used to compare the relative frequency of words/lemmas in two (sub)corpora. But I want to compare parts of speech, not words, and then plot the result. I've been able to use textstat_keyness for words without a problem, using the following:

# compare subcorpusA v subcorpusB terms using grouping
genre <- ifelse(docvars(corpusAB, "genre") == "group", "group", "group2")
dfmat3 <- dfm(corpusAB, groups = genre)
tstat1 <- textstat_keyness(dfmat3, measure = "lr", sort = TRUE, correction = "williams")
head(tstat1, 20)
tail(tstat1, 20)
head(dfmat3)
textplot_keyness(tstat1, show_reference = TRUE,
                 show_legend = TRUE,
                 n = 40,
                 min_count = 5, margin = 0.05,
                 color = c("darkblue", "gray"),
                 labelcolor = "gray30",
                 labelsize = 2,
                 font = NULL)

I've also tokenized the corpus using tokens(), and I've parsed it using spacy_parse(). But I can't figure out how to connect the two. Is there a way to tell quanteda to run textstat_keyness on POS tags instead of words?

dfayers

1 Answer


For this you will need to tag the POS, and then treat the POS as a token. This is easy with the spacyr package, which integrates nicely with quanteda.

library("quanteda")
## Package version: 1.5.1
## Parallel computing: 2 of 12 threads used.

library("spacyr")

# parse just Obama 2013 and Trump 2017
corp <- data_corpus_inaugural %>%
  corpus_subset(Year > 2012)
stoks <- spacy_parse(corp)
## Found 'spacy_condaenv'. spacyr will use this environment
## successfully initialized (spaCy Version: 2.1.4, language model: en)
## (python options: type = "condaenv", value = "spacy_condaenv")
head(stoks)
##       doc_id sentence_id token_id     token     lemma   pos   entity
## 1 2013-Obama           1        1      Vice      Vice PROPN         
## 2 2013-Obama           1        2 President President PROPN         
## 3 2013-Obama           1        3     Biden     Biden PROPN PERSON_B
## 4 2013-Obama           1        4         ,         , PUNCT         
## 5 2013-Obama           1        5       Mr.       Mr. PROPN         
## 6 2013-Obama           1        6     Chief     Chief PROPN

That's just a data.frame, so we can replace the token with the POS column.

# replace token with its POS tag
stoks$token <- stoks$pos

# convert to quanteda tokens and build dfm
qtoks <- as.tokens(stoks)
lapply(qtoks, head)
## $`2013-Obama`
## [1] "PROPN" "PROPN" "PROPN" "PUNCT" "PROPN" "PROPN"
## 
## $`2017-Trump`
## [1] "PROPN" "PROPN" "PROPN" "PUNCT" "PROPN" "PROPN"

Now, computing keyness on the POS is straightforward.

# build dfm and test keyness
dfm(qtoks, tolower = FALSE) %>%
  textstat_keyness()
##    feature          chi2            p n_target n_reference
## 1      ADP   6.387342458 1.149370e-02      283         164
## 2     NOUN   5.476967163 1.926866e-02      480         301
## 3      DET   2.817500159 9.324152e-02      325         207
## 4     PRON   1.892097227 1.689656e-01      144          88
## 5      AUX   1.816335501 1.777501e-01        7           1
## 6     PART   0.292103144 5.888759e-01       45          29
## 7     VERB   0.009290769 9.232119e-01      392         285
## 8      ADJ  -0.016739029 8.970574e-01      133          99
## 9    CCONJ  -0.565423892 4.520831e-01      116          94
## 10     ADV  -0.710485982 3.992825e-01      134         109
## 11    INTJ  -0.878721837 3.485520e-01        0           2
## 12     NUM  -1.367004189 2.423273e-01       10          12
## 13   PROPN  -7.214619688 7.231213e-03       56          66
## 14   PUNCT  -7.990735309 4.701732e-03      228         215
## 15   SPACE -36.058636043 1.914683e-09       28          71

Patterns? Obama (the target) used more "adpositions" (prepositions and postpositions), Trump used more spaces. (Space Force - go figure.)
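The question also asked about plotting the result. The same textplot_keyness() call works on a POS keyness object exactly as it does on a word-level one; a minimal sketch, rebuilding the POS keyness from the steps above but keeping the result in an object this time (plot options borrowed from the question):

```r
library("quanteda")
library("spacyr")

# rebuild the POS-as-token keyness object from the answer above
corp <- corpus_subset(data_corpus_inaugural, Year > 2012)
stoks <- spacy_parse(corp)
stoks$token <- stoks$pos
tstat_pos <- textstat_keyness(dfm(as.tokens(stoks), tolower = FALSE))

# plot it with the same options the question used for word keyness
textplot_keyness(tstat_pos,
                 show_reference = TRUE,
                 show_legend = TRUE,
                 n = 15,
                 color = c("darkblue", "gray"))
```

Since there are only 15 or so POS tags in total, n = 15 shows them all; min_count and margin from the word-level call are unnecessary here.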

Ken Benoit
  • Thanks for the help. It took a while, but I now see what you did. The inaugural address data includes one text per President. My data set is different, which is causing me some confusion. I have two corpora, A and B. Each corpus (a) represents a specific genre and (b) has over 600 short texts. – dfayers Nov 01 '19 at 17:02
  • When I run spacy_parse, the docvars go away. So quanteda doesn't know what to consider the target or reference. The doc_id structure is like this: genreA.csv.1, genreA.csv.2 ... genreB.csv.1, genreB.csv.2, ... Should I consolidate each genre into a single text? Or is there a better way to tell quanteda how to distinguish between target and reference corpora? Thanks again. – dfayers Nov 01 '19 at 17:11
  • This was not in the question, and still not entirely clear. May I suggest you open a new question with a reproducible, simple-as-possible representation of your problem? – Ken Benoit Nov 01 '19 at 18:09
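For completeness, one way to handle the commenter's setup: since spacy_parse() preserves doc_ids, the genre can be recovered from the doc_id prefix after converting back to tokens, attached as a docvar, and used both for grouping and as the target argument of textstat_keyness(). An untested sketch, using a toy tokens object whose doc_ids follow the pattern the commenter described (in practice qtoks would come from as.tokens() on the spacy_parse() output with token replaced by pos, as in the answer):

```r
library("quanteda")

# toy stand-in for the POS tokens, with doc_ids like the commenter's
qtoks <- as.tokens(list(
  "genreA.csv.1" = c("NOUN", "VERB", "NOUN"),
  "genreA.csv.2" = c("NOUN", "ADP"),
  "genreB.csv.1" = c("ADJ", "NOUN", "VERB"),
  "genreB.csv.2" = c("ADJ", "ADV")
))

# recover the genre from the doc_id prefix and store it as a docvar
docvars(qtoks, "genre") <- sub("\\.csv\\.\\d+$", "", docnames(qtoks))

# group documents by genre, then test keyness with genreA as the target
dfmat_pos <- dfm(qtoks, tolower = FALSE, groups = "genre")
textstat_keyness(dfmat_pos, target = "genreA")
```

With the documents grouped into one row per genre, the target argument names the row to treat as the target corpus; the other row becomes the reference. No need to consolidate the texts by hand.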