
I have individual-level cause-of-death data (from the 19th century) and want to compare the frequencies between males and females, either using scatterplots or by comparing word clouds. I have managed to do this with the following commands (shown here for comparing word clouds):

library(tm)         # Corpus, TermDocumentMatrix
library(wordcloud)  # comparison.cloud

# Assumes `female` and `male` each hold the text for one group
all <- c(female, male)
corpus <- Corpus(VectorSource(all))
tdm <- as.matrix(TermDocumentMatrix(corpus))
colnames(tdm) <- c("female", "male")
comparison.cloud(tdm, max.words = 200, random.order = FALSE, rot.per = 0,
                 colors = c("indianred3", "lightsteelblue3"),
                 use.r.layout = FALSE, title.size = 3)

At some point during this process the causes of death are split into single words (they are still intact when I read in the data). My question: is there a way to make word clouds or scatterplots that take into account that some causes of death consist of more than one word? For example: "verval" + "van" + "krachten" do not mean much separately, but merged together "verval van krachten" (roughly "decline of strength") is a highly frequent cause of death with a proper meaning.
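One way to keep multi-word causes together is to replace the default single-word tokenizer with an n-gram tokenizer via the `tokenize` control option of `TermDocumentMatrix()`. A minimal sketch using `ngrams()` and `words()` from the NLP package; the `causes` sample vector is hypothetical:

```r
library(tm)   # Corpus, TermDocumentMatrix
library(NLP)  # ngrams(), words()

# Hypothetical sample: one cause of death per document
causes <- c("verval van krachten", "verval van krachten", "tering")
corpus <- Corpus(VectorSource(causes))

# Emit two-word tokens instead of single words
BigramTokenizer <- function(x)
  unlist(lapply(ngrams(words(x), 2), paste, collapse = " "),
         use.names = FALSE)

tdm <- TermDocumentMatrix(corpus,
                          control = list(tokenize = BigramTokenizer))
inspect(tdm)
```

Passing `1:2` instead of `2` to `ngrams()` should yield a mix of unigrams and bigrams, though note that a frequent phrase will then compete with its component words for space in the cloud.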

Sommerseth
  • The text gets split into single "tokens" by `tdm = TermDocumentMatrix(corpus)`. You can adjust this to make bigrams (two-word units), trigrams, etc. If you want a mix of single- and multi-word entities you will probably need to look at something more sophisticated (LDA or word2vec, possibly). – emilliman5 Apr 04 '18 at 21:30
  • I would double-check this line of code: `colnames(tdm) = c("female", "male")`. `tdm` should have as many columns as `length(all)`, which means you are labeling `tdm` as male/female in an alternating fashion, when in fact it should be something like `rep(c("female", "male"), c(length(female), length(male)))`. – emilliman5 Apr 04 '18 at 21:34
  • Have a look at this blogpost http://www.bnosac.be/index.php/blog/77-an-overview-of-keyword-extraction-techniques which uses the udpipe package to do exactly that. –  Apr 05 '18 at 07:56
  • @jwijffels A good suggestion, though it worked badly on my data. Although the data is cleaned and standardized, it didn't match the language packages well. For example, about 50% of the words tagged as nouns are actually adjectives. – Sommerseth Apr 12 '18 at 07:53
  • @emilliman5 Do I understand you right that your suggestion for a mix of single- and multi-word entities would require building my own training set and doing machine learning? Sorry, the reading I did on these subjects was quite advanced. – Sommerseth Apr 12 '18 at 07:58
  • Interesting. Which language is this, and which udpipe language model did you download? Part-of-speech tagging has an accuracy of about 95% for each language (see "How good are these models" at https://github.com/bnosac/udpipe), so 50% seems quite low in your case. –  Apr 12 '18 at 08:11
  • I used Norwegian-bokmaal for the Norwegian data and Dutch for the Dutch data. I think some of the problem could be related to the mix of Latin and Norwegian, or Latin and Dutch, and even Latinised words, since we're dealing with historical cause-of-death data. That said, I noticed that several of the adjectives in the noun class are common Norwegian adjectives, and the lemmatization did some strange things too, like cutting words in odd places. – Sommerseth Apr 12 '18 at 08:21
  • I've just opened the github link - did you make the udpipe program? I so want it to work on my data, because it would contribute to our research in a profound way. – Sommerseth Apr 12 '18 at 08:32
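emilliman5's point about the column labels can be sketched as follows, assuming `female` and `male` are character vectors with one cause of death per record (the sample data below is made up): label one column per record with `rep()`, then sum the records into the two columns that `comparison.cloud()` expects.

```r
library(tm)  # Corpus, TermDocumentMatrix

# Hypothetical data: one cause of death per record
female <- c("verval van krachten", "tering")
male   <- c("tering", "waterzucht")

corpus <- Corpus(VectorSource(c(female, male)))
tdm <- as.matrix(TermDocumentMatrix(corpus))

# One column per record, labeled by sex
colnames(tdm) <- rep(c("female", "male"), c(length(female), length(male)))

# Collapse to the two group-level columns comparison.cloud() expects
tdm2 <- cbind(
  female = rowSums(tdm[, colnames(tdm) == "female", drop = FALSE]),
  male   = rowSums(tdm[, colnames(tdm) == "male",   drop = FALSE])
)
tdm2
```

If `female` and `male` each already hold a single concatenated string per group, the original two-column labeling works as-is; the aggregation above only matters when there is one document per individual.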

0 Answers