Hi, I have two arrays: 'topWords' of length N (unique words) and 'observedWords' of length < N (which may contain repeated words).

I'd like an array of counts 'countArray' of length N containing the number of times each of the N words in 'topWords' occurs in the array 'observedWords'. What is an efficient way to do this in R?

hearse
  • Have you looked at the [tm](http://cran.r-project.org/web/packages/tm/index.html) package? And, if you don't want to use a package, there are plenty of ways. [This](http://johnvictoranderson.org/?p=115) is one of them. – hrbrmstr Mar 21 '14 at 01:01

3 Answers

You could use the table and match functions; see the example code below. I'm not sure whether it suits your case.

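topWords <- c('A', 'B', 'C')
observedWords <- c(rep('A', 5), rep('B', 4))
# Count each observed word, then look up every top word in that table
count <- table(observedWords)
pos <- match(topWords, names(count))
fre <- as.numeric(count)[pos]
# Top words that never occur come back as NA; report them as 0
fre[is.na(fre)] <- 0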
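
A variation on the same idea, sketched under the assumption that only words in topWords need counting: turning observedWords into a factor whose levels are topWords makes table report a zero for words that never occur and keeps the counts in topWords order. The name countArray is taken from the question; the rest is illustrative.

```r
topWords <- c("A", "B", "C")
observedWords <- c(rep("A", 5), rep("B", 4))

# Fixing the factor levels to topWords makes table() emit one cell per
# top word, including zero-count cells, in the order of topWords.
# Observed words that are not in topWords become NA and are not counted.
countArray <- as.vector(table(factor(observedWords, levels = topWords)))
names(countArray) <- topWords
countArray
```

This avoids the separate match step, at the cost of building a factor first.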
Bangyou

Here's a simple example using match and tabulate. match maps each observed word to its index in topWords, and tabulate counts how often each index occurs, giving 0 for words that never appear.

> topWords <- paste(LETTERS, letters, sep = "")
> topWords
##  [1] "Aa" "Bb" "Cc" "Dd" "Ee" "Ff" "Gg" "Hh" "Ii" "Jj" "Kk" "Ll" "Mm" "Nn" "Oo"
## [16] "Pp" "Qq" "Rr" "Ss" "Tt" "Uu" "Vv" "Ww" "Xx" "Yy" "Zz"
> observedWords <- c("Bb", rep("Mm", 2), rep("Pp", 3))
> observedWords
## [1] "Bb" "Mm" "Mm" "Pp" "Pp" "Pp"
> mm <- match(observedWords, topWords)
> tabulate(mm, nbins = length(topWords))
## [1] 0 1 0 0 0 0 0 0 0 0 0 0 2 0 0 3 0 0 0 0 0 0 0 0 0 0
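
As a cross-check, a direct count per word can serve as a reference. This is only a sketch: it is slower on large inputs because it scans observedWords once per top word, and it deliberately lists the observed words in a different order so that a word's count cannot be confused with the position at which it first appears. The name countArray comes from the question.

```r
topWords <- paste(LETTERS, letters, sep = "")
observedWords <- c(rep("Pp", 3), "Bb", rep("Mm", 2))

# For each top word, count how many observed words are equal to it.
# vapply() enforces that every result is a single integer.
countArray <- vapply(topWords,
                     function(w) sum(observedWords == w),
                     integer(1))

# Show only the words that actually occur.
countArray[countArray > 0]
```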
Rich Scriven

Using RScriv's example:

topWords <- paste(LETTERS, letters, sep = "")
observedWords <- c("Bb", rep("Mm", 2), rep("Pp", 3))

library(qdap)
termco(observedWords, match.list=topWords)

##   all word.count Aa        Bb Cc Dd Ee Ff Gg Hh Ii Jj Kk Ll        Mm Nn Oo        Pp Qq Rr Ss Tt Uu Vv Ww Xx Yy Zz
## 1 all          6  0 1(16.67%)  0  0  0  0  0  0  0  0  0  0 2(33.33%)  0  0 3(50.00%)  0  0  0  0  0  0  0  0  0  0

And if you want just the frequencies, wrap the call with the counts method:

counts(termco(observedWords, match.list=topWords))
Tyler Rinker