R function for pattern matching

Question

I am doing a text mining project that will analyze some speeches from the three remaining presidential candidates. I have completed POS tagging with OpenNLP and created a two-column data frame with the results. I have also added a variable called pair. Here is a sample from the Clinton data frame:

           V1   V2  pair
1          c(  NN  FALSE
2      "thank VBP  FALSE
3         you PRP  FALSE
4          so  RB  FALSE
5        much  RB  FALSE
6           .   .  FALSE
7          it PRP  FALSE
8          is VBZ  FALSE
9   wonderful  JJ  FALSE
10         to  TO  FALSE
11         be  VB  FALSE
12       here  RB  FALSE
13        and  CC  FALSE
14        see  VB  FALSE
15         so  RB  FALSE
16       many  JJ  FALSE
17    friends NNS  FALSE
18          .   .  FALSE
19        ive  JJ  FALSE
20     spoken VBN  FALSE

What I'm now trying to do is write a function that will iterate through the V2 POS column and evaluate it for specific pattern pairs. (These come from Turney's PMI article.) I'm not yet very knowledgeable when it comes to writing functions, so I'm certain I've done it wrong, but here is what I've got so far.

pairs <- function(x){

  JJ <- "JJ"      #adjectives
  N <- "N[A-Z]"   #any noun form
  R <- "R[A-Z]"   #any adverb form
  V <- "V[A-Z]"   #any verb form

  for(i in 1:(length)(x) {
      if(x == J && x+1 == N) {    #i.e., if the first word = J and the next = N
        pair[i] <- "JJ|NN"     #insert this into the 'pair' variable
      } else if (x == R && x+1 == J && x+2 != N) {
        pair[i] <- "RB|JJ"
      } else if  (x == J && x+1 == J && x+2 != N) {
        pair[i] <- "JJ|JJ"
      } else if (x == N && x+1 == J && x+2 != N) {
        pair[i] <- "NN|JJ"
      } else if (x == R && x+1 == V) {
        pair[i] <- "RB|VB"
         } else {
         pair[i] <- "FALSE"
         }
  }
}

# Run the function
cl.df.pairs <- pairs(cl.df$V2)

There are a number of (truly embarrassing) issues. First, when I try to run the function code, I get two Error: unexpected '}' in " }" errors at the end. I can't figure out why, because they match opening "{". I'm assuming it's because R is expecting something else to be there.

Also, and more importantly, this function won't exactly get me what I want, which is to extract the word pairs that match a pattern and then the pattern that they match. I honestly have no idea how to do that.

Then I need to figure out how to evaluate the semantic orientation of each word combo by comparing the phrases to the pos/neg lexical data sets that I have, but that's a whole other issue. I have the formula from the article, which I'm hoping will point me in the right direction.

I have looked all over and can't find a comparable function in any of the NLP packages, such as OpenNLP, RTextTools, etc. I HAVE looked at other SO questions/answers, like this one and this one, but they haven't worked for me when I've tried to adapt them. I'm fairly certain I'm missing something obvious here, so would appreciate any advice.

EDIT:

Here are the first 20 lines of the Sanders data frame.

head(sa.POS.df, 20)
           V1   V2
1         the   DT
2    american   JJ
3      people  NNS
4         are  VBP
5    catching  VBG
6          on   RB
7           .    .
8        they  PRP
9  understand  VBP
10       that   IN
11  something   NN
12         is  VBZ
13 profoundly   RB
14      wrong   JJ
15       when  WRB
16          ,    ,
17         in   IN
18        our PRP$
19    country   NN
20      today   NN

And I've written the following function:

pairs <- function(x, y) {
  require(gsubfn)
  J <- "JJ"      #adjectives
  N <- "N[A-Z]"   #any noun form
  R <- "R[A-Z]"   #any adverb form
  V <- "V[A-Z]"   #any verb form

  for(i in 1:(length(x))) {
    ngram <- c(x[[i]], x[[i+1]]) 
# the ngram consists of the word on line `i` and the word below line `i`
  }
  strapply(y[i], "(J)\n(N)", FUN = paste(ngram, sep = " "), simplify = TRUE)

  ngrams.df = data.frame(ngrams=ngram)
  return(ngrams.df)
}

So, what is SUPPOSED to happen is that when strapply matches the pattern (in this case, an adjective followed by a noun), it should paste the ngram. All of the resulting ngrams should then populate ngrams.df.
I entered the following function call and got an error:

> sa.JN <- pairs(x=sa.POS.df$V1, y=sa.POS.df$V2)
Error in x[[i + 1]] : subscript out of bounds

I'm only just learning the intricacies of regular expressions, so I'm not quite sure how to get my function to pull the actual adjective and noun. Based on the data shown here, it should pull "american" and "people" and paste them into the data frame.

This is a problem: `for(i in 1:(length)(x)`. It should be `for (i in 1:length(x))` — nrussell, May 18 '16 at 14:37
Thanks, but I'm still getting the "}" errors. Also, when I try to run the function on one of my data sets, I get `Error in pairs.default(cl.df$V2) : only one column in the argument to 'pairs'` — ldlpdx, May 18 '16 at 14:42
The `}` \ `"` error might stem from your data. V1 in the second row is `"thank`. You could try to replace it with `\"thank` and do the same in case of other occurrences. Probably a better solution could be to remove entirely the punctuation in the input data. — RHertel, May 18 '16 at 14:45
Some desired output corresponding to your sample input would go a long way to clarify what you're after. — Gregor Thomas, May 18 '16 at 22:40
Please see my edit, @Gregor, to see what I'm trying to extract from the data frame. — ldlpdx, May 21 '16 at 19:06
I see lots of attempts, and an intermediate result, but I still don't see a sample of the output that you would like to see from the sample input you gave. You give the example of `"big banks"` and `"major corporation"` from one of Bernie's speeches, but your data sample is from a HC speech and doesn't include those words. And in your sample data `pair` is always false. Can you construct a small, nearly trivial, example data set that has some output to illustrate what you're trying to accomplish? — Gregor Thomas, May 22 '16 at 07:17
Hi @Gregor. I've deleted my old edit and added one that should answer the questions that you've posed. I hope. — ldlpdx, May 25 '16 at 18:45
Much better and clearer. I don't have time while I'm at work, but I can probably write a solution this evening if no one gets there first. — Gregor Thomas, May 25 '16 at 19:38
Hi @Gregor. Have you thought of a way that I can get the output I'm looking for? I'm completely at a loss. — ldlpdx, May 27 '16 at 09:38
Yes - but I just got to work again :\. I believe your main problem is the newline bit - unless the `gsubfn` package does things **much** differently than standard R, each element of the vector will be considered as an independent string - including both patterns separated by `\n` will not match consecutive vector elements. — Gregor Thomas, May 27 '16 at 16:57
You need to get the indices of `JJ` and of `NN` matches, and then look for times they occur next to each other. This could be done many ways. If `jj_match` and `nn_match` are boolean vectors of matches, something like this might work: `which(c(jj_match, FALSE) & c(FALSE, nn_match))` — Gregor Thomas, May 27 '16 at 16:59
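
The offset idea from the last two comments can be sketched in base R. This is only an illustration on toy data (the `words` and `tags` vectors below are made up to mirror the Sanders sample), not the code from any answer. It shows that a pattern containing `\n` never matches across vector elements, and that shifting two logical vectors against each other does find adjacent tags.

```r
# Toy data (hypothetical), mirroring the first rows of the Sanders sample
words <- c("the", "american", "people", "are", "catching", "on")
tags  <- c("DT",  "JJ",       "NNS",    "VBP", "VBG",      "RB")

# Each element is matched on its own, so a pattern that spans two
# elements never matches anything:
any(grepl("JJ\nNN", tags))
# [1] FALSE

# Instead, build one logical vector per pattern ...
jj_match <- grepl("^JJ", tags)
nn_match <- grepl("^NN", tags)

# ... then line up element i of the first with element i + 1 of the second
# by dropping the last element of one and the first element of the other.
n <- length(tags)
first <- which(jj_match[-n] & nn_match[-1])

# `first` holds the position of the adjective; the noun sits at first + 1.
bigrams <- paste(words[first], words[first + 1])
bigrams
# [1] "american people"
```

Which vector gets shifted decides the order of the pair, so it is worth checking on a small example like this that the adjective really comes first.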

Erin · Answer 1 · 2016-05-18T22:21:17.193

1

I think the following is the code you wrote, but without throwing errors:

pairs <- function(x) {

  J <- "JJ"      #adjectives
  N <- "N[A-Z]"   #any noun form
  R <- "R[A-Z]"   #any adverb form
  V <- "V[A-Z]"   #any verb form

  pair = rep("FALSE", nrow(x))   # one entry per row; length(x) would count columns
  for(i in 1:(nrow(x)-2)) {
    this.pos = x[i,2]
    next.pos = x[i+1,2]
    next.next.pos = x[i+2,2]
    if(this.pos == J && next.pos == N) {    #i.e., if the first word = J and the next = N
      pair[i] <- "JJ|NN"     #insert this into the 'pair' variable
    } else if (this.pos == R && next.pos == J && next.next.pos != N) {
      pair[i] <- "RB|JJ"
    } else if  (this.pos == J && next.pos == J && next.next.pos != N) {
      pair[i] <- "JJ|JJ"
    } else if (this.pos == N && next.pos == J && next.next.pos != N) {
      pair[i] <- "NN|JJ"
    } else if (this.pos == R && next.pos == V) {
      pair[i] <- "RB|VB"
    } else {
      pair[i] <- "FALSE"
    }
  }

  ## then deal with the last two elements, for which you can't check what's up next

  return(pair)
}

Not sure what you mean by this, though:

"Also, and more importantly, this function won't exactly get me what I want, which is to extract the word pairs that match a pattern and then the pattern that they match. I honestly have no idea how to do that."

edited May 18 '16 at 22:21

answered May 18 '16 at 14:55

Erin


Thanks, that actually does work without errors. However, what it returns is just a list of logicals for each document it runs through. It doesn't include the words that make up the n-grams. That's what that last part that you couldn't flesh out means. So what I want is the pair designation, i.e., "JJ|NN," AND the n-gram that it matches. Would I just include that in `return(x[i,2], pairs)`? Wait, no, that would only return one word, right? As you can tell, I'm completely lost. – ldlpdx May 18 '16 at 16:08
I tried `return(x[i,2], x[i+1,2], pair)`, but when I ran my data through it, I got `Error in return(x[i, 2], x[i + 1, 2], pair) : multi-argument returns are not permitted`. Seems like I need to ask it to return a data.frame, yes? – ldlpdx May 18 '16 at 16:11
You could have it return a named list or vector e.g. `c(ngram=ngrams, pair=pairs)` and then make a `data.frame(pair(x))` of the result. I realized there's another issue with your code, which is the `this.pos==J` logic is not quite right. You want do a regex match probably, e.g. with `grep`. And I corrected `length(x)` to `nrow(x)`. – Erin May 18 '16 at 22:21
I could not get the regex match to work within a function @Erin, so I just did each match manually. If you look at my *EDIT* above, you will see what I'm grappling with now. Any thoughts would be appreciated. – ldlpdx May 21 '16 at 19:04
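
For what it's worth, the two suggestions in these comments (regex matching instead of `==`, and returning a data frame instead of several values) could be combined roughly as follows. This is a minimal sketch, not Erin's code: the helper name `tag_pairs` and the toy data are made up, and only two of the five patterns are shown. The patterns are anchored with `^` because an unanchored `"R[A-Z]"` also matches tags such as `PRP` and `WRB`.

```r
# Hypothetical helper: label adjacent tag pairs and keep the words as well.
tag_pairs <- function(word, type) {
  word <- as.character(word)   # accept factor columns too
  type <- as.character(type)
  n <- length(type)
  ngram <- character(0)
  pair  <- character(0)
  for (i in seq_len(max(n - 1, 0))) {
    # grepl() does a regex match; `==` would compare against the literal pattern text
    label <- if (grepl("^JJ", type[i]) && grepl("^NN", type[i + 1])) {
      "JJ|NN"
    } else if (grepl("^RB", type[i]) && grepl("^VB", type[i + 1])) {
      "RB|VB"
    } else {
      NA_character_
    }
    if (!is.na(label)) {
      ngram <- c(ngram, paste(word[i], word[i + 1]))
      pair  <- c(pair, label)
    }
  }
  # A function returns a single object, so bundle both columns in a data frame
  data.frame(ngram = ngram, pair = pair, stringsAsFactors = FALSE)
}

# Toy data (hypothetical)
toy <- data.frame(
  V1 = c("they", "profoundly", "disagree", "with", "big", "banks"),
  V2 = c("PRP",  "RB",         "VBP",      "IN",   "JJ",  "NNS"),
  stringsAsFactors = FALSE
)
result <- tag_pairs(toy$V1, toy$V2)
result
#                 ngram  pair
# 1 profoundly disagree RB|VB
# 2           big banks JJ|NN
```

Growing `ngram` and `pair` inside the loop is fine for a speech-sized input; for much larger data a vectorised approach would likely be faster.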

score 1 · Accepted Answer · answered May 28 '16 at 06:57

Okay, here we go. Using this data (shared nicely with dput()):

df = structure(list(V1 = structure(c(15L, 3L, 11L, 4L, 5L, 9L, 2L, 
16L, 18L, 14L, 13L, 8L, 12L, 20L, 19L, 1L, 7L, 10L, 6L, 17L), .Label = c(",", 
".", "american", "are", "catching", "country", "in", "is", "on", 
"our", "people", "profoundly", "something", "that", "the", "they", 
"today", "understand", "when", "wrong"), class = "factor"), V2 = structure(c(3L, 
5L, 7L, 12L, 11L, 10L, 2L, 8L, 12L, 4L, 6L, 13L, 10L, 5L, 14L, 
1L, 4L, 9L, 6L, 6L), .Label = c(",", ".", "DT", "IN", "JJ", "NN", 
"NNS", "PRP", "PRP$", "RB", "VBG", "VBP", "VBZ", "WRB"), class = "factor")), .Names = c("V1", 
"V2"), class = "data.frame", row.names = c("1", "2", "3", "4", 
"5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", 
"16", "17", "18", "19", "20"))

I'll use the stringr package because of its consistent syntax, so I don't have to look up the argument order for grep. We'll first detect the adjectives, then the nouns, and figure out where they line up (offsetting by 1). Then we paste together the words that correspond to the matches.

library(stringr)
adj = str_detect(df$V2, "JJ")
noun = str_detect(df$V2, "NN")

pairs = which(c(FALSE, adj) & c(noun, FALSE))

ngram = paste(df$V1[pairs - 1], df$V1[pairs])
# [1] "american people"

Now we can put it in a function. I left the patterns as arguments (with adjective, noun as the defaults) for flexibility.

bigram = function(word, type, patt1 = "JJ", patt2 = "N[A-Z]") {
    pairs = which(c(FALSE, str_detect(type, pattern = patt1)) &
                      c(str_detect(type, patt2), FALSE))
    return(paste(word[pairs - 1], word[pairs]))
}

Demonstrating use on the original data:

with(df, bigram(word = V1, type = V2))
# [1] "american people"

Let's cook up some data with more than one match to make sure it works:

df2 = data.frame(w = c("american", "people", "hate", "a", "big", "bad",  "bank"),
                 t = c("JJ", "NNS", "VBP", "DT", "JJ", "JJ", "NN"))
df2
#          w   t
# 1 american  JJ
# 2   people NNS
# 3     hate VBP
# 4        a  DT
# 5      big  JJ
# 6      bad  JJ
# 7     bank  NN

with(df2, bigram(word = w, type = t))
# [1] "american people" "bad bank"

And back to the original to test out a different pattern:

with(df, bigram(word = V1, type = V2, patt1 = "N[A-Z]", patt2 = "V[A-Z]"))
# [1] "people are"   "something is"

Oh, man! Thank you SO VERY much @Gregor! Totally generous, totally appreciated. A couple of questions: in `patt1` you have `c(FALSE, ...)`, but in `patt2` you've put `c(....), FALSE)`. Why is `FALSE` at the beginning of `patt1` and at the end of `patt2`? I see this a lot, but when I looked up `?str_detect`, I didn't see what that `FALSE` was accomplishing, and I do want to learn this, not just rely on others to give me code. — ldlpdx, May 29 '16 at 18:08
My other question @Gregor: I've tried to adjust the function to allow that some of the word strings should NOT be followed by an `NN`. I tried using `patt3 != "N[A-Z]"` and I tried `patt3 = !"N[A-Z]"`. I got an error both times that they were `invalid argument type`, which confused me because isn't `!=` the way you're supposed to do it? So then, I just tried running it with `patt3 = "N[A-Z]"` to see what would happen and got the following warning: `longer object length is not a multiple of shorter object length`. It's not super crucial that I assess these, so don't spend a lot of effort. — ldlpdx, May 29 '16 at 18:09
`str_detect` is returning boolean vectors (true and false). I add an extra false to the beginning of the pattern 1 result vector to offset it --- so it will line up with the vector indexes of the pattern 2 vector. Similarly, I add a false to the end of the pattern 2 vector so that the two vectors still have the same length. — Gregor Thomas, May 29 '16 at 19:53
Look carefully and you'll see that the `FALSE`s are *outside* the `str_detect()` calls - they're being appended to the results. — Gregor Thomas, May 29 '16 at 19:53
`patt1` and `patt2` are just arguments that get passed directly to `str_detect`. `bigram(... patt1 != "N[A-Z]")` is nonsense, it would be like calling `mean(x != 2)`. And there is no argument named `patt3`, so passing that in is just wishful thinking, like `mean(1:5, r_please_exclude_5_from_the_mean = TRUE)`. — Gregor Thomas, May 29 '16 at 20:03
What you need to do is read up a little on regular expressions and make a regular expression that meets your conditions. For example, *not* NN would be `"[^NN]"` in regular expression, so adjective followed by anything but a noun would be `bigram(V1, V2, patt1 = "JJ", patt2 = "[^NN]")`, or something like that. And test! Run `cbind(yourdata$V2, str_detect(yourdata$V2, pattern = "[^NN]"))` to debug and make sure the things you want to be picked up are getting picked up. — Gregor Thomas, May 29 '16 at 20:04
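
A note on the "not followed by a noun" condition, following the advice to test: `[^NN]` is a character class that matches any single character other than `N`, so it is TRUE for `NNS` and `JJ` alike. One way that might work instead is to detect nouns as usual and negate the logical vector with `!`. The sketch below uses base R `grepl()` so it runs without extra packages (`!str_detect(...)` should behave the same way). The helper name `bigram_not` and the toy data are made up for illustration.

```r
# What "[^NN]" really tests: "contains at least one character that is not N"
grepl("[^NN]", c("NN", "NNS", "JJ"))
# [1] FALSE  TRUE  TRUE

# Hypothetical helper: patt1 followed by patt2, where the tag after the pair
# must NOT match patt3 (running off the end of the text counts as "not a noun").
bigram_not <- function(word, type, patt1 = "^JJ", patt2 = "^JJ", patt3 = "^NN") {
  word <- as.character(word)
  type <- as.character(type)
  n <- length(type)
  if (n < 2) return(character(0))
  first  <- grepl(patt1, type)                            # tag at i
  second <- c(grepl(patt2, type)[-1], FALSE)              # tag at i + 1
  third  <- c(grepl(patt3, type)[-(1:2)], FALSE, FALSE)   # tag at i + 2
  hits <- which(first & second & !third)
  paste(word[hits], word[hits + 1])
}

# Toy data (hypothetical)
w <- c("a",  "big", "bad", "bank", "is",  "really", "very", "bad")
t <- c("DT", "JJ",  "JJ",  "NN",   "VBZ", "RB",     "RB",   "JJ")

# "big bad" is followed by a noun, so the JJ|JJ pattern finds nothing here
jj_jj <- bigram_not(w, t)
jj_jj
# character(0)

# "very bad" is an adverb + adjective pair with no noun after it
rb_jj <- bigram_not(w, t, patt1 = "^RB")
rb_jj
# [1] "very bad"
```

The padding with `FALSE` serves the same purpose as in the `bigram` function: it keeps all three vectors the same length after shifting.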
