
I have a string containing a sentence parse and want to extract the substrings enclosed by a given pair of opening and closing brackets. The catch is that other brackets of the same type (parentheses in this case) are nested inside and also need to be captured. So the number of opening parentheses associated with an NP must equal the number of closing parentheses.

In this example:

x <- "(TOP (S (NP (NNP Clifford)) (NP (DT the) (JJ big) (JJ red) (NN dog)) (VP (VBD ate) (NP (PRP$ my) (NN lunch)))(. .)))"

Let's say I want to extract the noun phrases (NP) into the three substrings below:

(NP (NNP Clifford))
(NP (DT the) (JJ big) (JJ red) (NN dog))
(NP (PRP$ my) (NN lunch))

This should generalize to other parts of the string: if I wanted to grab the VP brackets, I could follow the same logic.

Tyler Rinker
  • How deep of nesting? – hwnd Oct 02 '15 at 00:51
  • I don't believe you can have an NP within another NP. The parentheses could in theory be infinitely nested, but I'm guessing in reality they are 2-3 levels deep. The VP example is an example of nesting: `(VP (VBD ate) (NP (PRP$ my) (NN lunch)))` – Tyler Rinker Oct 02 '15 at 00:57
  • I deleted my solution as it did not count parentheses. But did it solve the problem you're facing? – Pierre L Oct 02 '15 at 01:01
  • @PierreLafortune I upvoted your solution. I wouldn't have deleted it. – Tyler Rinker Oct 02 '15 at 01:02
  • Well, if it helps, let me know. The other user correctly pointed out that it does not count parentheses as requested. But if it solves the problem, you may not have to count them. – Pierre L Oct 02 '15 at 01:04

3 Answers


The language of balanced parentheses is not regular, so it cannot be matched with basic regular expressions. You could do this with recursive regular expressions (for which see hwnd's answer), but I don't recommend it as the syntax gets rather ugly. Instead, build a parser out of simpler regular expressions, variables, and program control flow. Something like this:

for each character:
    if it's a (, increment the nesting depth.
    if it's a ), decrement the nesting depth.
    if the nesting depth is exactly zero, we've reached the end of this expression.

Alternatively, use a library like openNLP which is already capable of doing this parsing for you.

Kevin
  • Could you show how I could use `openNLP` to pull these phrases out automatically? Somehow the balance of parentheses must be parsable: if I don't have enough braces in my R script, R knows and expects a closing brace. Syntax highlighters that know when you're missing a closing brace must do this task as well. – Tyler Rinker Oct 02 '15 at 00:36
  • @TylerRinker: It is entirely possible to parse. It's just not (reasonably) possible to do so *with a regular expression*. openNLP [has code for producing these parse trees in the first place](http://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.parser) as well as [manipulating them](http://opennlp.apache.org/documentation/1.6.0/apidocs/opennlp-tools/opennlp/tools/parser/Parse.html), but I can't find documentation for the R bindings. They are likely very similar to the Java interface, though. – Kevin Oct 02 '15 at 00:41
  • Yeah I like the Python parsing and was hoping to get this sort of ability in R. – Tyler Rinker Oct 02 '15 at 00:52

I'm not sure if the substring is always going to be defined, but in this case you could do:

regmatches(x, 
    gregexpr('(?x)
              (?=\\(NP)           # lookahead: the match must begin with (NP
                (                 # start of group 1
                \\(               # match open parenthesis
                    (?:           # start grouping construct
                        [^()]++   # one or more non-parenthesis (possessive)
                          |       # OR
                        (?1)      # found ( or ), recurse 1st subpattern
                    )*            # end grouping construct
                \\)               # match closing parenthesis
                )                 # end of group 1
             ', x, perl=TRUE))[[1]]

# [1] "(NP (NNP Clifford))"                     
# [2] "(NP (DT the) (JJ big) (JJ red) (NN dog))"
# [3] "(NP (PRP$ my) (NN lunch))"  
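
To reuse the same recursive pattern for other phrase types (the VP case from the question, for instance), one option is to wrap it in a small function. This is only a sketch: `extract_tag` is a hypothetical name, and it assumes `tag` consists of plain letters, since the tag is pasted into the pattern unescaped.

```r
# Hypothetical wrapper around the recursive pattern above.
# Assumes `tag` contains no regex metacharacters (e.g. "NP", "VP").
extract_tag <- function(x, tag) {
  pattern <- paste0(
    "(?=\\(", tag, "\\s)",          # lookahead: match starts with "(TAG" plus whitespace
    "(\\((?:[^()]++|(?1))*\\))"     # group 1: one balanced (...) block
  )
  regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
}

x <- "(TOP (S (NP (NNP Clifford)) (NP (DT the) (JJ big) (JJ red) (NN dog)) (VP (VBD ate) (NP (PRP$ my) (NN lunch)))(. .)))"
extract_tag(x, "VP")
# [1] "(VP (VBD ate) (NP (PRP$ my) (NN lunch)))"
```

Because `gregexpr` returns non-overlapping matches, a phrase nested inside another phrase of the same tag would be reported only as part of the outer match.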
hwnd

You can use Avinash Raj's new package:

library(dangas)
extract_all_a("(NP", "))", x, delim=TRUE)
[[1]]
[1] "(NP (NNP Clifford))"                     
[2] "(NP (DT the) (JJ big) (JJ red) (NN dog))"
[3] "(NP (PRP$ my) (NN lunch))"

GitHub link here. Install using: devtools::install_github("Avinash-Raj/dangas/dangas")


If you have trouble downloading it, try:

library(stringr)
str_extract_all(x, "\\(NP.*?\\)\\)")

Update

@Kevin correctly informed me that I overlooked the balanced parentheses request. But as you mentioned in the comments, you may not need it for your problem. Please report back if it helps; if not, I will delete.
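
To make the limitation concrete, here is a small base R sketch that applies the same lazy `))`-terminated idea to the VP phrase from the question. The helper `count_char` is a hypothetical name introduced only for this illustration.

```r
x <- "(TOP (S (NP (NNP Clifford)) (NP (DT the) (JJ big) (JJ red) (NN dog)) (VP (VBD ate) (NP (PRP$ my) (NN lunch)))(. .)))"

# Lazy match from "(VP" to the first "))": it stops one parenthesis too early
lazy <- regmatches(x, gregexpr("\\(VP.*?\\)\\)", x, perl = TRUE))[[1]]
lazy
# [1] "(VP (VBD ate) (NP (PRP$ my) (NN lunch))"

# Hypothetical helper: count occurrences of a single character in a string
count_char <- function(s, ch) sum(strsplit(s, "", fixed = TRUE)[[1]] == ch)
count_char(lazy, "(")  # 5 opening parentheses
count_char(lazy, ")")  # 4 closing parentheses, so the match is unbalanced
```

The NP phrases in the example happen to end at the first `))`, which is why the pattern works for them but not for deeper nesting.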

Pierre L
  • Does this catch expressions which do not end in `))`? Does it correctly reject expressions which do end in `))` but which are not balanced? – Kevin Oct 02 '15 at 00:46
  • Test it out, @Kevin. It matches the range from the first argument to the second argument. `delim=TRUE` instructs it to include the arguments in the match. – Pierre L Oct 02 '15 at 00:48
  • That's not what OP asked for. They wanted balanced parentheses. – Kevin Oct 02 '15 at 00:48
  • 1
    It doesn't quiet solve the problem (as Kevin points out) but it may be of great use to future users. +1 – Tyler Rinker Oct 02 '15 at 01:51