My question has to do with post-processing of part-of-speech tagged and parsed natural language sentences. Specifically, I am writing a component of a Lisp post-processor that takes as input a sentence parse tree (such as, for example, one produced by the Stanford Parser), extracts from that parse tree the phrase structure rules invoked to generate the parse, and then produces a table of rules and rule counts. An example of input and output would be the following:
(1) Sentence:
John said that he knows who Mary likes
(2) Parser output:
(ROOT
  (S
    (NP (NNP John))
    (VP (VBD said)
      (SBAR (IN that)
        (S
          (NP (PRP he))
          (VP (VBZ knows)
            (SBAR
              (WHNP (WP who))
              (S
                (NP (NNP Mary))
                (VP (VBZ likes))))))))))
(3) My Lisp program post-processor output for this parse tree:
(S --> NP VP) 3
(NP --> NNP) 2
(VP --> VBZ) 1
(WHNP --> WP) 1
(SBAR --> WHNP S) 1
(VP --> VBZ SBAR) 1
(NP --> PRP) 1
(SBAR --> IN S) 1
(VP --> VBD SBAR) 1
(ROOT --> S) 1
Note the lack of punctuation in sentence (1). That's intentional: I am having trouble reading punctuation in Lisp, because some punctuation characters (the comma, for example) are reserved by the Lisp reader for special purposes. But parsing sentences without punctuation changes the distribution of the parse rules, as well as the symbols contained in those rules, as the following illustrates:
(4) Input sentence:
I said no and then I did it anyway
(5) Parser output:
(ROOT
  (S
    (NP (PRP I))
    (VP (VBD said)
      (ADVP (RB no)
        (CC and)
        (RB then))
      (SBAR
        (S
          (NP (PRP I))
          (VP (VBD did)
            (NP (PRP it))
            (ADVP (RB anyway))))))))
(6) Input sentence (with punctuation):
I said no, and then I did it anyway.
(7) Parser output:
(ROOT
  (S
    (S
      (NP (PRP I))
      (VP (VBD said)
        (INTJ (UH no))))
    (, ,)
    (CC and)
    (S
      (ADVP (RB then))
      (NP (PRP I))
      (VP (VBD did)
        (NP (PRP it))
        (ADVP (RB anyway))))
    (. .)))
Note how including punctuation completely rearranges the parse tree and also involves different POS tags (and thus implies that different grammar rules were invoked to produce it). So including punctuation is important, at least for my application.
What I need is a way to include punctuation in rules, so that a table like (3) can contain rules like the following:
(8) Desired rule:
S --> S , CC S .
Rules like (8) are in fact desired for the specific application I am writing.
But I am finding this difficult to do in Lisp: in (7), for example, we see (, ,) and (. .), both of which are problematic for the Lisp reader.
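To make the difficulty concrete, here is a small sketch of how the standard reader treats these tokens. The helper name try-read is hypothetical (it is not part of my post-processor), and the exact error reported for the unescaped forms may differ between implementations; the sketch only assumes that the reader signals some error for them.

```lisp
;; Hypothetical helper (not part of the post-processor): try to READ one
;; form from TEXT, returning :READ-FAILED if the reader signals an error.
(defun try-read (text)
  (handler-case (values (read-from-string text))
    (error () :read-failed)))

(try-read "(, ,)")      ; unescaped comma is only meaningful inside a backquote
(try-read "(. .)")      ; a bare dot is reserved for dotted-pair notation
(try-read "(\\, \\,)")  ; backslash-escaped: a list of two symbols named ","
(try-read "(|.| |.|)")  ; bar-quoted: a list of two symbols named "."
```

The last two calls suggest that the punctuation tokens can exist as ordinary symbols once they are escaped; the difficulty lies in getting them past the reader unescaped.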
I have included my relevant Lisp code below. Please note that I'm a neophyte Lisp hacker, so my code isn't particularly pretty or efficient. If someone could suggest how I might modify the code below so that I can parse (7) to produce a table like (3) that includes a rule like (8), I would be most appreciative.
Here is my Lisp code relevant to this task:
(defun WRITE-RULES-AND-COUNTS-SORTED (sent)
  ;; Print each distinct rule with its count, most frequent first.
  (multiple-value-bind (rules-list counts-list)
      (COUNT-RULES-OCCURRENCES sent)
    ;; Bind the sorted alist locally rather than SETF-ing an undeclared global.
    (let ((comblist (sort (pairlis rules-list counts-list) #'> :key #'cdr)))
      (format t "~%")
      (dolist (pair comblist)
        (format t "~A~26T~A~%" (car pair) (cdr pair)))
      (format t "~%"))))
(defun COUNT-RULES-OCCURRENCES (sent)
  ;; Return two parallel lists: the distinct rules and how often each occurs.
  (let* ((original-rules-list (EXTRACT-GRAMMAR sent))
         (de-duplicated-list (remove-duplicates original-rules-list :test #'equalp))
         (count-list nil))
    (dolist (i de-duplicated-list)
      (push (count i original-rules-list :test #'equalp) count-list))
    (setf count-list (nreverse count-list))
    (values de-duplicated-list count-list)))
(defun EXTRACT-GRAMMAR (sent &optional (rules-stack nil))
  (cond ((null sent)
         NIL)
        ((and (= (length sent) 1)
              (listp (first sent))
              (= (length (first sent)) 2)
              (symbolp (first (first sent)))
              (symbolp (second (first sent))))
         NIL)
        ((and (symbolp (first sent))
              (symbolp (second sent))
              (= 2 (length sent)))
         NIL)
        ((symbolp (first sent))
         (push (EXTRACT-GRAMMAR-RULE sent) rules-stack)
         (append rules-stack (EXTRACT-GRAMMAR (rest sent))))
        ((listp (first sent))
         (cond ((not (and (listp (first sent))
                          (= (length (first sent)) 2)
                          (symbolp (first (first sent)))
                          (symbolp (second (first sent)))))
                (push (EXTRACT-GRAMMAR-RULE (first sent)) rules-stack)
                (append rules-stack (EXTRACT-GRAMMAR (rest (first sent))) (EXTRACT-GRAMMAR (rest sent))))
               (t (append rules-stack (EXTRACT-GRAMMAR (rest sent))))))))
(defun EXTRACT-GRAMMAR-RULE (sentence-or-phrase)
  (append (list (first sentence-or-phrase))
          '(-->)
          (mapcar #'first (rest sentence-or-phrase))))
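Since I mentioned that the code is not efficient: the counting step rescans the whole rule list once per distinct rule. A one-pass alternative could look like the sketch below. The name count-rules is hypothetical, and the sketch assumes that rules are lists of symbols, so an EQUAL hash table is enough to compare them.

```lisp
;; Hypothetical one-pass alternative to the counting step.
;; Assumes each rule is a list of symbols, so EQUAL suffices as the test.
(defun count-rules (rules)
  "Return an alist of (rule . count), most frequent first."
  (let ((table (make-hash-table :test #'equal)))
    ;; Tally every rule in a single pass over the list.
    (dolist (rule rules)
      (incf (gethash rule table 0)))
    ;; Collect the tallies into a fresh alist and sort by count.
    (let ((pairs '()))
      (maphash (lambda (rule n) (push (cons rule n) pairs)) table)
      (sort pairs #'> :key #'cdr))))

;; Usage: the repeated rule comes first, with a count of 2.
(count-rules '((S --> NP VP) (NP --> NNP) (S --> NP VP)))
```

The order among rules with equal counts is unspecified, because hash table iteration order and SORT stability are not guaranteed.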
The code is invoked as follows (using (1) as input, producing (3) as output):
(WRITE-RULES-AND-COUNTS-SORTED '(ROOT
  (S
    (NP (NNP John))
    (VP (VBD said)
      (SBAR (IN that)
        (S
          (NP (PRP he))
          (VP (VBZ knows)
            (SBAR
              (WHNP (WP who))
              (S
                (NP (NNP Mary))
                (VP (VBZ likes)))))))))))
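The invocation above passes the tree as a quoted Lisp form, which is exactly where (, ,) and (. .) cause trouble. One direction that might avoid the problem, sketched here under the assumption that the parser output is available as a string, is to bypass the Lisp reader and build the tree from tokens, interning every token as a symbol. The names tokenize-parse and read-parse are hypothetical, and this is a rough sketch rather than a tested solution.

```lisp
;; Hypothetical sketch: build a parse tree from a string without READ,
;; so that "," and "." become ordinary symbols.
(defun tokenize-parse (text)
  "Split TEXT into a list of strings: \"(\", \")\", and bare tokens."
  (let ((tokens '())
        (current (make-string-output-stream)))
    (flet ((flush ()
             ;; GET-OUTPUT-STREAM-STRING also clears the stream.
             (let ((tok (get-output-stream-string current)))
               (when (plusp (length tok))
                 (push tok tokens)))))
      (loop for ch across text
            do (cond ((member ch '(#\( #\)))
                      (flush)
                      (push (string ch) tokens))
                     ((member ch '(#\Space #\Tab #\Newline #\Return))
                      (flush))
                     (t (write-char ch current))))
      (flush))
    (nreverse tokens)))

(defun read-parse (text)
  "Return the first tree in TEXT as a nested list of symbols."
  (let ((tokens (tokenize-parse text)))
    (labels ((parse-node ()
               (let ((tok (pop tokens)))
                 (cond ((null tok) (error "Unexpected end of parse"))
                       ((string= tok ")") (error "Unbalanced parenthesis"))
                       ((string= tok "(")
                        (loop until (or (null tokens)
                                        (string= (first tokens) ")"))
                              collect (parse-node) into children
                              finally (pop tokens) ; drop the closing ")"
                                      (return children)))
                       ;; Upcase to match the reader's default symbol case.
                       (t (intern (string-upcase tok)))))))
      (parse-node))))

;; Usage: the labels of the top-level children give the desired rule body.
(let ((tree (read-parse "(S (S (NP (PRP I))) (, ,) (CC and) (S (NP (PRP I))) (. .))")))
  (format t "~A~%" (mapcar #'first (rest tree))))
;; prints (S , CC S .)
```

Because ~A prints symbols without escape characters, a rule containing these symbols would appear in the table as S --> S , CC S . rather than with vertical bars. One caveat of interning this way: a token spelled nil becomes the symbol NIL, which is also the empty list.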