
My question has to do with post-processing of part-of-speech tagged and parsed natural language sentences. Specifically, I am writing a component of a Lisp post-processor that takes as input a sentence parse tree (such as, for example, one produced by the Stanford Parser), extracts from that parse tree the phrase structure rules invoked to generate the parse, and then produces a table of rules and rule counts. An example of input and output would be the following:

(1) Sentence:

John said that he knows who Mary likes

(2) Parser output:

(ROOT
  (S
    (NP (NNP John))
    (VP (VBD said)
      (SBAR (IN that)
        (S
          (NP (PRP he))
          (VP (VBZ knows)
            (SBAR
              (WHNP (WP who))
              (S
                (NP (NNP Mary))
                (VP (VBZ likes))))))))))

(3) Output of my Lisp post-processor for this parse tree:

(S --> NP VP)             3
(NP --> NNP)              2
(VP --> VBZ)              1
(WHNP --> WP)             1
(SBAR --> WHNP S)         1
(VP --> VBZ SBAR)         1
(NP --> PRP)              1
(SBAR --> IN S)           1
(VP --> VBD SBAR)         1
(ROOT --> S)              1

Note the lack of punctuation in sentence (1). That's intentional. I am having trouble parsing punctuation in Lisp, precisely because some punctuation characters (commas, for example) are reserved for special purposes. But parsing sentences without punctuation changes the distribution of the parse rules as well as the symbols contained in those rules, as illustrated by the following:

(4) Input sentence:

I said no and then I did it anyway

(5) Parser output:

(ROOT
  (S
    (NP (PRP I))
    (VP (VBD said)
      (ADVP (RB no)
        (CC and)
        (RB then))
      (SBAR
        (S
          (NP (PRP I))
          (VP (VBD did)
            (NP (PRP it))
            (ADVP (RB anyway))))))))

(6) Input sentence (with punctuation):

I said no, and then I did it anyway.

(7) Parser output:

 (ROOT
   (S
     (S
       (NP (PRP I))
       (VP (VBD said)
         (INTJ (UH no))))
     (, ,)
     (CC and)
     (S
       (ADVP (RB then))
       (NP (PRP I))
       (VP (VBD did)
         (NP (PRP it))
         (ADVP (RB anyway))))
     (. .)))

Note how including punctuation completely rearranges the parse tree and also involves different POS tags, which implies that different grammar rules were invoked to produce it. Including punctuation is therefore important, at least for my application.

What I need is a way to include punctuation in rules, so that I can produce rules like the following, which would appear in a table like (3):

(8) Desired rule:

S --> S , CC S .

Rules like (8) are in fact desired for the specific application I am writing.

But I am finding that doing this in Lisp is difficult. In (7), for example, we observe the appearance of (, ,) and (. .), neither of which the Lisp reader accepts as written, because the comma and the dot have reserved meanings in its syntax.

I have included my relevant Lisp code below. Please note that I'm a neophyte Lisp hacker, so my code isn't particularly pretty or efficient. If someone could suggest how I might modify the code below so that I can parse (7) to produce a table like (3) that includes a rule like (8), I would be most appreciative.

Here is my Lisp code relevant to this task:

(defun WRITE-RULES-AND-COUNTS-SORTED (sent)
  (multiple-value-bind (rules-list counts-list)
      (COUNT-RULES-OCCURRENCES sent)
    ;; Bind the sorted alist locally instead of SETF-ing an undeclared variable.
    (let ((comblist (sort (pairlis rules-list counts-list) #'> :key #'cdr)))
      (format t "~%")
      ;; Print each rule, then its count starting at column 26.
      (dolist (pair comblist)
        (format t "~A~26T~A~%" (car pair) (cdr pair)))
      (format t "~%"))))


 (defun COUNT-RULES-OCCURRENCES (sent)
   (let* ((original-rules-list (EXTRACT-GRAMMAR sent))
          (de-duplicated-list (remove-duplicates original-rules-list :test #'equalp))
          (count-list nil))
     (dolist (i de-duplicated-list)
       (push (reduce #'+ (mapcar #'(lambda (x) (if (equalp x i) 1 0)) original-rules-list) ) count-list))
     (setf count-list (nreverse count-list))
    (values de-duplicated-list count-list)))


 (defun EXTRACT-GRAMMAR (sent &optional (rules-stack nil))
   (cond ((null sent) 
          NIL)
         ((and (= (length sent) 1)
               (listp (first sent))
               (= (length (first sent)) 2)
               (symbolp (first (first sent)))
               (symbolp (second (first sent))))
          NIL)
         ((and (symbolp (first sent)) 
               (symbolp (second sent)) 
               (= 2 (length sent)))
          NIL)
         ((symbolp (first sent))
          (push (EXTRACT-GRAMMAR-RULE sent) rules-stack)
          (append rules-stack (EXTRACT-GRAMMAR (rest sent)   )))
         ((listp (first sent))
          (cond ((not (and (listp (first sent)) 
                           (= (length (first sent)) 2) 
                           (symbolp (first (first sent))) 
                           (symbolp (second (first sent)))))
                 (push (EXTRACT-GRAMMAR-RULE (first sent)) rules-stack)
                 (append rules-stack (EXTRACT-GRAMMAR (rest (first sent))) (EXTRACT-GRAMMAR (rest sent) )))
               (t (append rules-stack (EXTRACT-GRAMMAR (rest sent)  )))))))


(defun EXTRACT-GRAMMAR-RULE (sentence-or-phrase)
  (append (list (first sentence-or-phrase))
          '(-->)
          (mapcar #'first (rest sentence-or-phrase))))

The code is invoked as follows (using (1) as input, producing (3) as output):

(WRITE-RULES-AND-COUNTS-SORTED  '(ROOT
  (S
    (NP (NNP John))
    (VP (VBD said)
      (SBAR (IN that)
        (S
          (NP (PRP he))
          (VP (VBZ knows)
            (SBAR
              (WHNP (WP who))
              (S
                (NP (NNP Mary))
                (VP (VBZ likes)))))))))))
user3990797

2 Answers


S-expressions in Common Lisp

In Common Lisp s-expressions, characters such as , and . (among others) are part of the default reader syntax.

If you want symbols with arbitrary names in Lisp s-expressions, you have to escape them. Either use a backslash to escape single characters or use a pair of vertical bars to escape multiple characters:

CL-USER 2 > (loop for symbol in '(\, \. | a , b , c .|)
                  do (describe symbol))

\, is a SYMBOL
NAME          ","
VALUE         #<unbound value>
FUNCTION      #<unbound function>
PLIST         NIL
PACKAGE       #<The COMMON-LISP-USER package, 76/256 internal, 0/4 external>

\. is a SYMBOL
NAME          "."
VALUE         #<unbound value>
FUNCTION      #<unbound function>
PLIST         NIL
PACKAGE       #<The COMMON-LISP-USER package, 76/256 internal, 0/4 external>

| a , b , c .| is a SYMBOL
NAME          " a , b , c ."
VALUE         #<unbound value>
FUNCTION      #<unbound function>
PLIST         NIL
PACKAGE       #<The COMMON-LISP-USER package, 76/256 internal, 0/4 external>
NIL
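
Applied to the parse tree in (7), the same escaping lets the standard reader accept the tree. The following is a minimal sketch, assuming the default readtable (which upcases unescaped symbols). The variable names *tree-7-text*, *tree-7* and *s-children* are hypothetical, and the tree is kept in a string only so that it can be handed to read-from-string:

```lisp
;; Hypothetical example: the parser output from (7), with the
;; punctuation symbols wrapped in vertical bars so READ accepts them.
(defparameter *tree-7-text*
  "(ROOT
     (S
       (S (NP (PRP I)) (VP (VBD said) (INTJ (UH no))))
       (|,| |,|)
       (CC and)
       (S (ADVP (RB then)) (NP (PRP I))
          (VP (VBD did) (NP (PRP it)) (ADVP (RB anyway))))
       (|.| |.|)))")

;; Read the text into an ordinary list structure.
(defparameter *tree-7* (read-from-string *tree-7-text*))

;; The children of the top-level S node.
(defparameter *s-children* (rest (second *tree-7*)))

;; Collect the label of each child as a string.
(mapcar (lambda (child) (symbol-name (first child))) *s-children*)
```

The labels collected in the last form are exactly the right-hand side of the desired rule (8), with the comma and the period represented as ordinary symbols.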

Tokenizing / Parsing

If you want to deal with other input formats and not s-expressions, you might want to tokenize / parse the input yourself.

Primitive example:

CL-USER 11 > (mapcar (lambda (string)
                       (intern string "CL-USER"))
                     (split-sequence " " "S --> S , CC S ."))
(S --> S \, CC S \.)
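
Note that split-sequence is not part of standard Common Lisp; the call above relies on a function supplied by the implementation or by a library. A portable sketch of the same idea, using only standard functions, might look like the following. The names space-p, split-on-spaces and tokens-to-symbols are hypothetical helpers introduced for illustration:

```lisp
(defun space-p (char)
  "True if CHAR is the space character."
  (char= char #\Space))

(defun split-on-spaces (text)
  "Return the non-empty, space-separated substrings of TEXT."
  (loop with len = (length text)
        ;; START is the index of the next non-space character, or NIL.
        for start = (position-if-not #'space-p text)
          then (position-if-not #'space-p text :start end)
        ;; END is the index just past the current token.
        for end = (and start
                       (or (position-if #'space-p text :start start) len))
        while start
        collect (subseq text start end)))

(defun tokens-to-symbols (text)
  "Intern each space-separated token of TEXT in the current package."
  (mapcar (lambda (token) (intern token))
          (split-on-spaces text)))
```

Because intern uses the token string as the symbol name verbatim, no escaping is needed for , or . here. It also means that case is preserved, so a lowercase token yields a symbol with a lowercase name.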
Rainer Joswig

UPDATE:

Thank you, Dr. Joswig, for your comments and for your code demo. Both were quite helpful.

In the above question, I wanted to work around (or at least accommodate) the fact that , and . are part of Lisp's default syntax. I ended up writing the function PRODUCE-PARSE-TREE-WITH-PUNCT-FROM-FILE-READ. It reads one parse tree from a file as a series of strings, trims leading and trailing spaces from each string, and concatenates the strings into a single string representation of the parse tree. It then scans this string character by character and wraps each punctuation character it finds in vertical bars, which implements Dr. Joswig's suggestion. Finally, the modified string is read back in as a tree (list representation) and passed to the extractor to produce the table of rules and counts. To implement this, I cobbled together bits of code found elsewhere on StackOverflow along with my own original code. The result is below; it is only a demo, so not all punctuation is handled:

(defun PRODUCE-PARSE-TREE-WITH-PUNCT-FROM-FILE-READ (file-name)
  (let ((result (make-array 1 :element-type 'character :fill-pointer 0 :adjustable T))
        (list-of-strings-to-process (mapcar #'(lambda (x) (string-trim " " x)) 
                                      (GET-PARSE-TREE-FROM-FILE file-name)))
        (concatenated-string nil)
        (punct-list '(#\, #\. #\; #\: #\! #\?))
        (testchar nil)
        (string-length 0))
    (setf concatenated-string (format nil "~{ ~A~}" list-of-strings-to-process))
    (setf string-length (length concatenated-string))
    (do ((i 0 (incf i)))
        ((= i string-length) NIL)
      (setf testchar (char concatenated-string i))
      (cond ((member testchar punct-list)
             (vector-push-extend #\| result)
             (vector-push-extend testchar result)
             (vector-push-extend #\| result))
            (t (vector-push-extend testchar result))))
    ;; RESULT is already in input order, so no reversal is needed before reading.
    (with-input-from-string (s result)
      (loop for x = (read s nil :end) until (eq x :end) collect x))))


(defun GET-PARSE-TREE-FROM-FILE (file-name)
  (with-open-file (stream file-name)
    (loop for line = (read-line stream nil)
        while line
        collect line)))

Note that GET-PARSE-TREE-FROM-FILE reads only one tree from a file that consists of only one tree. These two functions are not, of course, ready for prime-time!
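
The escaping step can also be sketched over a plain string, which makes it easy to try at the REPL and lets the reader loop pick up several trees in a row. This is only an illustration under the same assumptions as the function above (the same six punctuation characters); the names *punctuation*, escape-punctuation and read-all-trees are hypothetical:

```lisp
;; Characters to wrap in vertical bars; same set as PUNCT-LIST above.
(defparameter *punctuation* '(#\, #\. #\; #\: #\! #\?))

(defun escape-punctuation (text)
  "Return a copy of TEXT with each punctuation character wrapped in vertical bars."
  (with-output-to-string (out)
    (loop for char across text
          do (if (member char *punctuation*)
                 (format out "|~C|" char)
                 (write-char char out)))))

(defun read-all-trees (text)
  "Escape punctuation in TEXT, then read every s-expression it contains."
  (with-input-from-string (in (escape-punctuation text))
    (loop for tree = (read in nil :end)
          until (eq tree :end)
          collect tree)))
```

For example, (read-all-trees "(S (NP (PRP I)) (. !)) (S (, ,))") returns a list of two trees in which the punctuation tags and tokens are ordinary symbols.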

And finally, a parse tree containing (Lisp-reserved) punctuation can be processed--and thus the original goal met--as follows (user supplies the filename containing one parse tree):

 (WRITE-RULES-AND-COUNTS-SORTED 
              (PRODUCE-PARSE-TREE-WITH-PUNCT-FROM-FILE-READ filename))

The following output is produced:

(NP --> PRP)                  3
(PP --> IN NP)                2
(VP --> VB PP)                1
(S --> VP)                    1
(VP --> VBD)                  1
(NP --> NN CC NN)             1
(ADVP --> RB)                 1
(PRN --> , ADVP PP ,)         1
(S --> PRN NP VP)             1
(WHADVP --> WRB)              1
(SBAR --> WHADVP S)           1
(NP --> NN)                   1
(NP --> DT NN)                1
(ADVP --> NP IN)              1
(VP --> VBD ADVP NP , SBAR)   1
(S --> NP VP)                 1
(S --> S : S .)               1
(ROOT --> S)                  1

That output was the result of using the following input (saved as filename):

(ROOT
  (S
    (S
      (NP (PRP It))
      (VP (VBD was)
        (ADVP
          (NP (DT the) (NN day))
          (IN before))
        (NP (NN yesterday))
        (, ,)
        (SBAR
          (WHADVP (WRB when))
          (S
            (PRN (, ,)
              (ADVP (RB out))
              (PP (IN of)
                (NP (NN happiness)
                  (CC and)
                  (NN mirth)))
              (, ,))
            (NP (PRP I))
            (VP (VBD decided))))))
    (: :)
    (S
      (VP (VB go)
        (PP (IN for)
          (NP (PRP it)))))
    (. !)))
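
For comparison, rule extraction and counting can also be sketched more compactly with a recursive walk and a hash table. This is not the code used to produce the table above; it is an alternative under the assumption that every preterminal has the shape (TAG word) with an atomic word, and every phrasal node has a list as its first child. The names collect-rules and count-rules are hypothetical:

```lisp
(defun collect-rules (tree)
  "Return a list of (LHS --> RHS...) rules, one per phrasal node in TREE."
  (if (and (consp tree) (consp (second tree)))
      ;; Phrasal node: emit its own rule, then recurse into the children.
      (cons (append (list (first tree) '-->)
                    (mapcar #'first (rest tree)))
            (mapcan #'collect-rules (rest tree)))
      ;; Preterminal such as (NNP John) or (|,| |,|): no rule to emit.
      nil))

(defun count-rules (tree)
  "Return an alist of (RULE . COUNT) entries, most frequent first."
  (let ((counts (make-hash-table :test #'equal)))
    ;; EQUAL compares rules element by element, so equal rules share a key.
    (dolist (rule (collect-rules tree))
      (incf (gethash rule counts 0)))
    (sort (loop for rule being the hash-keys of counts
                  using (hash-value count)
                collect (cons rule count))
          #'> :key #'cdr)))
```

Each rule is counted in a single pass over the rule list, rather than by rescanning the full list once per distinct rule. The order of entries with equal counts is not specified.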
user3990797