parsing unstructured text using pyparsing in Python

Question

I have hundreds of company report .txt files, and I want to extract some information from them. For example, one part of a file looks like this:

Mr. Davido will receive a base salary of $700,000 during the initial and any subsequent 
term. The Chief Executive Officer of the Company (the CEO) and the Board (or a committee
thereof) shall review Mr. Davido's base salary at least annually, and may increase it at
any time in their sole discretion

I am trying to use pyparsing to extract the base salary value of the guy.

code

from pyparsing import * 

# define grammar
digits = "0123456789"
integer = Word( digits )
money = Group("$"+integer+','+integer + Optional(','+integer , ' '))
start = Word("base salary") 
salary = start + money

#search
for t in text:
  result = salary.parseString( text )
print result

This always gives the error:

pyparsing.ParseException: Expected W:(base...) (at char 0), (line:1, col:1)

After some simple tests, I found that with this code I can only match text that starts with the phrase itself, for example:

"base salary $700,000......"

and it only identifies the first occurrence in that text.

So I was wondering if someone could help me with this. If possible, I would also like to identify the person's name and store the name and salary in a dataframe.

Thank you so much.

I am going to go ahead and say you can't. Pyparsing is for structured text, whereas what you have is a natural language problem. NLTK may (MAY!) be the tool to use... though the tool I would use is interns. — Tritium21, Oct 12 '14 at 12:22

score 1 · Accepted Answer · answered Oct 12 '14 at 13:56

I'll answer your specific question first. parseString is used when you have defined a comprehensive grammar that will match everything from the beginning of the text. Since you are trying to pick out a specific phrase from somewhere in the middle of the input line, use searchString or scanString instead.
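As a minimal sketch of that suggestion (not a full solution), the snippet below runs searchString and scanString over the sample paragraph from the question. The money expression here is my own simplified version of the one in the question, and the variable names are only illustrative.

```python
from pyparsing import Combine, Literal, Word, ZeroOrMore, nums

# Sample paragraph from the question, joined into one string.
text = (
    "Mr. Davido will receive a base salary of $700,000 during the initial and any "
    "subsequent term. The Chief Executive Officer of the Company (the CEO) and the "
    "Board (or a committee thereof) shall review Mr. Davido's base salary at least "
    "annually, and may increase it at any time in their sole discretion"
)

# A dollar amount such as $700,000 or $1,250,000, combined into a single token.
integer = Word(nums)
money = Combine(Literal("$") + integer + ZeroOrMore(Literal(",") + integer))

# searchString scans the whole input and returns every match, wherever it starts.
amounts = [tokens[0] for tokens in money.searchString(text)]
print(amounts)

# scanString is the generator form; it also yields each match's start and end offsets.
spans = [(tokens[0], start, end) for tokens, start, end in money.scanString(text)]
print(spans)
```

Unlike parseString, neither call requires the grammar to match at character 0, so the text before and after the amount is simply skipped.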

As pyparsing's author, I will concur with @Tritium21 - unless there are some specific forms and phrases that you can look for, you will tear your hair out trying to parse this kind of natural language input.
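To illustrate what "specific forms and phrases" could look like, here is a hedged sketch that only recognizes one wording, "<title> <surname> will receive a base salary of $<amount>". That wording, the list of titles, and the names record and rows are assumptions made for this example; real reports will phrase things in many other ways that this grammar will silently miss.

```python
from pyparsing import And, Combine, Keyword, Suppress, Word, ZeroOrMore, alphas, nums, oneOf

text = (
    "Mr. Davido will receive a base salary of $700,000 during the initial and any "
    "subsequent term. The Chief Executive Officer of the Company (the CEO) and the "
    "Board (or a committee thereof) shall review Mr. Davido's base salary at least "
    "annually, and may increase it at any time in their sole discretion"
)

# Assumed building blocks: a short list of titles and a single-word surname.
honorific = oneOf("Mr. Ms. Mrs. Dr.")
surname = Word(alphas)
money = Combine("$" + Word(nums) + ZeroOrMore("," + Word(nums)))

# The fixed phrase, word by word, so line breaks between words do not matter.
phrase = Suppress(And([Keyword(w) for w in "will receive a base salary of".split()]))

record = honorific("title") + surname("name") + phrase + money("salary")

# One dict per match: the name as text, the salary as an integer.
rows = [
    {
        "name": match["title"] + " " + match["name"],
        "base_salary": int(match["salary"].replace("$", "").replace(",", "")),
    }
    for match in record.searchString(text)
]
print(rows)
```

A list of dicts like rows can be passed to pandas.DataFrame if a dataframe is the end goal. The second mention of Mr. Davido is not matched because it is not followed by the assumed phrase.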

Thank you so much, Paul. I will try another toolkit. — Brad, Oct 12 '14 at 14:28
