
I have hundreds of company report .txt files, and I want to extract some information from them. For example, one part of a file looks like this:

Mr. Davido will receive a base salary of $700,000 during the initial and any subsequent 
term. The Chief Executive Officer of the Company (the CEO) and the Board (or a committee
thereof) shall review Mr. Davido's base salary at least annually, and may increase it at 
any time in their sole discretion

I am trying to use pyparsing to extract the base salary value of the guy.

code

from pyparsing import * 

# define grammar
digits = "0123456789"
integer = Word( digits )
money = Group("$"+integer+','+integer + Optional(','+integer , ' '))
start = Word("base salary") 
salary = start + money

#search
for t in text:
  result = salary.parseString( text )
print result

This always gives the error:

pyparsing.ParseException: Expected W:(base...) (at char 0), (line:1, col:1)

After some simple tests, I found that with this code I can only match text that starts with the phrase itself, for example:

"base salary $700,000......"

and even then it only identifies the first occurrence in that text.

So I was wondering if someone could help me with this. If possible, I would also like to identify the person's name and store the name and salary in a dataframe.

Thank you so much.

Brad
  • I am going to go ahead and say you can't. Pyparsing is for structured text, whereas what you have is a natural language problem. NLTK may (MAY!) be the tool to use... though the tool I would use is interns. – Tritium21 Oct 12 '14 at 12:22
  • Thanks a lot @Tritium21, I will give NLTK a try. – Brad Oct 12 '14 at 14:30

1 Answer


I'll answer your specific question first. parseString is used when you have defined a comprehensive grammar that will match everything from the beginning of the text. Since you are trying to pick out a specific phrase from somewhere in the middle of the input line, use searchString or scanString instead.
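To make the difference concrete, here is a minimal sketch of the same search written with searchString and scanString. The grammar is an illustration rather than a tested solution for real filings: it assumes the amount follows "base salary" directly, optionally separated by the word "of", and the names money, phrase and salary are chosen here for the example. Note that Word("base salary") in the question matches any run of the characters b, a, s, e, space, l, r and y, not the phrase itself, so a literal is used instead.

```python
from pyparsing import (CaselessLiteral, Combine, Literal, Optional, Word,
                       ZeroOrMore, nums)

text = ("Mr. Davido will receive a base salary of $700,000 during the initial "
        "and any subsequent term. The Board shall review Mr. Davido's base "
        "salary at least annually.")

# A dollar amount such as $700,000 or $1,250,000, joined into a single token.
money = Combine(Literal("$") + Word(nums) + ZeroOrMore(Literal(",") + Word(nums)))

# Match the phrase as literal text (assumption: an optional "of" may follow it).
phrase = CaselessLiteral("base salary") + Optional(Literal("of"))

# Drop the phrase from the results and keep only the amount, under a name.
salary = phrase.suppress() + money("amount")

# searchString scans the whole input and returns every match it finds.
for match in salary.searchString(text):
    print(match.amount)

# scanString yields the same matches together with their start/end offsets.
for tokens, start, end in salary.scanString(text):
    print(tokens.amount, start, end)
```

The second "base salary" in the sample is not followed by an amount, so it is simply skipped; unlike parseString, neither method raises an exception for text that does not match.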

As pyparsing's author, I will concur with @Tritium21 - unless there are some specific forms and phrases that you can look for, you will tear your hair out trying to parse this kind of natural language input.
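If the reports do share one fixed phrasing, a sketch along the following lines could pull out both the name and the amount. It swaps pyparsing for the standard-library re module, and it assumes a hypothetical wording of the form "<title> <surname> will receive a base salary of $<amount>"; PATTERN and extract_salaries are names made up for this example, and real filings will likely need more variants.

```python
import re

# Hypothetical fixed phrasing; \s+ lets the phrase wrap across line breaks.
PATTERN = re.compile(
    r"(?P<name>(?:Mr|Ms|Mrs|Dr)\.\s+[A-Z][\w'-]*)\s+will\s+receive\s+a\s+"
    r"base\s+salary\s+of\s+\$(?P<salary>\d{1,3}(?:,\d{3})*)"
)


def extract_salaries(text):
    """Return one {'name', 'salary'} record per match; salary is an int in dollars."""
    return [
        {"name": m.group("name"), "salary": int(m.group("salary").replace(",", ""))}
        for m in PATTERN.finditer(text)
    ]


sample = ("Mr. Davido will receive a base salary of $700,000 during the initial "
          "and any subsequent term.")
records = extract_salaries(sample)
print(records)
```

A list of dicts like records can be passed straight to pandas.DataFrame(records) if a dataframe is the end goal. Sentences that phrase the salary differently will not match, which is exactly the limitation described above.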

PaulMcG