Beginner Regex problem with whitespace and backtracking

Question

I'm trying to extract data from a PDF that is laid out as a table, with columns such as name, country, and various numeric fields.

My problem is that the names and countries vary in length. I'm also not sure how to capture the numbers, as whatever I try misses the first digit.

e.g.

Sean O'Hair United States 2.758 137.906 50 -7.525 0.000
Y.E. Yang Korea 2.734 153.128 56 -6.722 0.000
Bo Van Pelt United States 2.733 153.056 56 -4.895 0.000

If you're still working on this, posting example code would be helpful. It's much easier to debug regex problems when we can see the regex that you're using. — Dave DuPlantis, Mar 07 '12 at 14:45

Peter Boughton · Answer 1 · 2013-09-23T18:00:22.897

Unlikely this is still a problem given how old it is, but it's listed as unanswered so for the benefit of anyone with a similar problem...

Here's a quick pattern that'll extract all matches into an array - it may or may not need to be made more flexible:

<cfset Matches = rematch( '\D+ \d\.\d{3} \d+\.\d{3} \d\d -\d\.\d{3} 0\.000' , Input ) />

Then looping through those results, for each match you can separate the name+country from the numbers with:

<cfset FirstDigit = refind( '\d' , CurMatch ) />
<cfset NameAndCountry = trim(Left( CurMatch , FirstDigit-1 )) />
<cfset Numbers = Right( CurMatch , Len(CurMatch)-FirstDigit+1 ) />

Extracting the countries from the names is not simple - there aren't really any rules for which is which, so it needs a set of countries to loop through and check against, something like:

<cfloop index="CurCountry" array="#Countries#" >
    <cfif NameAndCountry.endsWith( CurCountry ) >
        <cfset Name = trim(Left( NameAndCountry , Len(NameAndCountry)-Len(CurCountry) )) />
        <cfbreak />
    </cfif>
</cfloop>

For the numbers, using ListToArray with space as delimiter can separate them.

JasonWoof · Answer 2 · 2011-03-17T22:59:45.593

If you pipe your example data through:

sed -e 's/^[^0-9]*//'

it'll strip all the non-number characters from the beginning. Does that help?

P.S. Splitting the name from the country would be tricky, since it looks like there's just a space between, and there's also spaces in the middle of names and countries.

EDIT: Oops, that would remove a minus sign from the first number. Probably better to only remove words (runs of non-digit, non-space characters followed by a space):

sed -e 's/^\([^0-9 ]* \)*//'
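
To see the difference between the two expressions, here is a minimal sketch that runs both over the rows from the question plus one made-up row (not in the original data) whose first number is negative. The variable names rows, first, and second are only for illustration, and it assumes a POSIX shell with sed available.

```shell
# Rows from the question, plus one hypothetical row (not in the original
# data) whose first number is negative.
rows="Sean O'Hair United States 2.758 137.906 50 -7.525 0.000
Y.E. Yang Korea 2.734 153.128 56 -6.722 0.000
Test Player Spain -1.234 10.000 5 -0.500 0.000"

# First expression: strips every leading non-digit character,
# so a minus sign in front of the first number is lost.
first=$(printf '%s\n' "$rows" | sed -e 's/^[^0-9]*//')

# Second expression: strips only leading words (runs of non-digit,
# non-space characters followed by a space), so the minus sign survives.
second=$(printf '%s\n' "$rows" | sed -e 's/^\([^0-9 ]* \)*//')

printf '%s\n' "$first"
printf '%s\n' "$second"
```

On the made-up row, the first expression leaves 1.234 at the start of the line, while the second keeps -1.234. Either regex should carry over to another engine once it is rewritten in that engine's syntax.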

edited Mar 17 '11 at 22:59

answered Mar 17 '11 at 22:42


Thanks for your suggestion. I'm not sure it helps, as I'm using coldfusion not perl, and it appears to use a different technique. The original pdf had a greater number of spaces between country and name, so I thought I would be able to use that to ensure that I was extracting the correct data, but when I extracted the data there was only one space between each word/number. I'm not sure if there is a way around that in the cfpdf tag I used to extract the text. – pssguy Mar 19 '11 at 02:33
huh? perl? I'm not using perl either. – JasonWoof Apr 05 '11 at 11:31
sed is a commandline program that does regex replacing among other things. the point is the expressions in my examples. – JasonWoof Apr 05 '11 at 11:32
OK I see. Unix right. I'm really trying to set up a script in coldfusion that loops through a series of pdfs so not sure how the sed program would help – pssguy May 04 '11 at 17:50
sed is just one of many many programs that support regular expressions. Presumably there is some way to do regular expressions in coldfusion. You said you wanted help with regular expressions, so I posted one that seems to work. I passed my expressions to sed in my example code because I figured that would be how you would test your expressions, before pasting them into your coldfusion or whatever. – JasonWoof May 05 '11 at 01:39
