Regex Expressions For Different PDF's

Question

I'm trying to parse some PDFs, extract the tabular data, and write it to JSON files. I'm using regex to search for the column values under "Account" and "Allocation". What regex should I use instead? It needs to be general enough to work for all three PDFs. This is an example of the data I'm working on:

PDF 1:

Accounts	Allocation
1 XYZ Corp	6.00
2 BCF	3.00
3 Barings	2.50

PDF 2:

Account	Allocation
1 Amep	$2.0
2 Asset Pioneer	$13.0
3 Creed Partners	$35.5

PDF 3:

Lender	Allocation
15/92 Advertisement Inc.	500.0k
FC LC-New York	2.0m
ABE PARTNERS INC	5.0m

I have these regex patterns, one for company_regex and one for allocation_regex.

company_regex = re.compile(r'^\s*(\d+)\s+([A-Z][a-z]+\s*(?:[A-Z][a-z]+)*)\s+(\$d+(?:\.\d+))\s*$')
allocation_regex = re.compile(r'\$\d+\.\d+')

These work for just one PDF, but I need them to work for all three. My code is structured so that, once the header is recognized, the next row is treated as the start of the data. I can successfully recognize the headers on all pages of all three PDFs, so I assume the problem lies in the regex for company and allocation instead:

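table_start = table_end = None
for i, line in enumerate(lines):
    if header_regex.search(line):
        # The row right after the header is the first data row.
        table_start = i + 1
    elif table_start is not None and not line.strip():
        # The first blank line after the header ends the table.
        table_end = i
        break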

There is no provision in the regex language where "column" has any connotation. — sln, Mar 06 '23 at 21:52

score 0 · Answer 1 · answered Mar 07 '23 at 02:23

It's not clear to me what exactly you mean by company or allocation.

Also note that ...\$d+... should rather be ...\$?\d+... (maybe that's why it "worked for just one PDF"? PDF 2 is the only one with "$"). Likewise, (?:\.\d+) should probably become (?:\.\d+)?, and so on.
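A quick way to see the difference, as a minimal sketch: the names broken, fixed and samples are only illustrative, and the sample strings are shortened allocation values in the style of the three PDFs.

```python
import re

# The fragment from the question: "\$d+" is a literal "$" followed by
# one or more letters "d", so it never matches a number.
broken = re.compile(r'\$d+(?:\.\d+)')

# Suggested fragment: optional "$", real digits, optional decimal part.
fixed = re.compile(r'\$?\d+(?:\.\d+)?')

samples = ['6.00', '$13.0', '500.0k', '7']

broken_hits = [s for s in samples if broken.search(s)]
fixed_hits = [fixed.search(s).group(0) for s in samples]

print(broken_hits)  # []
print(fixed_hits)   # ['6.00', '$13.0', '500.0', '7']
```

Note that this fragment does not consume the "k" or "m" suffix yet.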

Maybe company_regex (or rather "company_allocation"?) could be:

re.compile(r'^\s*(\d+)\s+(?P<company>[A-Z][a-zA-Z.]+(?:\s*[A-Z][a-z.]+)*)' \
    + r'\s+(?P<allocation>\$?\d+(?:\.\d+\s*(?:k|m)?)?)\s*$')

This covers PDF 1 and PDF 2; PDF 3 needs a separate alternative.
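To check the pattern above against the rows from the question, something like the following should do. It is only a sketch: I'm assuming the columns are separated by a tab (any whitespace would work), and row_regex, rows and parsed are names I made up for the test.

```python
import re

# Same pattern as above, written with implicit string concatenation.
row_regex = re.compile(
    r'^\s*(\d+)\s+(?P<company>[A-Z][a-zA-Z.]+(?:\s*[A-Z][a-z.]+)*)'
    r'\s+(?P<allocation>\$?\d+(?:\.\d+\s*(?:k|m)?)?)\s*$'
)

# Rows copied from PDF 1 and PDF 2, plus one PDF 3 row that has no
# leading row number and is therefore expected not to match.
rows = [
    '1 XYZ Corp\t6.00',
    '2 BCF\t3.00',
    '3 Barings\t2.50',
    '1 Amep\t$2.0',
    '2 Asset Pioneer\t$13.0',
    '3 Creed Partners\t$35.5',
    'FC LC-New York\t2.0m',
]

parsed = []
for row in rows:
    m = row_regex.match(row)
    if m:
        parsed.append((m.group('company'), m.group('allocation')))

for company, allocation in parsed:
    print(company, allocation)
```

The six numbered rows are captured; the PDF 3 row is skipped because the pattern requires a leading row number.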

Look at https://regex101.com/r/dlzOQ6/2:

^\s*(\d+)\s+(?P<company>[A-Z][a-zA-Z.]+(?:\s*[A-Z][a-z.]+)*)\s+(?P<allocation>\$?\d+(?:\.\d+\s*(?:k|m)?)?)\s*$
|
(?P<company_>(?:\d+\/\d+|[A-Z]+)\s+[A-Z][a-zA-Z\-.]+(?:\s*[A-Z][a-zA-Z.]+)*)\s+(?P<allocation_>\$?\d+(?:\.\d+\s*(?:k|m)?)?)\s*
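In Python the two alternatives could be combined roughly like this. Treat it as a sketch under assumptions: I use re.VERBOSE so the line breaks around "|" are ignored, the columns are assumed to be separated by whitespace, and parse_row plus the JSON keys "company" and "allocation" are hypothetical names, not something from your code. Group names have to be unique in Python, hence company_ and allocation_ in the second alternative.

```python
import json
import re

combined = re.compile(r'''
    ^\s*(\d+)\s+
    (?P<company>[A-Z][a-zA-Z.]+(?:\s*[A-Z][a-z.]+)*)
    \s+(?P<allocation>\$?\d+(?:\.\d+\s*(?:k|m)?)?)\s*$
  |
    (?P<company_>(?:\d+\/\d+|[A-Z]+)\s+[A-Z][a-zA-Z\-.]+(?:\s*[A-Z][a-zA-Z.]+)*)
    \s+(?P<allocation_>\$?\d+(?:\.\d+\s*(?:k|m)?)?)\s*
''', re.VERBOSE)


def parse_row(line):
    """Return a dict for a data row, or None if the line does not match."""
    m = combined.match(line)
    if m is None:
        return None
    # Only one alternative matches, so the other pair of groups is None.
    return {
        'company': m.group('company') or m.group('company_'),
        'allocation': m.group('allocation') or m.group('allocation_'),
    }


rows = [
    '1 XYZ Corp\t6.00',
    '2 Asset Pioneer\t$13.0',
    '15/92 Advertisement Inc.\t500.0k',
    'FC LC-New York\t2.0m',
    'ABE PARTNERS INC\t5.0m',
    'Lender\tAllocation',  # header line, expected to be skipped
]

records = [r for r in map(parse_row, rows) if r is not None]
print(json.dumps(records, indent=2))
```

The allocation is kept as the raw string ("$13.0", "500.0k"); converting it to a number depends on what the units mean in each PDF, which the question does not say.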
