I'm trying to parse some PDF's, extract the tabular data and output them into JSON files. I'm using regex to search for column values under "Account" and "Allocations". What regex should I use instead? It needs to be general enough to work for all three PDF's. This is an example of the data I'm working on:
PDF 1:
Accounts | Allocation |
---|---|
1 XYZ Corp | 6.00 |
2 BCF | 3.00 |
3 Barings | 2.50 |
PDF 2:
Account | Allocation |
---|---|
1 Amep | $2.0 |
2 Asset Pioneer | $13.0 |
3 Creed Partners | $35.5 |
PDF 3:
Lender | Allocation |
---|---|
15/92 Advertisement Inc. | 500.0k |
FC LC-New York | 2.0m |
ABE PARTNERS INC | 5.0m |
I have these regex patterns, one for company_regex and one for allocation_regex.
company_regex=re.compile(r'^\s*(\d+)\s+[A-Z][a-z]+\s*(?:[A-Z][a-z]+)*)\s+(\$d+(?:\.\d+))\s*$')
allocation_regex= re.compile(r'\$\d+\.\d+')
These work for just one PDF, but I need them to be able to work for both. I have my code formatted like this so that, once the header is recognized, the next row should be recognized as the start of the data. I'm able to successfully recognize all three headers from all pages in all three PDF's. I'm assuming the problem lies in the regex for company & allocation instead:
for i, line in enumerate(lines):
if header_regex.search(line):
table_start=i+1
elif table_start is not None and not line.strip():
table_end=i
break