I have a PDF document (simplified here). Running
pdftotext mypdf.pdf -layout
generates:
Contact
myemail@domain.com
I have written a Python script that takes column boundaries as input and parses the file column by column. Consider the columns below:
- 11.54
- 33.92
The values above are percentages of the page width, measured from the left edge, so 11.54 means 11.54% from the left of the PDF page.
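As a minimal sketch of how such percentages could map to pdftotext crop boxes: the helper name column_areas is hypothetical, and the 612 x 792 point page size is an assumption (it happens to reproduce the hardcoded values used in my script). It also assumes pdftotext's default resolution of 72 DPI, so crop units equal PDF points.

```python
def column_areas(boundaries_pct, page_width=612, page_height=792):
    """Turn right-edge percentages into (x, y, width, height) crop boxes.

    page_width and page_height are assumed values in points, not read
    from the PDF.
    """
    areas = []
    left = 0
    for pct in boundaries_pct:
        # Right edge of this column, rounded to a whole point.
        right = round(page_width * pct / 100)
        areas.append((left, 0, right - left, page_height))
        left = right
    return areas


print(column_areas([11.54, 33.92]))
# [(0, 0, 71, 792), (71, 0, 137, 792)]
```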
This is my Python script:
import subprocess
from collections import defaultdict

column = defaultdict(list)
pdf_file = "mypdf.pdf"
# (x, y, width, height) crop box per column, hardcoded for simplicity.
areas = [(0, 0, 71, 792), (71, 0, 137, 792)]
for i, area in enumerate(areas):
    cmd = ['pdftotext', '-f', str(1), '-l', str(1), '-x', str(area[0]), '-y', str(area[1]), '-W', str(area[2]), '-H', str(area[3]), str(pdf_file), '-layout', '-']
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, bufsize=0, text=True)
    out, _ = proc.communicate()  # read everything pdftotext wrote to stdout
    for line in out.splitlines():
        column[i + 1].append({"row": line})
Pretty-printing column with pprint(column) gives:
defaultdict(<class 'list'>,
{1: [{'row': 'Contact'},
{'row': 'myemail'}],
2: [{'row': '@domain.com'}]})
As you can see, the word Contact is added on the first row, and the word myemail is added on the second row, as it should be.
However, when the script iterates over column 2, it adds @domain.com on the first row, even though the first pdftotext output shows that @domain.com is actually on line two.
This happens because I am passing crop coordinates to pdftotext (-x -y -W -H), so the output starts again from line one for each column.
The expected result:
defaultdict(<class 'list'>,
{1: [{'row': 'Contact'},
{'row': 'myemail'}],
2: [{'row': '\n'},
{'row': '@domain.com'}
]})
Is there a way for the script to detect dynamically that a column has an empty line at a given position, for example by looking at the full -layout output?
Edit
For clarity, let me add the result for each column iterated:
column 1:
Contact
myemail
column 2:
@domain.com
I was wondering whether a solution could be to check the results of each column against the full text result, line by line. For example, something like:
data = '''Contact
myemail@domain.com'''
for line in data.splitlines():
    # compare each line of column X against this line of the full text
    pass
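To make the idea concrete, here is a rough sketch of that comparison. The function name align_to_layout is hypothetical, and the matching rule is an assumption on my part: a column fragment is placed on the first remaining full-text line that contains it as a substring, and every other line gets an empty row. It uses "" for an empty row rather than '\n'.

```python
def align_to_layout(full_lines, fragments):
    """Pad one column's fragments so they line up with the full-layout lines.

    full_lines: lines of the uncropped `pdftotext -layout` output.
    fragments:  lines returned for one cropped column.
    """
    # Ignore blank fragments and surrounding whitespace before matching.
    pending = [f.strip() for f in fragments if f.strip()]
    rows = []
    for line in full_lines:
        if pending and pending[0] in line:
            # The next fragment appears on this line, so it belongs here.
            rows.append(pending.pop(0))
        else:
            # Nothing from this column on this line: emit an empty row.
            rows.append("")
    return rows


full = ["Contact", "myemail@domain.com"]
print(align_to_layout(full, ["Contact", "myemail"]))  # ['Contact', 'myemail']
print(align_to_layout(full, ["@domain.com"]))         # ['', '@domain.com']
```

This is only a heuristic: it can misplace a fragment whose text also appears on an earlier line, and it may miss a match when -layout spaces the text differently in the cropped and full output.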