
I have a PDF document with the following (simplified) content.

Running pdftotext mypdf.pdf -layout generates:

Contact                     
myemail@domain.com

I have written a Python script that takes "column-like" input and parses the file accordingly. Consider the columns below:

  1. 11.54
  2. 33.92

The values above are percentages of the page width: 11.54 means 11.54% from the left edge of the PDF page.
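To make the percentages concrete, here is a minimal sketch of how they map to the crop boxes used in the script further down. It assumes a page of 612 x 792 points and that each percentage marks the right edge of its column; column_areas is a hypothetical helper name used only for illustration.

```python
PAGE_WIDTH_PT = 612   # assumption: page width in points
PAGE_HEIGHT_PT = 792  # assumption: page height in points, as used in the script

def column_areas(percentages, page_width=PAGE_WIDTH_PT, page_height=PAGE_HEIGHT_PT):
    """Turn right-edge percentages into (x, y, width, height) boxes in points."""
    areas = []
    left = 0
    for pct in percentages:
        # Right edge of this column, rounded to a whole point.
        right = round(page_width * pct / 100)
        areas.append((left, 0, right - left, page_height))
        left = right  # the next column starts where this one ends
    return areas

print(column_areas([11.54, 33.92]))
# [(0, 0, 71, 792), (71, 0, 137, 792)]
```

Under these assumptions the result matches the two hardcoded areas in the script.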

This is my Python script:


import subprocess
from collections import defaultdict
from pprint import pprint

column = defaultdict(list)
pdf_file = "mypdf.pdf"

# One (x, y, width, height) crop box per column, in points; hardcoded for simplicity.
areas = [(0, 0, 71, 792), (71, 0, 137, 792)]

for i, (x, y, w, h) in enumerate(areas):
    # Extract page 1 only, cropped to this column's box, and write the text to stdout.
    cmd = ['pdftotext', '-f', '1', '-l', '1',
           '-x', str(x), '-y', str(y), '-W', str(w), '-H', str(h),
           pdf_file, '-layout', '-']

    proc = subprocess.run(cmd, stdout=subprocess.PIPE, text=True, check=True)

    for line in proc.stdout.splitlines():
        column[i + 1].append({"row": line})

Pretty-printing column then gives:

pprint(column)

defaultdict(<class 'list'>,
           {1: [{'row': 'Contact'},
               {'row': 'myemail'}],
            2: [{'row': '@domain.com'}]})

As you can see, the word Contact is added as the first row and myemail as the second row, as it should be. However, when the script processes column 2, it adds @domain.com as the first row, even though the full pdftotext output above shows @domain.com on line two.

This happens because I pass coordinates to pdftotext with -x, -y, -W and -H, so each column's output starts again at line one, with no blank row for the empty space above @domain.com.

The expected result:

defaultdict(<class 'list'>,
           {1: [{'row': 'Contact'},
               {'row': 'myemail'}],
            2: [{'row': '\n'},
               {'row': '@domain.com'}
               ]})

Is there a way to make the script detect dynamically that there is a line break here, for example by looking at the full -layout output?

Edit

For clarity, here is the output for each column:

column 1:

Contact
myemail

column 2:

@domain.com

I was wondering whether a solution could be to check each column's result against the full-text result, line by line. For example, something like:

data = '''Contact
myemail@domain.com'''

for line in data.splitlines():
    # Compare each line of column X against this line of the full text.
    ...
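
A rough sketch of that comparison, to show what I have in mind. It assumes every column fragment appears verbatim inside exactly one line of the full text and that fragments come in top-to-bottom order; align_column is a hypothetical helper name, and empty rows are represented as '' rather than '\n'.

```python
def align_column(full_lines, column_lines):
    """Place each column fragment on the row of the full text that contains it.

    Assumption: fragments appear in order, and each one is a substring of a
    full-text line at or after the previous match. Unmatched rows stay empty.
    """
    aligned = [""] * len(full_lines)
    row = 0
    for fragment in column_lines:
        text = fragment.strip()
        if not text:
            continue
        # Search forward from the last matched row so the order is preserved.
        while row < len(full_lines) and text not in full_lines[row]:
            row += 1
        if row == len(full_lines):
            break  # fragment not found; leave the remaining rows empty
        aligned[row] = text
        row += 1
    return aligned

full = """Contact
myemail@domain.com""".splitlines()

print(align_column(full, ["Contact", "myemail"]))  # ['Contact', 'myemail']
print(align_column(full, ["@domain.com"]))         # ['', '@domain.com']
```

This would probably break if the same fragment occurs on several lines, or if -layout spacing differs between the cropped and the full output, so it is only a starting point.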
oliverbj