
How can I extract text from PDF files in the format below? PyPDF2 does not extract the text in a properly readable format.

[screenshot: source PDF layout]

I have explored PyPDF2 and Pandas. Both are able to extract the data, but the data is stored as a single column. I need to store the extracted data as CSV files in the desired format below.

[screenshot: desired CSV output]

This is what I have tried:

import PyPDF2
import openpyxl

from openpyxl import Workbook

pdfFileObj = open('sample.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)  # PdfReader() in PyPDF2 >= 3.0
pdfReader.numPages  # number of pages in the document

pageObj = pdfReader.getPage(0)
mytext = pageObj.extractText()  # pages[0].extract_text() in PyPDF2 >= 3.0


wb = Workbook()
sheet = wb.active
sheet.title = 'MyPDF'
sheet['A1'] = mytext

wb.save('sample.xlsx')
print('Save')

https://drive.internxt.com/s/file/88ee48b3286a67a853f5/d681ac60df525bbd5fadc14eb433f363b146fbd0065552a8c7ebdea8cc776ed9

kkk
3 Answers


Here is a clean and elegant automation script for your use case. I have made some assumptions about your PDF:

  1. Let's say I am treating each record as an entity containing the following fields:

[screenshot: fields of one record]

Line 1

Our Ref:
Name: 
Ref 1:
Ref 2:

Line 2

Amount: 
Total Paid:
Balance: 
Date of A/C: 
Date Received: 

Line 3

Last Paid:
Amt Last Paid: 
A/C Status:
Collector : 

Line 4

Date (Column name)    
Notes (Column name)

And then multiple line of Date and Notes values
  2. I assume that each record is separated by one blank line, as shown here.

[screenshot: records separated by a blank line]

  3. Also, there will be no other attributes apart from the ones listed above in your data.

  4. Also, there will always be 3 lines of actual key:value pairs before the table for Date and Notes starts.

CODE

Before you move ahead, please install the package pdfplumber: `pip install pdfplumber`

All you need to change in the code below is `pdf_path`.

import pdfplumber
import re

# regex pattern for keys in line1
my_regex_dict_line1 = {
    'Our Ref' : r'Our Ref :(.*?)Name',
    'Name' : r'Name:(.*?)Ref 1',
    'Ref 1' : r'Ref 1 :(.*?)Ref 2',
    'Ref 2' : r'Ref 2:(.*?)$'
}

# regex pattern for keys in line2
my_regex_dict_line2 = {
    'Amount' : r'Amount:(.*?)Total Paid',
    'Total Paid' : r'Total Paid:(.*?)Balance',
    'Balance' : r'Balance:(.*?)Date of A/C',
    'Date of A/C' : r'Date of A/C:(.*?)Date Received',
    'Date Received' : r'Date Received:(.*?)$'
}

# regex pattern for keys in line3
my_regex_dict_line3 ={
    'Last Paid' : r'Last Paid:(.*?)Amt Last Paid',
    'Amt Last Paid' : r'Amt Last Paid:(.*?)A/C Status',
    'A/C Status': r'A/C Status:(.*?)Collector',
    'Collector' : r'Collector :(.*?)$'
}


def split_on_empty_lines(s):
    ''' This function splits the pdf text into data chunks and returns them as a list '''
    blank_line_regex = r"\n *\n{1}"
    return re.split(blank_line_regex, s.strip())


def iterate_through_regex_and_populate_dictionaries(data_dict, regex_dict, text):
    ''' For the given regex_dict, this function iterates through each regex pattern and adds the matched key/value pair to the data_dict dictionary '''
    for key, regex in regex_dict.items():
        matched_value = re.search(regex, text)
        if matched_value is not None:
            data_dict[key] = matched_value.group(1).strip()

def populate_date_notes(data_dict, text):
    ''' This function populates the Date and Notes values of the data chunk as lists in the data_dict dictionary '''
    data_dict['Date'] = []
    data_dict['Notes'] = []
    idx = 4  # table rows start after the 3 key:value lines and the Date/Notes header
    while idx < len(text):
        date_match = re.search(r'(\d{2}/\d{2}/\d{4})', text[idx])
        notes_match = re.search(r'\d{2}/\d{2}/\d{4}\s*(.*?)$', text[idx])
        if date_match is not None and notes_match is not None:
            data_dict['Date'].append(date_match.group(1).strip())
            data_dict['Notes'].append(notes_match.group(1).strip())
        idx += 1

if(__name__ == '__main__'):
    pdf_path = r'C:\Users\hpoddar\Desktop\Temp\sample.pdf' # ENTER YOUR PDF PATH HERE
    pdf_text = None
    json_data = []

    with pdfplumber.open(pdf_path) as pdf:
        first_page = pdf.pages[0]
        pdf_text  = first_page.extract_text()

        data_list = split_on_empty_lines(pdf_text)
        for data in data_list:
            # split by new line in data chunks
            data_after_split_on_new_line = re.split(r"\n", data.strip())
            data_dict = {}
            # Process line 1 in the data chunk
            iterate_through_regex_and_populate_dictionaries(data_dict, my_regex_dict_line1, data_after_split_on_new_line[0])
            # Process line 2 in the data chunk
            iterate_through_regex_and_populate_dictionaries(data_dict, my_regex_dict_line2, data_after_split_on_new_line[1])
            # Process line 3 in the data chunk
            iterate_through_regex_and_populate_dictionaries(data_dict, my_regex_dict_line3, data_after_split_on_new_line[2])
            # Check if the next line contains the table columns Date and Notes
            if(len(data_after_split_on_new_line) > 3 and data_after_split_on_new_line[3] is not None and 'Date' in data_after_split_on_new_line[3] and 'Notes' in data_after_split_on_new_line[3]):
                populate_date_notes(data_dict, data_after_split_on_new_line)
            json_data.append(data_dict)
    print(json_data)

This gives us the data in a clean, elegant JSON-like format.

[screenshot: extracted JSON output]

Now that we have the data in JSON format, we can load it into a CSV file, a text file, or a DataFrame.
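For example, writing the extracted records out as CSV could look like the sketch below, using only the standard csv module. The sample record here is illustrative, not taken from the actual PDF; it flattens each record's Date/Notes lists so every date/note pair becomes its own row:

```python
import csv

# json_data as produced by the script above (sample values are illustrative)
json_data = [
    {'Our Ref': '123', 'Name': 'John', 'Date': ['01/01/2022'], 'Notes': ['Called']},
]

# Flatten the Date/Notes lists: one output row per date/note pair,
# repeating the record-level fields on each row.
rows = []
for record in json_data:
    dates = record.get('Date', [])
    notes = record.get('Notes', [])
    base = {k: v for k, v in record.items() if k not in ('Date', 'Notes')}
    if dates:
        for d, n in zip(dates, notes):
            rows.append({**base, 'Date': d, 'Notes': n})
    else:
        rows.append(base)

fieldnames = ['Our Ref', 'Name', 'Date', 'Notes']
with open('sample.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
```

The same `rows` list can also be passed straight to `pandas.DataFrame(rows)` if you prefer a DataFrame.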

Please let me know if you need any information on any part of the code or you need the full explanation of the code.

Himanshu Poddar
  • thanks. May I ask about `if(len(data_after_split_on_new_line) > 3 and data_after_split_on_new_line[3] != None and 'Date' in data_after_split_on_new_line[3] and 'Notes' in data_after_split_on_new_line[3]):` — why > 3? If there is a blank line, can we still use `blank_line_regex = r"\n *\n{1}"`? – kkk Jun 15 '22 at 07:24
  • The greater-than-3 check is added so that if there are no entries for the Dates and Notes table, we don't go ahead. Also, blank_line_regex is for an empty blank line; a simple new line can be referred to as `\n` – Himanshu Poddar Jun 15 '22 at 07:29
  • your answer is great. I'd like to clarify more about `iter = 4`. Could you please provide a full explanation in code? In the meantime, I have updated the PDF file that I mentioned. Could you please take a look at this updated file and how to extract it with the current code? – kkk Jun 15 '22 at 13:18
  • Since the structure of the page has changed now, this code won't work perfectly on it, as there are so many other attributes added to the page; the code will work perfectly for the previous pdf that you attached. Could you please ask another question, and I'll see if I can answer it. Also, could you please put the previous pdf back for this question, as readers might get confused and the code might not work for this doc. – Himanshu Poddar Jun 15 '22 at 13:21
  • As I said, iter=4 is added to extract the Date and Notes data present in the list after the 3rd index. – Himanshu Poddar Jun 15 '22 at 13:23
  • Yes, I have replaced the previous PDF. Sorry for being late to accept your solution. Thanks for your great code. I'll ask in new post again. If you could answer, please provide it. – kkk Jun 15 '22 at 13:39
  • if you want we can have a call and we can go over the code. – Himanshu Poddar Jun 15 '22 at 13:39
  • thanks, I think upvoting is pressing the up arrow beside the answer. Already pressed it as well – kkk Jun 15 '22 at 13:42
  • I have asked in new post for new PDF format. https://stackoverflow.com/questions/72641009/how-to-extract-text-from-pdf-files-and-save-as-json-csv-file-using-python-or-jav – kkk Jun 16 '22 at 05:44
  • Hi @kkk, I'll check and get back on this – Himanshu Poddar Jun 16 '22 at 06:20

First, install tabula-py with the following command: `pip install tabula-py` (note that tabula-py requires a Java runtime).

# Import the required Module
import tabula
# Read a PDF File
df = tabula.read_pdf("sample.pdf", pages='all')[0]
# convert PDF into CSV
tabula.convert_into("sample.pdf", "sample.csv", output_format="csv", pages='all')
print(df)
Kapil Musale

I was going to suggest that many extractors use poppler as their PDF-to-text engine, so to split into columns you could expect something like:

[screenshot: column positions in the extracted text]

However, there is simply too much overlap in the various staggered rows, so such an approach would be very unreliable. As suggested, the simpler method is to work with the bold text features. Thus I would use a find-and-replace method to add commas to the single column that the page layout provides.

You need a `,` after every `/2022` (or similarly `/2021`),

and likewise before every `Name:`, `Ref # :`, `Total Paid:` etc.

The simplest way is to export with `pdftotext -layout` (plus any other preferences) to out.txt, then parse the text to inject the commas. But watch out for existing commas: `845***Ringing, No reply` can be left as-is if it is meant to span 2 columns, but other cases may not be suited and need "quoting".

Your desired output does not match a plain CSV input, so you need a much more custom clip of text after each row entry, and then deal with the vertical first two columns as a separate second step.
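As a rough sketch of the comma-injection idea above: the line shape here is an assumption about what `pdftotext -layout` produces for one table row, not taken from the actual file. The notes field is quoted whenever it contains a comma, so the embedded comma survives the CSV conversion:

```python
import re

# Hypothetical row as produced by `pdftotext -layout` (shape is an assumption)
line = '01/02/2022   845***Ringing, No reply'

# Split the row into a date and a notes field, then join them with a comma,
# quoting the notes if they contain a comma (doubling any embedded quotes).
m = re.match(r'(\d{2}/\d{2}/\d{4})\s+(.*)$', line)
csv_line = None
if m:
    date, notes = m.groups()
    if ',' in notes:
        notes = '"' + notes.replace('"', '""') + '"'
    csv_line = f'{date},{notes}'
```

This handles only a single two-column row; the vertical key:value columns would still need the separate second pass described above.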

K J