
How can I extract text from PDF files in the format below? PyPDF2 does not extract the text in a properly readable format.

[screenshot: source PDF layout]

I have explored PyPDF2 and Pandas. Both are able to extract the data, but the data is stored as a single column. I need to store the extracted data as CSV files in the desired format below.

[screenshot: desired CSV output]

This is what I have tried:

import PyPDF2
import openpyxl

from openpyxl import Workbook

pdfFileObj = open('sample.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)  # PdfReader() in PyPDF2 >= 3.0
pdfReader.numPages  # number of pages in the document

pageObj = pdfReader.getPage(0)
mytext = pageObj.extractText()  # pages[0].extract_text() in PyPDF2 >= 3.0


wb = Workbook()
sheet = wb.active
sheet.title = 'MyPDF'
sheet['A1'] = mytext

wb.save('sample.xlsx')
print('Save')

https://drive.internxt.com/s/file/88ee48b3286a67a853f5/d681ac60df525bbd5fadc14eb433f363b146fbd0065552a8c7ebdea8cc776ed9

kkk
3 Answers


Here is a clean and elegant automation script for your use case. I have made some assumptions about your PDF:

  1. Let's say I am treating each record as an entity containing the following fields:

[screenshot: fields of one record]

Line 1

Our Ref:
Name: 
Ref 1:
Ref 2:

Line 2

Amount: 
Total Paid:
Balance: 
Date of A/C: 
Date Received: 

Line 3

Last Paid:
Amt Last Paid: 
A/C Status:
Collector : 

Line 4

Date (Column name)    
Notes (Column name)

And then multiple line of Date and Notes values
  2. I assume that each record is separated by one blank line, as shown here.

[screenshot: records separated by a blank line]

  3. Also, there will be no other attributes apart from the ones listed above in your data.

  4. Also, there will always be 3 lines of actual key:value pairs before the table for Date and Notes starts.

CODE

Before you move ahead, please install the package pdfplumber: `pip install pdfplumber`

All you need to change in the code below is `pdf_path`.

import pdfplumber
import re

# regex pattern for keys in line1
my_regex_dict_line1 = {
    'Our Ref' : r'Our Ref :(.*?)Name',
    'Name' : r'Name:(.*?)Ref 1',
    'Ref 1' : r'Ref 1 :(.*?)Ref 2',
    'Ref 2' : r'Ref 2:(.*?)$'
}

# regex pattern for keys in line2
my_regex_dict_line2 = {
    'Amount' : r'Amount:(.*?)Total Paid',
    'Total Paid' : r'Total Paid:(.*?)Balance',
    'Balance' : r'Balance:(.*?)Date of A/C',
    'Date of A/C' : r'Date of A/C:(.*?)Date Received',
    'Date Received' : r'Date Received:(.*?)$'
}

# regex pattern for keys in line3
my_regex_dict_line3 ={
    'Last Paid' : r'Last Paid:(.*?)Amt Last Paid',
    'Amt Last Paid' : r'Amt Last Paid:(.*?)A/C Status',
    'A/C Status': r'A/C Status:(.*?)Collector',
    'Collector' : r'Collector :(.*?)$'
}


def split_on_empty_lines(s):
    ''' This function splits the pdf text into data chunks and returns them as a list '''
    blank_line_regex = r"\n *\n{1}"
    return re.split(blank_line_regex, s.strip())


def iterate_through_regex_and_populate_dictionaries(data_dict, regex_dict, text):
    ''' For the given regex_dict, this function iterates through each regex pattern and adds the matched key/value pair to the data_dict dictionary '''
    for key, regex in regex_dict.items():
        matched_value = re.search(regex, text)
        if matched_value is not None:
            data_dict[key] = matched_value.group(1).strip()

def populate_date_notes(data_dict, text):
    ''' This function populates the Date and Notes values of the data chunk as lists in the data_dict dictionary '''
    data_dict['Date'] = []
    data_dict['Notes'] = []
    idx = 4  # table rows start after the 3 key:value lines and the Date/Notes header
    while idx < len(text):
        date_match = re.search(r'(\d{2}/\d{2}/\d{4})', text[idx])
        notes_match = re.search(r'\d{2}/\d{2}/\d{4}\s*(.*?)$', text[idx])
        if date_match is not None and notes_match is not None:
            data_dict['Date'].append(date_match.group(1).strip())
            data_dict['Notes'].append(notes_match.group(1).strip())
        idx += 1

if(__name__ == '__main__'):
    pdf_path = r'C:\Users\hpoddar\Desktop\Temp\sample.pdf' # ENTER YOUR PDF PATH HERE
    pdf_text = None
    json_data = []

    with pdfplumber.open(pdf_path) as pdf:
        first_page = pdf.pages[0]
        pdf_text  = first_page.extract_text()

        data_list = split_on_empty_lines(pdf_text)
        for data in data_list:
            # split by new line in data chunks
            data_after_split_on_new_line = re.split(r"\n", data.strip())
            data_dict = {}
            # Process line 1 in the data chunk
            iterate_through_regex_and_populate_dictionaries(data_dict, my_regex_dict_line1, data_after_split_on_new_line[0])
            # Process line 2 in the data chunk
            iterate_through_regex_and_populate_dictionaries(data_dict, my_regex_dict_line2, data_after_split_on_new_line[1])
            # Process line 3 in the data chunk
            iterate_through_regex_and_populate_dictionaries(data_dict, my_regex_dict_line3, data_after_split_on_new_line[2])
            # Check if the next line contains the table columns Date and Notes
            if(len(data_after_split_on_new_line) > 3 and data_after_split_on_new_line[3] is not None and 'Date' in data_after_split_on_new_line[3] and 'Notes' in data_after_split_on_new_line[3]):
                populate_date_notes(data_dict, data_after_split_on_new_line)
            json_data.append(data_dict)
    print(json_data)

This gives us the data in a clean, elegant JSON-like format.

[screenshot: extracted JSON output]

Now that we have the data in JSON format, we can load it into a CSV file, a text file, or a DataFrame.
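For example, writing the extracted records out as CSV could look like the sketch below, using only the standard csv module. The sample record here is illustrative, not taken from the actual PDF; it flattens each record's Date/Notes lists so every date/note pair becomes its own row:

```python
import csv

# json_data as produced by the script above (sample values are illustrative)
json_data = [
    {'Our Ref': '123', 'Name': 'John', 'Date': ['01/01/2022'], 'Notes': ['Called']},
]

# Flatten the Date/Notes lists: one output row per date/note pair,
# repeating the record-level fields on each row.
rows = []
for record in json_data:
    dates = record.get('Date', [])
    notes = record.get('Notes', [])
    base = {k: v for k, v in record.items() if k not in ('Date', 'Notes')}
    if dates:
        for d, n in zip(dates, notes):
            rows.append({**base, 'Date': d, 'Notes': n})
    else:
        rows.append(base)

fieldnames = ['Our Ref', 'Name', 'Date', 'Notes']
with open('sample.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
```

The same `rows` list can also be passed straight to `pandas.DataFrame(rows)` if you prefer a DataFrame.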

Please let me know if you need any information on any part of the code or you need the full explanation of the code.

Himanshu Poddar
  • thanks. May I ask about `if(len(data_after_split_on_new_line) > 3 and data_after_split_on_new_line[3] != None and 'Date' in data_after_split_on_new_line[3] and 'Notes' in data_after_split_on_new_line[3]):` — why > 3? If there is a blank line, can we still use `blank_line_regex = r"\n *\n{1}"`? – kkk Jun 15 '22 at 07:24
  • The greater-than-3 check is added so that if there are no entries for the Dates and Notes table, we don't go ahead. Also, blank_line_regex is for an empty blank line; a simple new line can be referred to as `\n` – Himanshu Poddar Jun 15 '22 at 07:29
  • your answer is great. I'd like to clarify more about `iter = 4`. Could you please provide a full explanation in code? In the meantime, I have updated the PDF file that I mentioned. Could you please take a look at this updated file and how to extract it with the current code? – kkk Jun 15 '22 at 13:18
  • Since the structure of the page has changed now, this code won't work perfectly on it, as there are so many other attributes added to the page; the code will work perfectly for the previous pdf that you attached. Could you please ask another question, and I'll see if I can answer it. Also, could you please put the previous pdf back for this question, as readers might get confused and the code might not work for this doc. – Himanshu Poddar Jun 15 '22 at 13:21
  • As I said, iter=4 is added to extract the Date and Notes data present in the list after the 3rd index. – Himanshu Poddar Jun 15 '22 at 13:23
  • Yes, I have replaced the previous PDF. Sorry for being late to accept your solution. Thanks for your great code. I'll ask in new post again. If you could answer, please provide it. – kkk Jun 15 '22 at 13:39
  • if you want we can have a call and we can go over the code. – Himanshu Poddar Jun 15 '22 at 13:39
  • thanks, I think upvoting is pressing the up arrow beside the answer. Already pressed it as well – kkk Jun 15 '22 at 13:42
  • I have asked in new post for new PDF format. https://stackoverflow.com/questions/72641009/how-to-extract-text-from-pdf-files-and-save-as-json-csv-file-using-python-or-jav – kkk Jun 16 '22 at 05:44
  • Hi @kkk, I'll check and get back on this – Himanshu Poddar Jun 16 '22 at 06:20

First, install tabula-py with the following command: `pip install tabula-py` (note that tabula-py requires a Java runtime).

# Import the required Module
import tabula
# Read a PDF File
df = tabula.read_pdf("sample.pdf", pages='all')[0]
# convert PDF into CSV
tabula.convert_into("sample.pdf", "sample.csv", output_format="csv", pages='all')
print(df)
Kapil Musale

I was going to suggest that many extractors use poppler as their PDF-to-text engine, so to split into columns you could expect something like:

[screenshot: column positions in the extracted text]

However, there is simply too much overlap in the various staggered rows, so such an approach would be very unreliable. As suggested, the simpler method is to work with the bold text features. Thus I would use a find-and-replace method to add commas to the single column that the page layout provides.

You need a `,` after every `/2022` (or similarly `/2021`),

and likewise before every `Name:`, `Ref # :`, `Total Paid:` etc.

The simplest way is to export with `pdftotext -layout` (plus any other preferences) to out.txt, then parse the text to inject the commas. But watch out for existing commas: `845***Ringing, No reply` can be left as-is if it is meant to span 2 columns, but other cases may not be suited and need "quoting".

Your desired output does not match a plain CSV input, so you need a much more custom clip of text after each row entry, and then deal with the vertical first two columns as a separate second step.
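As a rough sketch of the comma-injection idea above: the line shape here is an assumption about what `pdftotext -layout` produces for one table row, not taken from the actual file. The notes field is quoted whenever it contains a comma, so the embedded comma survives the CSV conversion:

```python
import re

# Hypothetical row as produced by `pdftotext -layout` (shape is an assumption)
line = '01/02/2022   845***Ringing, No reply'

# Split the row into a date and a notes field, then join them with a comma,
# quoting the notes if they contain a comma (doubling any embedded quotes).
m = re.match(r'(\d{2}/\d{2}/\d{4})\s+(.*)$', line)
csv_line = None
if m:
    date, notes = m.groups()
    if ',' in notes:
        notes = '"' + notes.replace('"', '""') + '"'
    csv_line = f'{date},{notes}'
```

This handles only a single two-column row; the vertical key:value columns would still need the separate second pass described above.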

K J