My code extracts text from PDF files and compares the contents, but it fails on large PDFs

Question

I can use my code to compare small PDFs, but on large PDFs it fails with all sorts of error messages. Below is my code:


import pdfminer
import pandas as pd
from time import sleep
from tqdm import tqdm
from itertools import chain
import slate



# List of pdf files to process
pdf_files = ['file1.pdf', 'file2.pdf']

# Create a list to store the text from each PDF
pdf1_text = []
pdf2_text = []

# Iterate through each pdf file
for pdf_file in tqdm(pdf_files):
    # Open the pdf file
    with open(pdf_file, 'rb') as pdf_now:
        # Extract text using slate
        text = slate.PDF(pdf_now)
        text = text[0].split('\n')
        if pdf_file == pdf_files[0]:    
            pdf1_text.append(text)
        else:
            pdf2_text.append(text)

    sleep(20)

pdf1_text = list(chain.from_iterable(pdf1_text))
pdf2_text = list(chain.from_iterable(pdf2_text))

differences = set(pdf1_text).symmetric_difference(pdf2_text)

## Create a new dataframe to hold the differences
differences_df = pd.DataFrame(columns=['pdf1_text', 'pdf2_text'])

# Iterate through the differences and add them to the dataframe
for difference in differences:
    # Create a new row in the dataframe with the difference from pdf1 and pdf2
    differences_df = differences_df.append({'pdf1_text': difference if difference in pdf1_text else '',
                                            'pdf2_text': difference if difference in pdf2_text else ''}, ignore_index=True)

# Write the dataframe to an excel sheet
differences_df = differences_df.applymap(lambda x: x.encode('unicode_escape').decode('utf-8') if isinstance(x, str) else x)

differences_df.to_excel('differences.xlsx', index=False, engine='openpyxl')


import openpyxl

import re

# Load the Excel file into a dataframe
df = pd.read_excel("differences.xlsx")

# Create a condition to check the number of words in each cell
for column in ["pdf1_text", "pdf2_text"]:
    df[f"{column}_word_count"] = df[column].str.split().str.len()
    condition = df[f"{column}_word_count"] < 10
    # Drop the rows that meet the condition
    df = df[~condition]

for column in ["pdf1_text", "pdf2_text"]:
    df = df.drop(f"{column}_word_count", axis=1)


# Save the modified dataframe to a new Excel file
df.to_excel("differences.xlsx", index=False)

The last error I got is shown below. Could anyone please go through the code and help me find the actual problem?

TypeError: %d format: a real number is required, not bytes

Could you edit your post with the full stack trace (i.e. not only the last line with the "TypeError")? — slothrop, Feb 01 '23 at 10:10
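
One way to get the full stack trace is to wrap the failing step and print the formatted traceback. The sketch below is only an illustration: `capture_traceback` and `failing_step` are hypothetical names, and `failing_step` merely triggers the same kind of TypeError by passing bytes to a `%d` format. It is not known to be the actual cause in the code above.

```python
import traceback


def capture_traceback(func, *args, **kwargs):
    """Run func and return the full traceback text if it raises, else None."""
    try:
        func(*args, **kwargs)
    except Exception:
        # format_exc() returns the same multi-line text Python would print
        return traceback.format_exc()
    return None


def failing_step():
    # Hypothetical stand-in: formatting bytes with %d raises a TypeError
    value = b"12"
    return "%d" % value


if __name__ == "__main__":
    print(capture_traceback(failing_step))
```

Pasting the whole printed text into the post shows which library call raised the error, not just the final message.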

score 0 · Accepted Answer · answered Feb 01 '23 at 10:20


If you really want to speed up your script by at least an order of magnitude, I recommend using PyMuPDF instead of PyPDF2 or pdfminer. I usually measure run times that are 10 to 35 times (!) shorter. And of course, no time.sleep(): why would you ever want to artificially slow down processing?

Here is how reading the text lines of the two PDFs would work with PyMuPDF:

import fitz  # PyMuPDF

doc1 = fitz.open("file1.pdf")
doc2 = fitz.open("file2.pdf")

text1 = "\n".join([page.get_text() for page in doc1])
text2 = "\n".join([page.get_text() for page in doc2])

lines1 = text1.splitlines()
lines2 = text2.splitlines()

# then do your comparison ...
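
The comparison step left open above could look like the following. This is a sketch under assumptions: `diff_rows` and `write_rows` are hypothetical helpers, the set-based comparison follows the question's approach (so duplicate lines and line order are ignored), and the output is CSV from the standard library rather than Excel to keep it dependency-free.

```python
import csv


def diff_rows(lines1, lines2):
    """Return rows of lines that occur in only one of the two line lists."""
    set1, set2 = set(lines1), set(lines2)
    rows = []
    # Lines found only in the first PDF go in the left column
    for line in sorted(set1 - set2):
        rows.append({"pdf1_text": line, "pdf2_text": ""})
    # Lines found only in the second PDF go in the right column
    for line in sorted(set2 - set1):
        rows.append({"pdf1_text": "", "pdf2_text": line})
    return rows


def write_rows(rows, path):
    """Write the difference rows to a CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["pdf1_text", "pdf2_text"])
        writer.writeheader()
        writer.writerows(rows)


if __name__ == "__main__":
    # Tiny stand-ins for lines1 and lines2 from the snippet above
    print(diff_rows(["a", "b", "c"], ["b", "c", "d"]))
```

Collecting the rows in a list first avoids growing a DataFrame one row at a time. If Excel output is needed, the list can be passed to `pd.DataFrame(rows)` in one step, which also sidesteps `DataFrame.append`, a method that was removed in pandas 2.0.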

answered Feb 01 '23 at 10:20

Jorj McKie

Thank you Jorj! But how can I improve the quality of the extracted text? Whichever method I use, I get text along with escape characters, and sometimes text mixed up from different places. Slate was really useful here: it gave me the exact lines from the PDF, even though there were some inconsistencies. Is there any way to improve that? – hexapod Feb 01 '23 at 10:52
@hexapod - sure! I just showed the most elementary text extraction variant. There are ways to deal with scrambled text sequences and white space, and to return text positions down to single characters, font properties (boldness, font sizes, etc.) and text color. You are probably suffering from text whose read-out sequence differs from the natural reading sequence, so some sorting must occur. A first step is using `page.get_text(sort=True)` in my example. But we can become much more sophisticated, too. I would probably need an example that causes pain to demonstrate all this. – Jorj McKie Feb 01 '23 at 17:39
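
As a rough sketch of that first step, the snippet below extracts lines with `sort=True` and then normalizes whitespace before any comparison. `extract_lines` and `normalize_lines` are hypothetical helper names, and the whitespace cleanup is an assumption about what makes lines comparable, not something PyMuPDF requires. It also assumes a PyMuPDF version whose `get_text` accepts the `sort` argument.

```python
def normalize_lines(lines):
    """Collapse runs of whitespace and drop empty lines."""
    cleaned = (" ".join(line.split()) for line in lines)
    return [line for line in cleaned if line]


def extract_lines(path):
    """Return normalized text lines of a PDF, roughly in reading order."""
    import fitz  # PyMuPDF; imported here so normalize_lines works without it

    lines = []
    with fitz.open(path) as doc:
        for page in doc:
            # sort=True orders the text from top-left to bottom-right
            lines.extend(page.get_text(sort=True).splitlines())
    return normalize_lines(lines)
```

With this in place, `lines1 = extract_lines("file1.pdf")` and `lines2 = extract_lines("file2.pdf")` would replace the join and split steps in the answer's example.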
