Finding strings in a PDF and highlighting them using Python

Question

I am trying to search for strings in a PDF, highlight them, and save the result using Python. The search strings come from an Excel sheet (column 2) and include special characters. I tried the PyMuPDF library for this, but it gives the error below: "

Below is the code used:

# pip install pymupdf
# Library to interact with PDFs
import fitz
import time
# pip install xlrd
# Library to read Excel files
import xlrd

#Opening the excel file and the specified sheet. Excel File path to be supplied here
wb = xlrd.open_workbook('C:\\Users\\xyz\\Desktop\\PDF Property File.xlsx')
sh = wb.sheet_by_index(0)

#Read data from column:
value_list=sh.col_values(1, start_rowx=1)

### READ IN PDF
## Give the pdf file path here 
doc = fitz.open(r'C:\Users\xyz\Desktop\Test Demo--All Filled.pdf')
page = doc[0]

## IO operation
import os

# Build the timestamp once so the output file name is defined even if nothing matches
timestr = time.strftime("%Y%m%d-%H%M%S")

for page in doc:
    for i in value_list:
        #print(i)
        text_instances = page.searchFor(i)
        for inst in text_instances:
            highlight = page.addHighlightAnnot(inst)
doc.save(r"C:\Users\xyz\Desktop\Output\PDF"+ timestr +".pdf"   , garbage=4, deflate=True, clean=True)
os.system(r'C:\Users\xyz\Desktop\Output\PDF'+ timestr +".pdf")

score 0 · Answer 1 · answered Jun 11 '20 at 20:25

The explanation for your error: the entries in value_list are not strings. I don't know the xlrd package, so I cannot advise on how to change that ...
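
As a rough illustration of what "not strings" can mean here: xlrd returns numeric cells as float, so a column that mixes text and numbers produces a list of mixed types. The sketch below converts such values to search strings before they reach the PDF search. The helper name cell_to_search_text and the sample values are made up for this example and are not part of xlrd or PyMuPDF.

```python
def cell_to_search_text(value):
    """Turn a spreadsheet cell value into a string suitable for text search.

    Hypothetical helper for illustration; not part of xlrd or PyMuPDF.
    """
    if isinstance(value, float) and value.is_integer():
        # A whole number read as a float (e.g. 45.0) most likely
        # appears as "45" in the PDF, so drop the trailing ".0".
        return str(int(value))
    return str(value).strip()


# Sample values of the kind a spreadsheet column may yield (assumed data).
raw_values = ["Name & Co.", 45.0, 3.5, ""]
search_terms = [cell_to_search_text(v) for v in raw_values]
# Drop empty cells so they are never passed to the search.
search_terms = [t for t in search_terms if t]
```

Whether a whole-number float should be searched as "45" or "45.0" depends on how the number is printed in the PDF, so treat that branch as an assumption to adjust.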

answered Jun 11 '20 at 20:25

Jorj McKie

score 0 · Answer 2 · answered May 20 '21 at 09:02

Wrapping i in str() will solve the issue.

Change the line "text_instances = page.searchFor(i)" to "text_instances = page.searchFor(str(i))".
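
For context, a minimal sketch of the whole loop with that conversion applied might look like the following. It assumes a recent PyMuPDF release, where the methods are named search_for and add_highlight_annot (the question uses the older camelCase names searchFor and addHighlightAnnot). The function name highlight_terms and the file names in the usage comment are placeholders, not anything from the question.

```python
def highlight_terms(doc, terms):
    """Highlight every occurrence of each term on every page; return the hit count.

    `doc` is assumed to be an open PyMuPDF document. Any iterable of page
    objects offering search_for and add_highlight_annot would also work.
    """
    hits = 0
    for page in doc:
        for term in terms:
            needle = str(term)  # the fix: always search with a str
            if not needle.strip():
                continue  # skip empty cells
            for rect in page.search_for(needle):
                page.add_highlight_annot(rect)
                hits += 1
    return hits


# Usage sketch (file names are placeholders):
# import fitz
# doc = fitz.open("input.pdf")
# highlight_terms(doc, value_list)
# doc.save("output.pdf", garbage=4, deflate=True, clean=True)
```

Note that str() turns a numeric cell such as 45.0 into "45.0", which will not match a PDF that prints the number as "45", so numeric cells may need extra formatting before the search.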

answered May 20 '21 at 09:02

Yeshwanth_Mandla
