Split image/pdf based on specific text with Python

Question

I want to split a PDF (or an image, if needed) based on the text in it, so that each question and its options end up in a separate file, like a screenshot of just that question and its options.

Sample PDF link: https://drive.google.com/file/d/1UtMropzRdfJwQjaRf9kZa1UpAzrKlH-K/view?usp=sharing

Is it even possible? If so, what code is needed to accomplish this? I am new to Python, so some explanation would help. I have almost 100 of these PDFs and want to automate extracting each individual question and its options.

@YashMakan https://drive.google.com/file/d/1UtMropzRdfJwQjaRf9kZa1UpAzrKlH-K/view?usp=sharing — Whiskey Jay, Dec 04 '20 at 12:10
I see that you know of PyPDF2 already. I would go the route of reading/parsing the PDF rather than image processing and OCR. That would only destroy information you already have. — Christoph Rackwitz, Dec 04 '20 at 12:17
I mean use PyPDF2 (or another library/program) to read the PDF file and give you plain text. PDF files may contain only individual glyphs with no "word" or "sentence" structure. PDF reading libraries have to infer (parse) that, so the result is sometimes imperfect. — Christoph Rackwitz, Dec 04 '20 at 12:52
@ChristophRackwitz I want to keep the format of the questions similar to the one in the PDF, as many of the options are images and won't appear in the parsed document. So is it possible to split the PDF itself based on some specific 'mark', like the question number? — Whiskey Jay, Dec 04 '20 at 13:00
I'm sure it's possible to read the PDF file into its constituent parts, which are text with positions and images with positions (and sizes). Please research the existing libraries and their capabilities. — Christoph Rackwitz, Dec 04 '20 at 13:33

score 0 · Answer 1 · answered Dec 04 '20 at 12:39

Step 1: Install pdftotext and put the .exe in your working directory.
Step 2: Copy the code below into a .py file in the same directory.
Step 3: Make sure the PDF files are in the same directory as well.
Step 4: Run the .py file.

Complete code that worked for me:

import glob
import os
import subprocess

files = []
# Remember to put pdftotext.exe in the folder with your PDF files.
for filename in glob.glob(os.path.join(os.getcwd(), '*.pdf')):
    txt_name = filename[:-4] + '.txt'
    files.append(txt_name)
    # Convert each PDF to a plain-text file with the same base name.
    subprocess.call([os.path.join(os.getcwd(), 'pdftotext'), filename, txt_name])

all_files = []
for txt_name in files:
    with open(txt_name, 'r') as f:
        text = f.read()
    # Keep only the text between the header and footer markers of the sample PDF.
    text = text.split('carry one mark each')[1].split('WWW.UNITOPERATION.COM')[0]
    ques = []
    counter = 1
    for line in text.splitlines():
        if line.startswith(str(counter) + '.'):
            # A line starting with "<counter>." begins the next question.
            ques.append(line)
            counter += 1
        elif ques:
            # Any other line belongs to the question currently being collected.
            ques[-1] += '\n' + line
    all_files.append(ques)

# all_files now holds one list of questions (ques) per PDF.
# Take the elements out of all_files one by one and write each to a .txt file
# to finish the task.
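
The final step left to the reader in the comments above could be sketched roughly as follows. This is only an illustration, not part of the original answer: the function name write_questions, the out_dir parameter, and the "<pdf name>_q<number>.txt" naming scheme are all assumptions.

```python
import os

def write_questions(all_files, txt_names, out_dir):
    """Write every question to its own text file and return the paths written.

    all_files: one list of questions per PDF (as built by the answer's loop).
    txt_names: the matching .txt file names, used only to name the output.
    out_dir:   hypothetical output folder; created if it does not exist.
    """
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for txt_name, ques in zip(txt_names, all_files):
        base = os.path.splitext(os.path.basename(txt_name))[0]
        for number, question in enumerate(ques, start=1):
            # A question may be stored as one string or as a list of lines.
            if not isinstance(question, str):
                question = '\n'.join(question)
            out_path = os.path.join(out_dir, f'{base}_q{number}.txt')
            with open(out_path, 'w', encoding='utf-8') as out:
                out.write(question.strip() + '\n')
            written.append(out_path)
    return written
```

With the variables from the answer, a call such as write_questions(all_files, files, 'questions') would produce one small text file per question.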

Believe it or not, the other PDF reader libraries do not work as well as pdftotext.exe. — Yash Makan, Dec 04 '20 at 12:41
Thanks, that actually helped. I had tried this method (although I used an online API to do it). However, many of the options are images rather than text, and there are tables and matrices as well, which don't get formatted well as text. So I thought it would be easier to split the PDF itself, or to convert it into an image and then split that with something like OpenCV. It's just that I don't know how to accomplish that. — Whiskey Jay, Dec 04 '20 at 12:55
I don't have an answer for the options that are images right now. If this actually helped, then kindly upvote, because it motivates me to solve problems. — Yash Makan, Dec 04 '20 at 13:02
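
For the image case, one possible direction (following the earlier suggestion to read text together with its positions) is to find the vertical position of each question number and render only that band of the page as an image, so that picture options, tables and matrices stay intact. The sketch below is untested against the sample PDF and makes several assumptions: it uses the third-party PyMuPDF package (imported as fitz) in a recent version, it assumes each question begins its own text block with "<number>.", and it does not handle a question that runs over a page break. The names question_bands and save_question_images are made up for illustration.

```python
def question_bands(lines, page_bottom, next_number=1):
    """Return (number, top, bottom) for each question start found on a page.

    lines: (y_top, text) pairs sorted from top to bottom.
    A question is assumed to start where the text begins with "<number>.",
    and to end where the next question starts (or at the page bottom).
    """
    starts = []
    for y_top, text in lines:
        if text.strip().startswith(f'{next_number}.'):
            starts.append((next_number, y_top))
            next_number += 1
    bands = []
    for idx, (number, top) in enumerate(starts):
        bottom = starts[idx + 1][1] if idx + 1 < len(starts) else page_bottom
        bands.append((number, top, bottom))
    return bands

def save_question_images(pdf_path, out_prefix):
    """Render each question band as a PNG (requires PyMuPDF)."""
    import fitz  # PyMuPDF; assumed to be installed, e.g. via pip

    next_number = 1
    with fitz.open(pdf_path) as doc:
        for page in doc:
            # Each block is (x0, y0, x1, y1, text, block_no, block_type);
            # block_type 0 marks a text block.
            blocks = page.get_text("blocks")
            lines = sorted((b[1], b[4]) for b in blocks if b[6] == 0)
            for number, top, bottom in question_bands(lines, page.rect.y1, next_number):
                # Crop the full page width between this question and the next.
                clip = fitz.Rect(page.rect.x0, top, page.rect.x1, bottom)
                page.get_pixmap(clip=clip, dpi=150).save(f'{out_prefix}_q{number}.png')
                next_number = number + 1
```

The band calculation is kept separate from the rendering so it can be checked on plain (position, text) pairs before running it on real PDFs; the start-of-question test will likely need adjusting to how the numbers are actually laid out in each file.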
