Any way to multithread PDF mining?

Question

I have code that searches for a particular string across a bunch of PDFs. The problem is that this process is extremely slow. (Sometimes I get PDFs with over 50,000 pages.)

Is there a way to use multithreading here? I searched, but I couldn't make heads or tails of the threading examples I found.

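import os

import slate3k as slate

# Directory containing the unzipped PDFs
f = 'C:/Users/akhan37/Desktop/learning profiles/unzipped/unzipped_files'

idee = "123456789"
os.chdir(f)
for file in os.listdir('.'):
    print(file)
    with open(file, 'rb') as g:
        # slate.PDF behaves like a list with one text string per page
        extracted_text = slate.PDF(g)

    # Search each page's text. Testing `idee in extracted_text` directly
    # would only match a page whose entire text equals idee.
    for page_number, page_text in enumerate(extracted_text, start=1):
        if idee in page_text:
            print(file, page_number)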

The run time is very long. I don't think it's the code's fault, but rather the fact that I have to go through over 700 PDFs.

Idk about parallelizing the checking of a single PDF, but you could use `multiprocessing`'s `Pool`'s `map` function to check multiple PDFs at the same time. — Carcigenicate, Oct 24 '19 at 18:36
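
A minimal sketch of that suggestion, assuming `slate.PDF` returns a list-like object with one text string per page. The function names `matching_pages`, `search_file`, and `search_folder` are made up for illustration and are not part of any library.

```python
import os
from functools import partial
from multiprocessing import Pool


def matching_pages(pages, needle):
    """Return the 1-based numbers of the pages whose text contains needle."""
    return [n for n, text in enumerate(pages, start=1) if needle in text]


def search_file(path, needle):
    """Worker: extract one PDF and report which of its pages match."""
    import slate3k as slate  # third-party, same library as in the question

    with open(path, "rb") as fh:
        pages = slate.PDF(fh)  # assumed: list-like, one string per page
    return path, matching_pages(pages, needle)


def search_folder(folder, needle, processes=None):
    """Check every PDF in folder, several files at a time."""
    paths = sorted(
        os.path.join(folder, name)
        for name in os.listdir(folder)
        if name.lower().endswith(".pdf")
    )
    # processes=None lets Pool start one worker per CPU core
    with Pool(processes) as pool:
        results = pool.map(partial(search_file, needle=needle), paths)
    # Keep only the files that had at least one matching page
    return {path: hits for path, hits in results if hits}
```

With the question's variables this would be called as `search_folder(f, idee)`. On Windows the call has to sit under an `if __name__ == "__main__":` guard, because each worker process re-imports the script. Processes are used rather than threads because text extraction is CPU-bound, so threads would mostly wait on each other.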

score 3 · Accepted Answer · answered Oct 24 '19 at 18:38

I would suggest using pdfminer. You can convert the document object into a list of page objects, which you can then process in parallel on different cores.

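    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LAParams
    from pdfminer.pdfdocument import PDFDocument, PDFTextExtractionNotAllowed
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfparser import PDFParser

    # pdf_path and password are supplied by the caller
    fp = open(pdf_path, "rb")
    parser = PDFParser(fp)
    document = PDFDocument(parser, password)
    if not document.is_extractable:
        raise PDFTextExtractionNotAllowed

    laparams = LAParams()  # layout analysis parameters (defaults)
    resource_manager = PDFResourceManager()
    device = PDFPageAggregator(resource_manager, laparams=laparams)
    interpreter = PDFPageInterpreter(resource_manager, device)

    all_attributes = []

    # Materialize every page object so pages can be indexed or divided up
    list_of_page_obj = list(PDFPage.create_pages(document))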
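
One possible way to spread the pages over cores is sketched below. It rests on an assumption worth checking: page objects keep a reference to the open document, so they may not pickle cleanly when handed to another process. The sketch therefore sends page numbers instead, and each worker reopens the file itself. The names `chunk_page_numbers`, `search_chunk`, and `search_pdf` are hypothetical, and `page_count` would be `len(list_of_page_obj)`.

```python
from multiprocessing import Pool


def chunk_page_numbers(page_count, chunks):
    """Split range(page_count) into at most `chunks` contiguous lists."""
    size = max(1, -(-page_count // chunks))  # ceiling division
    return [
        list(range(start, min(start + size, page_count)))
        for start in range(0, page_count, size)
    ]


def search_chunk(pdf_path, needle, page_numbers):
    """Worker: return the 0-based page numbers in this chunk that match."""
    from io import StringIO

    # pdfminer is third-party; it is imported inside the worker process
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    hits = []
    resource_manager = PDFResourceManager()
    with open(pdf_path, "rb") as fp:
        # get_pages yields only the requested pages, in document order
        pages = PDFPage.get_pages(fp, pagenos=set(page_numbers))
        for number, page in zip(sorted(page_numbers), pages):
            buffer = StringIO()
            device = TextConverter(resource_manager, buffer, laparams=LAParams())
            PDFPageInterpreter(resource_manager, device).process_page(page)
            device.close()
            if needle in buffer.getvalue():
                hits.append(number)
    return hits


def search_pdf(pdf_path, needle, page_count, processes=4):
    """Search one large PDF by giving each process a block of pages."""
    chunks = chunk_page_numbers(page_count, processes)
    with Pool(processes) as pool:
        results = pool.starmap(
            search_chunk, [(pdf_path, needle, chunk) for chunk in chunks]
        )
    # Report 1-based page numbers, as a PDF viewer would show them
    return sorted(n + 1 for hits in results for n in hits)
```

Every worker still has to parse the document structure once, so this is only likely to pay off for very large files; for many small PDFs, running one file per process is simpler.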

Bill Chen

I will try this out. Essentially, I need to copy that PDF page to another PDF. The file contains a bunch of reports, and I need to pull certain pages out in order for people to get their report and the report of others. This is why my original code prints the page number and the file name, so I can go into something like Adobe and print a PDF. – Moo10000 Oct 24 '19 at 18:51
If you want to get a certain page or a list of pages, you can just do `list_of_page_obj[0]`, which will give you the first page. – Bill Chen Oct 27 '19 at 16:54
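
To check a single page object such as `list_of_page_obj[0]` for the search string, something along these lines could work with the `interpreter` and `device` from the answer. It is a sketch: `layout_text` and `page_contains` are hypothetical helper names, and it assumes that the layout elements carrying text expose a `get_text()` method while figures, lines, and rectangles do not.

```python
def layout_text(layout):
    """Join the text of every layout element that exposes get_text()."""
    return "".join(
        element.get_text() for element in layout if hasattr(element, "get_text")
    )


def page_contains(page, needle, interpreter, device):
    """Render one page through the interpreter and search its text."""
    interpreter.process_page(page)
    layout = device.get_result()  # layout of the page just processed
    return needle in layout_text(layout)
```

For example, `page_contains(list_of_page_obj[0], idee, interpreter, device)` would test the first page, and looping with `enumerate(list_of_page_obj, start=1)` gives the page numbers to print.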
