
Using pdfplumber to extract text from large PDF files crashes my script because it runs out of memory.

import pdfplumber

with pdfplumber.open("data/my.pdf") as pdf:
    for page in pdf.pages:
        page.extract_text()  # or whatever per-page processing you run
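
A quick way to observe the growth, using only the standard-library tracemalloc module (data/my.pdf is just a placeholder for the large file), is something like:

import tracemalloc

import pdfplumber

tracemalloc.start()
with pdfplumber.open("data/my.pdf") as pdf:
    for i, page in enumerate(pdf.pages):
        page.extract_text()
        current, peak = tracemalloc.get_traced_memory()
        print(f"page {i}: current {current / 1e6:.1f} MB, peak {peak / 1e6:.1f} MB")

The traced memory keeps climbing page after page until the run eventually crashes.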
Filipe Lemos

2 Answers


Solution found at https://github.com/jsvine/pdfplumber/issues/193.

New:

import pdfplumber

with pdfplumber.open("data/my.pdf") as pdf:
    for page in pdf.pages:
        run_my_code()       # your per-page processing
        page.flush_cache()  # drop the objects pdfplumber has cached for this page

Old:

import pdfplumber

with pdfplumber.open("data/my.pdf") as pdf:
    for page in pdf.pages:
        run_my_code()       # your per-page processing
        del page._objects   # internal per-page cache
        del page._layout    # internal per-page cache

These two attributes seem to be the ones most responsible for hogging memory after each loop iteration; deleting them helps keep the memory footprint down.

If this does not work, try also forcing the garbage collector to clean them up:

import gc

import pdfplumber

with pdfplumber.open("data/my.pdf") as pdf:
    for page in pdf.pages:
        run_my_code()
        del page._objects
        del page._layout
        gc.collect()  # force a collection after each page
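
Whichever variant you use, it also helps not to accumulate the extracted text in a Python list if the document is large. A sketch (output.txt is just an example path) that writes each page's text straight to disk and flushes the page cache as it goes:

import pdfplumber

with pdfplumber.open("data/my.pdf") as pdf, open("output.txt", "w", encoding="utf-8") as out:
    for page in pdf.pages:
        text = page.extract_text() or ""  # extract_text() may return None for empty pages
        out.write(text + "\n")
        page.flush_cache()                # release the page's cached objects and layout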
Filipe Lemos

I found that after extracting text, the lru_cache was somehow not being cleared, so memory kept filling up until it eventually ran out. After some experimenting, the following code helped me: it clears both the page cache and the LRU cache.

import pdfplumber

with pdfplumber.open("path-to-pdf/my.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text(layout=True)
        page.flush_cache()

        # This was the function where the lru_cache is implemented
        page.get_text_layout.cache_clear()

PS: I am currently using pdfplumber version 0.71. I hope this helps someone.
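
If a single pass still uses too much memory, another option is to reopen the file in small batches of pages. This assumes your pdfplumber version supports the pages= argument to pdfplumber.open (a list of 1-indexed page numbers); the batch size of 20 is arbitrary:

import pdfplumber

PATH = "path-to-pdf/my.pdf"
BATCH = 20  # arbitrary batch size, tune for your documents

with pdfplumber.open(PATH) as pdf:
    n_pages = len(pdf.pages)  # first pass only to count the pages

for start in range(1, n_pages + 1, BATCH):
    batch = list(range(start, min(start + BATCH, n_pages + 1)))
    with pdfplumber.open(PATH, pages=batch) as pdf:  # load only this batch of pages
        for page in pdf.pages:
            text = page.extract_text(layout=True)
            page.flush_cache()

Each with block then only ever holds one batch of pages, and everything is released when the block exits.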

Nav