Using pdfplumber to extract text from large PDF files crashes the process: memory keeps growing with each page until it runs out.
with pdfplumber.open("data/my.pdf") as pdf:
    for page in pdf.pages:
        ...  # do something with each page
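For concreteness, a minimal sketch of that loop, assuming the per-page work is plain text extraction with extract_text (an assumption, since the real work is elided above):

import pdfplumber

with pdfplumber.open("data/my.pdf") as pdf:
    for page in pdf.pages:
        # Assumed per-page work: plain text extraction.
        text = page.extract_text()
        print(len(text or ""))

On a PDF with many pages, memory usage grows with every iteration until the process is killed.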
Solution found at https://github.com/jsvine/pdfplumber/issues/193.

New:
with pdfplumber.open("data/my.pdf") as pdf:
    for page in pdf.pages:
        run_my_code()
        page.flush_cache()
Old:
with pdfplumber.open("data/my.pdf") as pdf:
    for page in pdf.pages:
        run_my_code()
        del page._objects
        del page._layout
These two attributes seem to bear most of the responsibility for hogging memory after each iteration; deleting them helps keep the loop from exhausting the machine's memory.
If that alone does not work, try forcing the garbage collector to collect them as well:
import gc

import pdfplumber

with pdfplumber.open("data/my.pdf") as pdf:
    for page in pdf.pages:
        run_my_code()
        del page._objects
        del page._layout
        gc.collect()
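Putting the newer approach together with an actual text-extraction loop, a sketch might look like the following (extract_pages is just a hypothetical helper name, not part of pdfplumber):

import pdfplumber

def extract_pages(path):
    # Hypothetical helper: yields each page's text and releases the
    # page's cached objects so memory stays roughly flat across the loop.
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            text = page.extract_text()
            page.flush_cache()
            yield text

for text in extract_pages("data/my.pdf"):
    print(len(text or ""))

Yielding each page's text instead of accumulating it in a list also keeps only one page's worth of extracted text alive at a time.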
I found that after extracting text, the lru_cache was somehow not being cleared, causing memory to keep filling up until it eventually ran out. After some experimenting, the following code helped me. In it, I am clearing both the page cache and the lru cache.
import pdfplumber

with pdfplumber.open("path-to-pdf/my.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text(layout=True)
        page.flush_cache()
        # get_text_layout is the function where the lru_cache is implemented
        page.get_text_layout.cache_clear()
PS: I am currently using pdfplumber version 0.71. I hope this helps someone.