How to recognize a graph in a PDF using Python?

Question

I'm new to PDF parsing.

I want to recognize a graph in a PDF file so I can skip it and not extract its text. All I know about the PDF is that it was generated from Word (not scanned).

Input: a PDF with a graph such as this one. Output should be true or false.

pdfplumber recognizes tables but doesn't seem to recognize graphs. I tried detecting curves and rectangles, but the results are not consistent.

Maybe there's another way?

Thank you!

If you have MS Word on your machine, you could use pywin32 to read the PDF into Word. The Word object model _does_ treat graphs separately, so you could get just the text. — G5W, Nov 18 '22 at 16:38

score 0 · Answer 1 · answered Nov 22 '22 at 09:37

option 1:

(thanks to @KJ's comment) I ended up using some rough estimates to decide whether a page contains a graph or not.

If there are more than MIN_RECTS rectangles on a page, I assume there's a graph there (with columns that are perceived as rectangles), or if there are more than MIN_CURVES curves, then there's a graph (for me MIN_CURVES was 0, but it depends on whether you have non-trivial shapes in the header or footer). It's not the best, but it works most of the time.

Example code: using both functions and then extract_text() gives pretty good results for me.

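import pdfplumber

MIN_RECTS = 10  # placeholder value; tune it for your documents
MIN_CURVES = 0  # the value that worked for me

page = pdfplumber.open("file.pdf").pages[0]

def contains_graphs(page):
    # Many rectangles (e.g. chart columns) or any curves suggest a graph on the page.
    return len(page.rects) > MIN_RECTS or len(page.curves) > MIN_CURVES

def only_chars_from_page_filter(page):
    # Keep only character objects, dropping lines, rects, curves and images.
    return page.filter(lambda obj: obj["object_type"] == "char")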
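
To show how the two helpers might be combined over a whole document, here is a minimal sketch. The names text_without_graph_pages and extract_from_pdf are made up for illustration, and the MIN_RECTS value of 10 is only an assumed placeholder, not a recommendation. pdfplumber is imported inside the last function so the heuristic itself can be tried on any object that exposes rects and curves.

```python
MIN_RECTS = 10  # assumed placeholder threshold; tune it for your documents
MIN_CURVES = 0  # any curve at all counts as a sign of a graph


def contains_graph(page, min_rects=MIN_RECTS, min_curves=MIN_CURVES):
    """Heuristic: True if the page has more rects or curves than the thresholds."""
    return len(page.rects) > min_rects or len(page.curves) > min_curves


def text_without_graph_pages(pages):
    """Return the text of every page that does not look like it holds a graph."""
    texts = []
    for page in pages:
        if contains_graph(page):
            continue  # skip the whole page, as in the answer above
        # Keep only character objects before extracting text.
        chars_only = page.filter(lambda obj: obj["object_type"] == "char")
        texts.append(chars_only.extract_text() or "")
    return texts


def extract_from_pdf(path):
    """Open a PDF with pdfplumber and apply the heuristic to each page."""
    import pdfplumber  # imported here so the helpers above work without it

    with pdfplumber.open(path) as pdf:
        return text_without_graph_pages(pdf.pages)
```

Calling extract_from_pdf("file.pdf") would then return a list with one string per page that passed the check. Note that this drops the whole page when a graph is suspected, so any regular text on that page is lost as well.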

option 2:

Following @G5W's comment, it is possible to use pywin32 to open the PDF in MS Word and save it as a Word file, then extract only the text, for example with python-docx.
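
A rough sketch of what option 2 could look like, assuming Windows with MS Word installed plus the pywin32 and python-docx packages. I have not verified it against every Word version, and Word may still show a conversion prompt when opening a PDF. The function names pdf_to_docx, body_text and docx_text are made up for illustration.

```python
import os

WD_FORMAT_DOCX = 16  # value of Word's wdFormatDocumentDefault constant


def pdf_to_docx(pdf_path, docx_path):
    """Open a PDF in Word via COM and save it as .docx (Windows + Word only)."""
    import win32com.client  # provided by pywin32

    word = win32com.client.Dispatch("Word.Application")
    word.Visible = False
    try:
        # Word needs absolute paths when driven through COM.
        doc = word.Documents.Open(
            os.path.abspath(pdf_path), ConfirmConversions=False, ReadOnly=True
        )
        doc.SaveAs2(os.path.abspath(docx_path), FileFormat=WD_FORMAT_DOCX)
        doc.Close(False)  # close without saving changes to the source
    finally:
        word.Quit()


def body_text(document):
    """Join the non-empty body paragraphs of a python-docx Document."""
    return "\n".join(p.text for p in document.paragraphs if p.text.strip())


def docx_text(docx_path):
    """Read a .docx file and return its paragraph text."""
    from docx import Document  # provided by python-docx

    return body_text(Document(docx_path))
```

Usage would be pdf_to_docx("file.pdf", "file.docx") followed by docx_text("file.docx"). Document.paragraphs only covers body paragraphs, so text inside tables or text boxes is not included; whether a chart's labels end up outside the body paragraphs depends on how Word converts the PDF.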
