I'm trying to extract all the data from PDFs.

I also want to identify bullet points in the PDF. However, when I manually copy the bullet points from the PDF and paste them somewhere else, each one is pasted as the string (image.png).

I tried extracting all the images, but the bullets are not among them. I also tried to find them in the text, but they are not there either.

Has anyone faced this problem?

An example of the part of the PDF I can't extract: pdf_image

I just want those bullet points to show up in the extracted text.

I tried taking the raw data from the extracted PDF and searching for "ext": "png". I tried saving all the images and looking at them one by one. I also tried searching the text, but I couldn't find those bullets.

Santiago
  • I'm extracting the PDF as plain text (the text part) and the images as bytestrings. The problem is that when I print the lines, none of them contain the bullets. Also, when I render and show the images with the PIL module, none of them are bullets. The funny thing is that when I try to select a bullet point manually in the PDF, it can't be selected, but if I copy and paste anyway (even with no visible selection), it pastes (image.png). – Santiago Mar 15 '23 at 14:34
  • In some PDFs, bullet points are not text at all, but they are not "conventional" images (identified via an xref) either. They often are vector graphics (drawn as a filled circle), or they may be inline images of the page (= only contained / known inside the page's `/Contents`). You can extract both kinds of objects with PyMuPDF, but you have some way to go to associate a result with its respective text. – Jorj McKie Mar 15 '23 at 14:41
  • Thanks @JorjMcKie, you're on the right track. That's exactly what I'm searching for. So now I need to try to find those contents with PyMuPDF. – Santiago Mar 15 '23 at 15:20
  • Also, thanks for your answers @KJ, I just can't provide the PDF because of the information in it D: – Santiago Mar 15 '23 at 15:22
  • Is there a way I can extract the numbers instead of text with PyMuPDF? – Santiago Mar 15 '23 at 15:22
  • @Santiago to extract **inline images**, you can simply use `page.get_text("dict")` like you would for text itself. Images (inline ones and others) occur as **image blocks** in this output. To detect bullets of this image category, select the image blocks whose bbox is in the size range of the text font size(s). To detect / extract bullets that are vector graphics, use the output of `page.get_drawings()`. Again, select those paths (dict items in the list returned by that method) whose rectangle is in the range of the font size. – Jorj McKie Mar 15 '23 at 17:09
  • @KJ you lost me with the "numbers": what are they? Glyph numbers? – Jorj McKie Mar 15 '23 at 17:16
  • @JorjMcKie YOU'RE AWESOME MAN!!! Thank you very much, `get_drawings` solved my problem. Write an answer so I can accept it as the correct answer. – Santiago Mar 16 '23 at 18:38
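
The inline-image route from the comments could look roughly like the sketch below. It assumes the usual layout of `page.get_text("dict")` (a `"blocks"` list where `"type"` 1 marks an image block and 0 a text block, each with a `"bbox"`). The helper names `small_image_blocks`, `max_span_size` and `image_bullets_on_page` are made up for illustration, and using the largest span size as the threshold is only one possible choice.

```python
def max_span_size(page_dict):
    """Largest font size among the text spans of the page (0 if there is no text)."""
    sizes = [
        span["size"]
        for block in page_dict["blocks"]
        if block["type"] == 0  # text block
        for line in block["lines"]
        for span in line["spans"]
    ]
    return max(sizes, default=0)


def small_image_blocks(page_dict, fontsize):
    """Image blocks whose bbox is at most `fontsize` wide and high."""
    found = []
    for block in page_dict["blocks"]:
        if block["type"] != 1:  # skip everything that is not an image block
            continue
        x0, y0, x1, y1 = block["bbox"]
        if (x1 - x0) <= fontsize and (y1 - y0) <= fontsize:
            found.append(block)
    return found


def image_bullets_on_page(pdf_path, pno=0):
    """Hypothetical convenience wrapper: bullet-sized image blocks on one page."""
    import fitz  # PyMuPDF; imported here so the helpers above work without it

    doc = fitz.open(pdf_path)
    page_dict = doc[pno].get_text("dict")
    return small_image_blocks(page_dict, max_span_size(page_dict))
```

If this returns nothing for a page that visibly has bullets, the bullets are probably vector graphics and `page.get_drawings()` is the place to look.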

1 Answer

Example snippet for picking out the small vector graphic items on a page that are used as bullet points:

import fitz  # the PyMuPDF package

doc = fitz.open("input.pdf")
pno = 0  # 0-based page number (example value)
fontsize = 11  # font size of the surrounding text (example value, adjust to your PDF)
page = doc[pno]  # page with number 'pno'
paths = page.get_drawings()  # vector graphics on page
bullets = []  # bullet point graphics
for path in paths:
    rect = path["rect"]  # rectangle containing the graphic
    # keep the graphic if neither its width nor its height exceeds the font size
    if rect.width <= fontsize and rect.height <= fontsize:
        bullets.append(path)

More detailed filtering is possible where relevant, for example:

  • check for graphics having a fill color: `path["type"] in ("f", "fs")`
  • check if the fill color is black: `path["fill"] == (0, 0, 0)` (all colors are standardized to the RGB colorspace)
  • check if the wrapping rectangle is a square
  • (approximate) check if the item is a circle bullet by ensuring it consists of 4 connected curves
  • etc.
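
A minimal sketch of how those checks might be combined. `looks_like_round_bullet` and its `tol` and `fill` parameters are hypothetical names, and the sketch assumes that curve items in `path["items"]` have the form `("c", p1, p2, p3, p4)` with `p1` the start point and `p4` the end point. Rectangles and points are only unpacked or indexed, so plain tuples work as well as PyMuPDF's own objects.

```python
def looks_like_round_bullet(path, fontsize, tol=0.5, fill=(0, 0, 0)):
    """Heuristic: small, filled, roughly square path made of 4 connected curves.

    Pass fill=None to accept any fill color.
    """
    # must be a filled graphic
    if path["type"] not in ("f", "fs"):
        return False
    # optional check for the fill color (RGB components in the range 0..1)
    if fill is not None and path.get("fill") != fill:
        return False
    # wrapping rectangle must fit within the font size and be (nearly) square
    x0, y0, x1, y1 = path["rect"]
    width, height = x1 - x0, y1 - y0
    if width > fontsize or height > fontsize:
        return False
    if abs(width - height) > tol:
        return False
    # a circle is usually drawn as 4 Bezier curves
    items = path["items"]
    if len(items) != 4 or any(item[0] != "c" for item in items):
        return False
    # each curve must end where the next one starts (cyclically)
    for i, item in enumerate(items):
        end = item[-1]
        start = items[(i + 1) % 4][1]
        if abs(end[0] - start[0]) > tol or abs(end[1] - start[1]) > tol:
            return False
    return True
```

With the snippet above, this could be applied as `bullets = [p for p in paths if looks_like_round_bullet(p, fontsize)]`. Square or dash-shaped bullets would need a different item check.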

To find the bullet for the bbox of a text, look for an item in `bullets` having `item["rect"].y0 >= bbox.y0 and item["rect"].y1 <= bbox.y1 and item["rect"].x1 <= bbox.x0`.

This means the item lies to the left of `bbox` and fits within the line's "stripe".
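
To get the bullets back into the extracted text, the condition above can be applied to every text line of `page.get_text("dict")`. The following is only a sketch: `bullet_for_bbox`, `lines_with_bullets` and the `marker` parameter are invented names, and picking the nearest candidate to the left is an assumption on my part, not something PyMuPDF does for you.

```python
def bullet_for_bbox(bullets, bbox):
    """Return the bullet left of `bbox` that fits in its stripe, or None."""
    bx0, by0, bx1, by1 = bbox
    candidates = []
    for item in bullets:
        x0, y0, x1, y1 = item["rect"]
        # inside the line's vertical stripe and entirely to its left
        if y0 >= by0 and y1 <= by1 and x1 <= bx0:
            candidates.append((x1, item))
    if not candidates:
        return None
    # several candidates: take the one closest to the text
    return max(candidates, key=lambda c: c[0])[1]


def lines_with_bullets(page_dict, bullets, marker="• "):
    """Text lines of the page, prefixed with `marker` where a bullet matches."""
    out = []
    for block in page_dict["blocks"]:
        if block["type"] != 0:  # only text blocks have lines
            continue
        for line in block["lines"]:
            text = "".join(span["text"] for span in line["spans"])
            has_bullet = bullet_for_bbox(bullets, line["bbox"]) is not None
            out.append((marker if has_bullet else "") + text)
    return out
```

Usage would be something like `lines_with_bullets(page.get_text("dict"), bullets)`. If the bullet graphic sticks out slightly above or below the line's bbox, the strict comparison will miss it, so a small tolerance may be needed in practice.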

Jorj McKie