0

I've been using the tabula-py, PyPDF2 and tika modules, but none of them seems to detect the background color of a table cell in a PDF file.

These colored cells carry important information in the context of my problem. I know, for example, that tabula-py is a wrapper around tabula-java, which does not provide colored-cell information. Is there an easy-to-follow solution in Python out there?

Thanks in advance.

2 Answers

1

disclaimer: I am the author of the library borb used in this answer

about PDF: PDF is not so much a "what you see is what you get" format as it is a container for rendering instructions. That means a table is in fact just a collection of rendering instructions that draw something we humans interpret as a table. Something like:

  • go to location x, y
  • set the current stroke colour to black
  • set the current fill colour to blue
  • set the font to Helvetica, size 12
  • draw a line to x, y
  • move the pen up
  • go to x, y
  • render the string "Hello World"
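
To make the list above concrete, here is a minimal sketch of how a reader could walk such instructions and remember which fill colour was active when each rectangle was painted. It is not borb's implementation. The function name `filled_rectangles` is hypothetical, and the parser assumes a simplified, whitespace-separated content stream that uses only the standard PDF operators `g`, `rg`, `k` (set fill colour), `re` (add rectangle) and the path-painting operators.

```python
from typing import List, Tuple

Color = Tuple[float, ...]                 # 1 (gray), 3 (RGB) or 4 (CMYK) components
Rect = Tuple[float, float, float, float]  # x, y, width, height as given to "re"

FILL_OPERATORS = {"f", "F", "f*", "B", "B*", "b", "b*"}
DISCARD_OPERATORS = {"S", "s", "n"}       # stroke only, or end the path unpainted


def filled_rectangles(content_stream: str) -> List[Tuple[Rect, Color]]:
    """Return (rectangle, fill colour) pairs from a simplified content stream."""
    fill_color: Color = (0.0,)   # the initial fill colour in PDF is black (gray 0)
    operands: List[float] = []   # numbers seen since the last operator
    pending: List[Rect] = []     # rectangles in the current, unpainted path
    filled: List[Tuple[Rect, Color]] = []

    for token in content_stream.split():
        try:
            operands.append(float(token))
            continue             # a number is an operand, keep collecting
        except ValueError:
            pass                 # not a number, so treat it as an operator

        if token in ("g", "rg", "k"):
            fill_color = tuple(operands)
        elif token == "re":
            pending.append(tuple(operands[-4:]))
        elif token in FILL_OPERATORS:
            filled.extend((rect, fill_color) for rect in pending)
            pending = []
        elif token in DISCARD_OPERATORS:
            pending = []
        operands = []            # every operator consumes its operands

    return filled


stream = "0 0 1 rg 100 700 50 20 re f 1 g 150 700 50 20 re f"
for rect, color in filled_rectangles(stream):
    print(rect, color)
```

A real content stream also contains strings, names, arrays and a graphics-state stack (`q`/`Q`), so this only illustrates the idea that colour is state set by earlier instructions, not a property stored on the table cell.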

Whenever a PDF library extracts tables from a PDF, it's important to keep in mind that this is a heuristic. It's based on assumptions such as "tables tend to have a large number of lines that intersect at 90-degree angles".
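
As an illustration of that kind of assumption, the sketch below counts how often horizontal and vertical line segments meet. It is only a toy version of the idea and not the algorithm any particular library uses. The name `count_right_angle_crossings` and the tolerance `tol` are made up for this example.

```python
from typing import Iterable, Tuple

Segment = Tuple[float, float, float, float]  # x0, y0, x1, y1


def count_right_angle_crossings(segments: Iterable[Segment], tol: float = 0.5) -> int:
    """Count the points where a horizontal segment meets a vertical one."""
    segments = list(segments)
    horizontals = [s for s in segments if abs(s[1] - s[3]) <= tol]
    verticals = [s for s in segments if abs(s[0] - s[2]) <= tol]

    count = 0
    for hx0, hy, hx1, _ in horizontals:
        for vx, vy0, _, vy1 in verticals:
            # the vertical's x must lie within the horizontal's x-range
            within_x = min(hx0, hx1) - tol <= vx <= max(hx0, hx1) + tol
            # the horizontal's y must lie within the vertical's y-range
            within_y = min(vy0, vy1) - tol <= hy <= max(vy0, vy1) + tol
            if within_x and within_y:
                count += 1
    return count


# A 2x2 grid: three horizontal and three vertical rules.
grid = [(0, y, 20, y) for y in (0, 10, 20)] + [(x, 0, x, 20) for x in (0, 10, 20)]
print(count_right_angle_crossings(grid))
```

A page region with many such crossings is a candidate table. A region with few is probably decoration, which is why borderless tables are hard for line-based detection.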

I suggest you have a look at TableDetectionByLines in borb. It's a class that gathers the aforementioned rendering instructions and spits out the locations of tables in the PDF document.

You would use it like this:

import typing

from borb.pdf.canvas.layout.table.table import Table, TableCell
from borb.pdf.document.document import Document
from borb.pdf.page.page import Page
from borb.pdf.pdf import PDF
from borb.toolkit.table.table_detection_by_lines import TableDetectionByLines

# input_file is the path to the PDF you want to inspect
doc: typing.Optional[Document] = None
with open(input_file, "rb") as input_pdf_handle:
    # the listener receives the rendering instructions while the PDF is parsed
    l: TableDetectionByLines = TableDetectionByLines()
    doc = PDF.loads(input_pdf_handle, [l])

assert doc is not None
# tables detected on the first page (page index 0)
tables: typing.List[Table] = l.get_tables_for_page(0)

As it stands, this class does not track the stroke/fill colour. But you can easily subclass it, and modify it so it does.

For this, I would start at this particular line.

Joris Schellekens
0

I found a solution using pdfplumber. Here is some rough sample code.

from typing import Optional

import pdfplumber
from pdfplumber.page import Page, Table


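def cmyk_to_rgb(cmyk: tuple[float, float, float, float]) -> tuple[float, float, float]:
    # DeviceCMYK -> DeviceRGB; min() keeps a channel from going negative when c + k > 1
    c, m, y, k = cmyk
    r = 255 * (1.0 - min(1.0, c + k))
    g = 255 * (1.0 - min(1.0, m + k))
    b = 255 * (1.0 - min(1.0, y + k))
    return r, g, b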


def to_bbox(rect: dict) -> tuple[float, float, float, float]:
    return (rect["x0"], rect["top"], rect["x1"], rect["bottom"])


def is_included(cell_box: tuple[float, float, float, float], rect_box: tuple[float, float, float, float]):
    c_left, c_top, c_right, c_bottom = cell_box
    r_left, r_top, r_right, r_bottom = rect_box
    return c_left >= r_left and c_top >= r_top and c_right <= r_right and c_bottom <= r_bottom


def find_rect_for_cell(cell: tuple[float, float, float, float], rects: list[dict]) -> Optional[dict]:
    return next((r for r in rects if is_included(cell, to_bbox(r))), None)


def get_cell_color(cell: tuple[float, float, float, float], page: Page) -> tuple[float, float, float]:
    rect = find_rect_for_cell(cell, page.rects) if cell else None
    return cmyk_to_rgb(rect["non_stroking_color"]) if rect else (255, 255, 255)


with pdfplumber.open("/path/to/target.pdf") as pdf:
    page = pdf.pages[0]
    tables: list[Table] = page.find_tables()

    # get the RGB color of the first (= top-left) cell of the first table
    print(get_cell_color(tables[0].rows[0].cells[0], page))  # => (r, g, b)
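
One caveat: the code above treats every fill colour as CMYK, but a rectangle's `non_stroking_color` may instead be grayscale or RGB, depending on how the PDF was produced. Below is a hedged sketch of a converter that picks the formula from the number of components. The name `color_to_rgb` is hypothetical, and it assumes that components are in the 0 to 1 range and that a missing colour should be treated as white. Pattern and other special colour spaces are not handled.

```python
from typing import Sequence, Tuple, Union

RGB = Tuple[float, float, float]


def color_to_rgb(color: Union[None, float, Sequence[float]]) -> RGB:
    """Convert a gray, RGB or CMYK colour (components in 0..1) to 0..255 RGB."""
    if color is None:
        return (255.0, 255.0, 255.0)       # assumption: no fill means white
    if isinstance(color, (int, float)):
        color = (color,)                   # a bare number is a gray level
    comps = [float(c) for c in color]

    if len(comps) == 1:                    # grayscale
        gray = 255 * comps[0]
        return (gray, gray, gray)
    if len(comps) == 3:                    # RGB
        r, g, b = comps
        return (255 * r, 255 * g, 255 * b)
    if len(comps) == 4:                    # CMYK
        c, m, y, k = comps
        return (
            255 * (1.0 - min(1.0, c + k)),
            255 * (1.0 - min(1.0, m + k)),
            255 * (1.0 - min(1.0, y + k)),
        )
    raise ValueError(f"unsupported colour with {len(comps)} components: {color!r}")


print(color_to_rgb((0, 0, 1)))       # an RGB fill
print(color_to_rgb((0, 0, 0, 1)))    # a CMYK fill
```

In `get_cell_color`, calling `color_to_rgb(rect["non_stroking_color"])` in place of `cmyk_to_rgb(...)` would cover these cases.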
toshi