
I need to extract all the content of a PDF into lists or arrays, but pdfplumber raises an error with the code below and I have not found a way to solve it.

import pdfplumber


def check_bboxes(word, table_bbox):
    """
    Check whether word is inside a table bbox.
    """
    l = word['x0'], word['top'], word['x1'], word['bottom']
    r = table_bbox
    return l[0] > r[0] and l[1] > r[1] and l[2] < r[2] and l[3] < r[3]


# `file` is the path to the PDF being read
with pdfplumber.open(file) as temp:
    page = temp.pages[3]
    tables = page.find_tables()
    table_bboxes = [i.bbox for i in tables]
    # Keep each table's top coordinate so it can be ordered with the words
    tables = [{'table': i.extract(), 'doctop': i.bbox[1]} for i in tables]
    # Words that do not fall inside any table
    non_table_words = [word for word in page.extract_words() if not any(
        [check_bboxes(word, table_bbox) for table_bbox in table_bboxes])]
    lines = []
    # Group words and tables by vertical position; this call raises the error
    for cluster in pdfplumber.utils.cluster_objects(non_table_words+tables, 'doctop', tolerance=5):
        if 'text' in cluster[0]:
            lines.append(' '.join([i['text'] for i in cluster]))
        elif 'table' in cluster[0]:
            lines.append(cluster[0]['table'])


  • The error originates from `pdfplumber.utils.cluster_objects(non_table_words+tables, 'doctop', tolerance=5)`: `cluster_objects` expects `(xs: List[R], key_fn: Union[Hashable, Callable[[R], T_num]], tolerance: T_num)`. Specifically, `key_fn` is meant to be a function, but you may not be passing one in. – Andrew Ryan Oct 03 '22 at 05:43
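
The comment's suggestion can be tried without a PDF. The sketch below is a minimal, hypothetical stand-in for the clustering step: `cluster_by_key` is not part of pdfplumber, and the sample words and table are made up. It accepts either a key name or a callable, as the quoted signature suggests, and is called here with an explicit function for `doctop`.

```python
from operator import itemgetter


def cluster_by_key(objs, key_fn, tolerance):
    """Group objs whose key values lie within `tolerance` of the previous one.

    Hypothetical stand-in, not pdfplumber's implementation: key_fn may be
    a dict key or a callable, mirroring the signature quoted in the comment.
    """
    if not callable(key_fn):
        key_fn = itemgetter(key_fn)
    clusters = []
    for obj in sorted(objs, key=key_fn):
        # Extend the current cluster while the gap to its last member is small
        if clusters and key_fn(obj) - key_fn(clusters[-1][-1]) <= tolerance:
            clusters[-1].append(obj)
        else:
            clusters.append([obj])
    return clusters


# Made-up data shaped like the question's words and table dicts
words = [
    {"text": "Hello", "doctop": 100.0},
    {"text": "world", "doctop": 101.5},
    {"text": "Total", "doctop": 240.0},
]
tables = [{"table": [["a", "b"]], "doctop": 180.0}]

# Pass an explicit function instead of the string 'doctop'
clusters = cluster_by_key(words + tables, lambda obj: obj["doctop"], tolerance=5)

lines = []
for cluster in clusters:
    if "text" in cluster[0]:
        lines.append(" ".join(w["text"] for w in cluster))
    elif "table" in cluster[0]:
        lines.append(cluster[0]["table"])
```

If the same pattern carries over to the real call, the change would be passing `lambda obj: obj['doctop']` in place of `'doctop'`. Whether that resolves the error depends on the installed pdfplumber version, which the question does not state.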
