
I have a PDF (attached) in which a table row is split across two pages by a page break. I am trying to extract the tabular data to a CSV using pdfplumber, but the split row comes out as separate rows in the CSV. I would like to get this data as a single row.

With pdfplumber, is there a way to identify if the row has a horizontal border or not? If this information is available, it could help in merging the rows.

In the attached image, the cell contents are shaded grey.


jsanjayce
  • Nothing is failing. The issue is just that, when I extract the tabular data to a CSV, the table on each page is treated separately, so I get 2 separate rows even though they belong to a single row. With post-processing we can merge both records into a single record, but I am interested in a more generic solution that can also handle the case where the row isn't split across pages. – jsanjayce Nov 28 '22 at 11:03
  • Got it. Thanks for the details. Is there a way to identify whether the first part of the row on the first page has a bottom edge, and similarly whether the next part of the row on the next page has a top edge? If we could identify that, it would also indicate that both parts belong to a single row. Appreciate your help. Thanks – jsanjayce Nov 28 '22 at 12:41

1 Answer


pdfplumber objects have a top attribute (the distance of the top of the object from the top of the page). You can use it to check whether one page ends without a border and the next page starts without one.

If the top value of the bottommost character is greater than the top value of the bottommost horizontal line, the character sits below the last line, so the page ends without a table border. Similarly, if the top value of the topmost character is smaller than the top value of the topmost horizontal line, the character sits above the first line, so the page starts without a table border. If both conditions hold for two consecutive pages, the last row of the first page and the first row of the next page are parts of the same row and can be merged.
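The comparison above could be sketched roughly as follows. This is a minimal sketch, not a tested solution for the attached PDF. It assumes the table borders are drawn as line objects (page.lines); borders drawn as rectangles would need page.rects or page.edges instead. It also assumes the pages carry no header or footer text outside the table, which would otherwise have to be cropped away first. The helper names and the tol parameter are made up for illustration.

```python
def horizontal_line_tops(lines, tol=1.0):
    """Return the 'top' of every (nearly) horizontal line object."""
    # A horizontal line has the same top and bottom; tol is an assumed tolerance.
    return [ln["top"] for ln in lines if abs(ln["bottom"] - ln["top"]) <= tol]


def visible_char_tops(chars):
    """Return the 'top' of every non-whitespace character."""
    return [c["top"] for c in chars if c["text"].strip()]


def ends_without_border(chars, lines):
    """True if the bottommost character lies below the bottommost horizontal line."""
    char_tops = visible_char_tops(chars)
    line_tops = horizontal_line_tops(lines)
    if not char_tops or not line_tops:
        return False  # nothing to compare, so make no claim
    return max(char_tops) > max(line_tops)


def starts_without_border(chars, lines):
    """True if the topmost character lies above the topmost horizontal line."""
    char_tops = visible_char_tops(chars)
    line_tops = horizontal_line_tops(lines)
    if not char_tops or not line_tops:
        return False
    return min(char_tops) < min(line_tops)


def row_continues(prev_page, next_page):
    """True if the last row of prev_page appears to continue on next_page."""
    return (ends_without_border(prev_page.chars, prev_page.lines)
            and starts_without_border(next_page.chars, next_page.lines))


def find_split_rows(pdf_path):
    """Return indexes i where a row seems split between page i and page i + 1."""
    import pdfplumber  # third-party; imported here so the helpers above work without it

    with pdfplumber.open(pdf_path) as pdf:
        pages = pdf.pages
        return [i for i in range(len(pages) - 1)
                if row_continues(pages[i], pages[i + 1])]
```

For each index returned by find_split_rows, the last row extracted from that page and the first row extracted from the following page would be merged cell by cell before writing the CSV.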

Samkit Jain