
pdf link

I have been trying to use the Camelot library to capture a table (one that isn't really formatted as a table) by setting the flavor parameter to 'stream'. However, it does not detect the entire table. So I decided to try capturing the entire page instead, by passing an area parameter built from the page's dimensions.

I have tried using this code but it still does not give me the whole page dimensions.

import camelot
from matplotlib import pyplot as plt
import pandas as pd
import PyPDF2

pdf_file = open(r'C:\Users\PC\PycharmProjects\finstate.pdf', 'rb')
pdf_reader = PyPDF2.PdfFileReader(pdf_file)
page = pdf_reader.getPage(10)
width = page.mediaBox.getWidth()
height = page.mediaBox.getHeight()
print("Width:", width)
print("Height:", height)

page_area = [0, 0, 0, 0]
pdf = camelot.read_pdf(r'C:\Users\PC\PycharmProjects\finstate.pdf', pages='0-10', flavor='stream', area=page_area)
first_table = pdf[10]

print(first_table.df)
first_table.to_csv(r'C:\Users\PC\Desktop\table.csv')
Jagwire
  • Can you provide the PDF? The ***stream*** parsing method should be used to extract tables without borders. Does your table have no borders between its cells and sufficient margin between the contents of the cells? If not, you can try ***lattice*** as the parsing method to extract the table better. – Said Akyuz Jan 31 '23 at 13:35
  • Thank you for replying back! I have edited my post in the beginning to include the file. – Jagwire Feb 03 '23 at 02:57
  • I guess I put the wrong link, but either of the 2 reports works, so the page is irrelevant. Any page with a table in it causes me problems. I am not an expert at Python so I did not really understand your explanation, but thanks for your help! – Jagwire Feb 04 '23 at 01:39
  • Thanks for sharing the PDF. As far as I can see, your flavor choice is fine for your tables. So forget about my comment, and I will try to write an answer which could be the solution. – Said Akyuz Feb 07 '23 at 08:21

1 Answer


To improve the detected area, you can increase the edge_tol (default: 50) value to counter the effect of text being placed relatively far apart vertically. Larger edge_tol will lead to longer text edges being detected, leading to an improved guess of the table area. Let’s use a value of 500.

You can try the following code. If it doesn't work, experiment with other edge_tol values:

# Camelot page numbers are 1-based, so the range starts at 1 rather than 0.
tables = camelot.read_pdf(r'C:\Users\PC\PycharmProjects\finstate.pdf', pages='1-10', flavor='stream', edge_tol=500)
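If a single value does not settle it, one way to experiment is to loop over a few edge_tol values and compare the shape of what Camelot returns for each. This is only a sketch: sweep_edge_tol and its reader argument are hypothetical names introduced here (they are not part of Camelot), and the candidate values are arbitrary.

```python
def sweep_edge_tol(path, pages, values=(50, 100, 250, 500), reader=None):
    """Run stream parsing once per edge_tol value and collect table shapes.

    `reader` is a hypothetical hook so the loop can be tried without a PDF;
    by default it falls back to camelot.read_pdf.
    """
    if reader is None:
        import camelot  # imported lazily so the helper loads without Camelot
        reader = camelot.read_pdf

    shapes = {}
    for tol in values:
        tables = reader(path, pages=pages, flavor="stream", edge_tol=tol)
        # Each Camelot table exposes its data as a pandas DataFrame in .df
        shapes[tol] = [table.df.shape for table in tables]
    return shapes
```

The result maps each edge_tol to a list of (rows, columns) tuples, one per detected table. A value beyond which the row count stops growing is probably large enough for that page.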

The following snippet can help you see how your table was detected:

camelot.plot(tables[0], kind='contour').show()
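As for the original idea of passing the whole page as the area: as far as I know, Camelot has no area argument. The documented one is table_areas, a list of "x1,y1,x2,y2" strings where (x1, y1) is the top-left corner and (x2, y2) is the bottom-right corner, in PDF coordinates with the origin at the bottom-left of the page. Here is a minimal sketch, assuming the page's media box starts at (0, 0); full_page_area and read_full_page are hypothetical helper names, not Camelot functions.

```python
def full_page_area(width, height):
    """Build a Camelot table_areas string covering the whole page.

    Assumes the media box starts at (0, 0). Top-left is (0, height) and
    bottom-right is (width, 0) because PDF y coordinates grow upwards.
    """
    return f"0,{float(height):.2f},{float(width):.2f},0"


def read_full_page(path, page, width, height):
    """Hypothetical wrapper: stream-parse one page using the full-page area."""
    import camelot  # imported lazily so full_page_area works without Camelot

    return camelot.read_pdf(
        path,
        pages=str(page),  # Camelot page numbers are 1-based
        flavor="stream",
        table_areas=[full_page_area(width, height)],
    )


# A US Letter page is 612 x 792 points.
print(full_page_area(612, 792))  # 0,792.00,612.00,0
```

The width and height printed by the PyPDF2 code in the question can be passed straight in. Note that getPage(10) is the eleventh page of the file, which Camelot would address as pages='11'.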
Said Akyuz
  • Hey, thank you for replying, Said. I am kind of stuck: why doesn't this code work? import camelot from matplotlib import pyplot as plt import pandas as pd import PyPDF2 pdf_file = open(r'C:\Users\PC\PycharmProjects\finstate.pdf', 'rb') pdf_reader = PyPDF2.PdfFileReader(pdf_file) pdf = camelot.read_pdf(r'C:\Users\PC\PycharmProjects\finstate.pdf', pages='0-10', flavor='stream', edge_tol=500) camelot.plot(pdf, kind='contour').show() – Jagwire Feb 08 '23 at 02:54
  • What is the error message? – Said Akyuz Feb 08 '23 at 08:15
  • Pls, provide code surrounded in backticks "```" – Said Akyuz Feb 08 '23 at 08:16