This is an example image from my PDF file, which has 75 pages.
Please read the Code of Conduct: https://stackoverflow.com/conduct on how to ask a question. What have you tried so far? What did you do that went wrong? – jalazbe Jun 08 '20 at 07:53
2 Answers
You can do this with Python and the tabula-py package (imported as tabula). Since the table is borderless, you can first find its area dynamically with my get_area function (adjust the page number etc. to your file):
from tabula import convert_into, convert_into_by_batch, read_pdf
from tabulate import tabulate


def get_area(file):
    """Set and return the area from which to extract data from within a PDF page
    by reading the file as JSON, extracting the locations
    and expanding these.
    """
    tables = read_pdf(file, output_format="json", pages=2, silent=True)
    top = tables[0]["top"]
    left = tables[0]["left"]
    bottom = tables[0]["height"] + top
    right = tables[0]["width"] + left
    # print(f"{top=}\n{left=}\n{bottom=}\n{right=}")
    return [top - 20, left - 20, bottom + 10, right + 10]
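The padding arithmetic in get_area can be tried out without a PDF. Below is a minimal sketch, assuming the table metadata is a dict with the same "top", "left", "width" and "height" keys that get_area reads. The helper name expand_area, its parameters, and the clamping to zero are my own additions for illustration and are not part of tabula.

```python
def expand_area(table, pad_before=20, pad_after=10):
    """Return [top, left, bottom, right] grown by a margin.

    `table` is assumed to be a dict with "top", "left", "width" and
    "height" keys, like the entries get_area reads. `expand_area` is a
    hypothetical helper, not a tabula function.
    """
    top = table["top"]
    left = table["left"]
    bottom = top + table["height"]
    right = left + table["width"]
    # Clamp to zero so a table close to the page edge does not produce
    # negative coordinates (an added safeguard, not in the original).
    return [
        max(top - pad_before, 0),
        max(left - pad_before, 0),
        bottom + pad_after,
        right + pad_after,
    ]


# Made-up coordinates, only to show the calculation.
sample = {"top": 100.0, "left": 15.0, "width": 400.0, "height": 250.0}
print(expand_area(sample))
```

If the extracted table is cut off at the edges, increasing the two padding values is probably the first thing to try.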
Before conversion, check that the format of your first table looks correct:
def inspect_1st_table(file: str):
    df = read_pdf(
        file,
        # output_format="dataframe",
        multiple_tables=True,
        pages="all",
        area=get_area(file),
        silent=True,  # Suppress all stderr output
    )[0]
    print(tabulate(df.head()))
Then, use the area to do your table extraction, from pdf to csv:
def convert_pdf_to_csv(file: str):
    """Output all the tables in the PDF to a CSV"""
    convert_into(
        file,
        file[:-3] + "csv",
        output_format="csv",
        pages="all",
        area=get_area(file),
        silent=True,
    )
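The output name above is built by slicing off the last three characters, which relies on the extension being exactly three characters long. A possible alternative, sketched here with the standard-library pathlib module, swaps whatever the final extension is. The helper name csv_path_for is hypothetical.

```python
from pathlib import Path


def csv_path_for(pdf_file: str) -> str:
    """Return the CSV path next to the given PDF (hypothetical helper)."""
    # with_suffix replaces only the final extension, whatever its length or case.
    return str(Path(pdf_file).with_suffix(".csv"))


print(csv_path_for("report.pdf"))
print(csv_path_for("scan.PDF"))
```

The result could be passed as the second argument to convert_into in place of file[:-3] + "csv".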
In case you need to extract more than 1 table, again start by inspecting them:
def show_tables(file: str):
    """Read pdf into list of DataFrames"""
    tables = read_pdf(
        file, pages="all", multiple_tables=True, area=get_area(file), silent=True
    )
    for df in tables:
        print(tabulate(df))
And to do a batch conversion of all PDF tables to CSV format:
def convert_batch(directory: str):
    """convert all PDFs in a directory"""
    convert_into_by_batch(directory, output_format="csv", pages="all", silent=True)

Gustav Rasmussen
Things to play around with are the options: guess=False, stream=True, and the pages argument. – Gustav Rasmussen Jun 08 '20 at 08:14
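To make the options from this comment concrete, here is a rough sketch that collects them as keyword arguments for read_pdf. The helper names build_read_options and read_borderless are hypothetical, and the chosen values are only a starting point to experiment with, not settings known to work for the asker's file.

```python
def build_read_options(pages="all", area=None):
    """Collect keyword arguments for tabula's read_pdf (hypothetical helper)."""
    options = {
        "pages": pages,   # e.g. "all", 2, "1-3" or [1, 4]
        "stream": True,   # rely on whitespace between columns, not ruling lines
        "guess": False,   # do not auto-detect the table region
        "silent": True,   # suppress stderr output
    }
    if area is not None:
        options["area"] = area  # [top, left, bottom, right]
    return options


def read_borderless(file, **overrides):
    """Read tables with the options above; needs tabula-py and Java installed."""
    # Imported lazily so build_read_options can be tried without tabula.
    from tabula import read_pdf

    return read_pdf(file, **{**build_read_options(), **overrides})


print(build_read_options(pages="1-3"))
```

With guess=False and no area, the whole page is analyzed, so passing an explicit area (for example the one returned by get_area) is likely what keeps surrounding text out of the result.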
Camelot is a great option for extracting borderless tables. You can use the flavor='stream' option for extraction.
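import camelot

# Tables from all your pages will be stored in the tables object
tables = camelot.read_pdf('sample.pdf', flavor='stream', edge_tol=500, pages='1-end')
df = tables[0].df  # the first table as a pandas DataFrame
csv_text = df.to_csv()  # pass a file path to write to disk instead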

pykam
Could you please elaborate on how I can achieve this goal? It fetches other text data as well. Thank you so much for your support. – DataEngineer_Developer Jun 08 '20 at 15:25
If it fetches extra data, you can do some data cleaning steps on the data frame that is returned. None of the text extractors are 100% accurate. – pykam Jun 09 '20 at 08:42
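As one illustration of such a cleaning step, the sketch below filters the rows of an exported CSV using only the standard library. It assumes that stray text (titles, page footers) tends to land in a single cell while real table rows fill several. The function name clean_rows, the min_filled threshold, and the sample data are all made up for this example, and the same idea could be applied to the data frame directly.

```python
import csv
import io


def clean_rows(rows, min_filled=2):
    """Drop rows that look like stray text rather than table data.

    A row is kept only if at least `min_filled` of its cells are
    non-empty after stripping whitespace. The threshold is an
    assumption and will need tuning per document.
    """
    cleaned = []
    for row in rows:
        cells = [cell.strip() for cell in row]
        if sum(1 for cell in cells if cell) >= min_filled:
            cleaned.append(cells)
    return cleaned


# Made-up sample standing in for an exported CSV.
raw = (
    "Quarterly report,,\n"
    "Name, Qty ,Price\n"
    "Bolt,10,0.25\n"
    ",,\n"
    "Page 3 of 75,,\n"
    "Nut,4,0.10\n"
)
rows = clean_rows(csv.reader(io.StringIO(raw)))
for row in rows:
    print(row)
```

Rows with only one filled cell are discarded here, so a table that legitimately has sparse rows would need a different rule.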