Is there any solution to extract a borderless table from PDF to CSV?

Question

Sample table

This is an example image from my PDF file, which has 75 pages.

Please read the Code of Conduct: https://stackoverflow.com/conduct on how to ask a question. What have you tried so far? What did you do that went wrong? — jalazbe, Jun 08 '20 at 07:53

Gustav Rasmussen · Answer 1 · 2020-06-08T08:08:28.193

You can do this with Python and the tabula module (tabula-py). Since the table is borderless, you can first find its area dynamically with my get_area function (adjust the page number etc. as needed):

from tabula import convert_into, convert_into_by_batch, read_pdf
from tabulate import tabulate

def get_area(file):
    """Set and return the area from which to extract data from within a PDF page
    by reading the file as JSON, extracting the locations
    and expanding these.
    """
    # Detect the table on a single page (here page 2; adjust for your file)
    tables = read_pdf(file, output_format="json", pages=2, silent=True)
    top = tables[0]["top"]
    left = tables[0]["left"]
    bottom = tables[0]["height"] + top
    right = tables[0]["width"] + left
    # print(f"{top=}\n{left=}\n{bottom=}\n{right=}")
    return [top - 20, left - 20, bottom + 10, right + 10]
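
One thing to watch: if the detected table sits close to the page edge, `top - 20` or `left - 20` can go negative. A minimal sketch of the same expansion as a pure function, clamped at zero, is shown here. The name `padded_area` and the sample numbers are hypothetical; the only assumption about tabula is that each JSON table entry carries "top", "left", "width" and "height", as used in get_area above.

```python
def padded_area(table, pad_before=20, pad_after=10):
    """Return [top, left, bottom, right] expanded by some padding.

    `table` is one entry of the JSON output used in get_area, i.e. a
    dict with "top", "left", "width" and "height" keys.
    """
    # Expand upwards and to the left, but never past the page origin
    top = max(table["top"] - pad_before, 0)
    left = max(table["left"] - pad_before, 0)
    # Expand downwards and to the right
    bottom = table["top"] + table["height"] + pad_after
    right = table["left"] + table["width"] + pad_after
    return [top, left, bottom, right]


# Hypothetical detection result for a table near the top of the page
sample = {"top": 10.0, "left": 50.0, "width": 400.0, "height": 300.0}
print(padded_area(sample))
```

Inside get_area you could then return `padded_area(tables[0])` instead of building the list by hand.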

Before conversion, check that the format of your first table looks correct:

def inspect_1st_table(file: str):
    """Print the first rows of the first table found in the PDF"""
    df = read_pdf(
        file,
        # output_format="dataframe",
        multiple_tables=True,
        pages="all",
        area=get_area(file),
        silent=True,  # Suppress all stderr output
    )[0]
    print(tabulate(df.head()))

Then use the area to extract the tables from PDF to CSV:

def convert_pdf_to_csv(file: str):
    """Output all the tables in the PDF to a CSV"""
    convert_into(
        file,
        file[:-3] + "csv",
        output_format="csv",
        pages="all",
        area=get_area(file),
        silent=True,
    )
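
Note that `file[:-3] + "csv"` assumes the file name ends in a three-character extension. A small sketch of an alternative using pathlib from the standard library follows; the helper name `csv_path_for` is hypothetical.

```python
from pathlib import Path


def csv_path_for(pdf_file: str) -> str:
    """Return the path of pdf_file with its extension swapped for .csv"""
    # with_suffix replaces only the last extension, whatever its length or case
    return str(Path(pdf_file).with_suffix(".csv"))


print(csv_path_for("report.pdf"))
print(csv_path_for("scan.PDF"))
```

The result can be passed as the second argument of convert_into in place of the slicing expression.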

In case you need to extract more than one table, again start by inspecting them:

def show_tables(file: str):
    """Read pdf into list of DataFrames"""
    tables = read_pdf(
        file, pages="all", multiple_tables=True, area=get_area(file), silent=True
    )
    for df in tables:
        print(tabulate(df))

And to do a batch conversion of all PDF tables to CSV format:

def convert_batch(directory: str):
    """convert all PDFs in a directory"""
    convert_into_by_batch(directory, output_format="csv", pages="all", silent=True)

Things to play around with are the options: guess=False, stream=True, and the pages argument. — Gustav Rasmussen, Jun 08 '20 at 08:14
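
A rough sketch of how those options could be combined. The helper names `stream_options` and `read_borderless` are hypothetical and not part of tabula-py; only `read_pdf` and its `pages`, `stream`, `guess`, `area`, `silent` and `multiple_tables` arguments come from the library. The tabula import is kept inside the function because it needs tabula-py and a Java runtime to be installed.

```python
def stream_options(pages="all", area=None):
    """Build keyword arguments for read_pdf / convert_into."""
    options = {
        "pages": pages,   # e.g. 2, "1-3" or "all"
        "stream": True,   # force stream mode, meant for tables without ruling lines
        "guess": False,   # do not let tabula guess the table region itself
        "silent": True,   # suppress stderr output
    }
    if area is not None:
        # [top, left, bottom, right], e.g. the list returned by get_area
        options["area"] = area
    return options


def read_borderless(file, pages="all", area=None):
    """Read tables from a PDF using the options above."""
    from tabula import read_pdf  # lazy import: requires tabula-py and Java

    return read_pdf(file, multiple_tables=True, **stream_options(pages, area))


print(stream_options(pages="1-3"))
```

Trying a small page range first, for example `read_borderless("sample.pdf", pages="1-3")`, keeps the feedback loop short on a long document before switching to `pages="all"`.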

score 0 · Answer 2 · answered Jun 08 '20 at 08:20

Camelot is a great option for extracting borderless tables. You can use the flavor='stream' option for extraction.

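import camelot

tables = camelot.read_pdf('sample.pdf', flavor='stream', edge_tol=500, pages='1-end')

# Tables from all your pages are stored in the tables object
df = tables[0].df  # first table as a pandas DataFrame

# Write the table to CSV (the output file name is just an example);
# camelot's columns are numbered, so skip the header and index
df.to_csv('sample.csv', index=False, header=False)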

pykam

Could you please elaborate on how I can achieve this? It also fetches other text data. Thank you so much for your support. – DataEngineer_Developer Jun 08 '20 at 15:25
If it fetches extra data, you can do some data cleaning steps on the data frame that is returned. None of the text extractors are 100% accurate. – pykam Jun 09 '20 at 08:42
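
As one possible cleaning step, here is a minimal sketch that drops rows where almost every cell is empty, which is typically how stray text around a table shows up. It works on plain rows, which you could get with `tables[0].df.values.tolist()`. The function name `keep_table_rows`, the `min_filled` threshold and the sample rows are all hypothetical; tune them to your own data.

```python
def keep_table_rows(rows, min_filled=2):
    """Keep only rows with at least `min_filled` non-empty cells."""
    cleaned = []
    for row in rows:
        # Normalise every cell to a stripped string
        cells = [str(cell).strip() for cell in row]
        filled = sum(1 for cell in cells if cell)
        if filled >= min_filled:
            cleaned.append(cells)
    return cleaned


# Hypothetical extraction result: a page title and an empty row around a table
rows = [
    ["Quarterly report, page 3", "", ""],
    ["Item", "Qty", "Price"],
    ["Widget", "4", "9.99"],
    ["", "", ""],
    ["Gadget ", "1", "24.50"],
]
for row in keep_table_rows(rows):
    print(row)
```

Rows with a single filled cell are treated as surrounding text here, so a table that legitimately has one-cell rows would need a different rule.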
