0

I am new to pandas. I have a PDF data table which I read and extract the data from, converting it into an Excel file. This is a short, fictitious example of the table I am using, however the structure of the PDF is exactly the same:

table.pdf

After converting the PDF to Excel, I get this: The table is complete and the conversion was successfully completed. excel file

Notice that each word within the single quotes, are a column within my original PDF. The commas, are some empty columns that are in the PDF.

I would like to know how I can turn those words into columns again, using pandas? I want to get a result similar to this: final excel

My simple code is below:

import pdfplumber
import pandas as pd
from pandas import ExcelWriter

with pdfplumber.open('Sistema BanPara.pdf') as pdf:
    df = pd.DataFrame(page.extract_table() for page in pdf.pages).astype(str)
with pd.ExcelWriter('table.xlsx') as writer:
    df.to_excel(writer)
  • Can you try `df = pd.concat(pd.DataFrame(page.extract_table()) for page in pdf.pages)` instead? Currently you're passing all the tables at once to a single `pd.DataFrame()` call which is why you're ending up with nested lists. – jqurious Jul 14 '23 at 20:59
  • Rather than processing via Excel see [link](https://pypi.org/project/tabula-py/) for a module that forms Pandas DFs direct from PDF files. – user19077881 Jul 15 '23 at 07:02

0 Answers0