extracting data into columns using pdfplumber

Question

I have a PDF with tabular data in 6 columns, but the columns are not separated by borders. When I extract the data using pdfplumber, everything ends up in a single cell, and I want each column in its own cell.

How could I do that?

For your reference:

15/03/2021 RTGS-UTIBR52021031300662458-VIRENDER KUMAR 2,60,635.00 2,94,873.94Cr
11/03/2021 IMPS/P2A/107018040382/XXXXXXXXXX0980/trf 49,500.00 34,238.94Cr
11/03/2021 IMPS/P2A/107018771795/KINGDOMHOTELAND/trf 35,000.00 83,738.94Cr

Thanks in advance

score 0 · Answer 1 · answered Dec 14 '22 at 09:19

You can use the extract_tables() method to get the tables into a DataFrame.

The code below handles only page 0; use a for loop to extract the tables from all the pages.

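import pandas as pd
import pdfplumber

path = file_path  # file_path: path to your PDF

with pdfplumber.open(path, password="") as pdf:
    # extract_tables() returns one list of rows per table found on the page
    tables = pdf.pages[0].extract_tables()
# Build the DataFrame from the first table (empty if none was detected)
table = pd.DataFrame(tables[0]) if tables else pd.DataFrame()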
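When a table has no drawn borders, the default line-based detection may find nothing at all. The sketch below assumes the columns are aligned by whitespace, switches both pdfplumber table strategies to "text" through table_settings, and loops over every page. The names TEXT_SETTINGS and extract_all_rows are hypothetical, and how well this works depends on the layout of the PDF.

```python
# Assumption: columns are aligned by whitespace rather than by drawn lines.
TEXT_SETTINGS = {
    "vertical_strategy": "text",
    "horizontal_strategy": "text",
}


def extract_all_rows(path, password=None):
    """Collect table rows from every page using text-based column detection."""
    import pdfplumber  # imported lazily so the settings can be inspected without it

    rows = []
    with pdfplumber.open(path, password=password) as pdf:
        for page in pdf.pages:
            # extract_tables() returns a list of tables; each table is a list of rows
            for table in page.extract_tables(table_settings=TEXT_SETTINGS):
                rows.extend(table)
    return rows
```

The returned rows can be passed to pd.DataFrame(rows). If cells come out merged or split, pdfplumber's tolerance settings (for example "snap_tolerance") may need tuning for that file.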

Change the code as per your requirements.
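If table detection does not split the columns, a possible fallback is to parse the extracted text line by line. The sketch below runs a regular expression over the sample rows from the question. It assumes each row has the form date, description, amount, balance, with an optional Cr/Dr suffix ("Dr" is a guess; only "Cr" appears in the sample). ROW_RE and parse_line are hypothetical names. It recovers only the four fields visible in the sample, not all six columns.

```python
import re

# Assumed row layout: DD/MM/YYYY  description  amount  balance[Cr|Dr]
ROW_RE = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{4})\s+"
    r"(?P<description>.+?)\s+"
    r"(?P<amount>[\d,]+\.\d{2})\s+"
    r"(?P<balance>[\d,]+\.\d{2})(?P<balance_type>Cr|Dr)?$"
)


def parse_line(line):
    """Return a dict of fields for a transaction line, or None if it does not match."""
    match = ROW_RE.match(line.strip())
    return match.groupdict() if match else None


sample_lines = [
    "15/03/2021 RTGS-UTIBR52021031300662458-VIRENDER KUMAR 2,60,635.00 2,94,873.94Cr",
    "11/03/2021 IMPS/P2A/107018040382/XXXXXXXXXX0980/trf 49,500.00 34,238.94Cr",
    "11/03/2021 IMPS/P2A/107018771795/KINGDOMHOTELAND/trf 35,000.00 83,738.94Cr",
]

# Keep only the lines that look like transactions
rows = [row for row in map(parse_line, sample_lines) if row is not None]
for row in rows:
    print(row)
```

With a real file, the lines would come from (page.extract_text() or "").splitlines() for each page, and the resulting list of dicts can be passed straight to pd.DataFrame(rows).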

answered Dec 14 '22 at 09:19

jainam shah

I attempted, but it did not work. – arvin Dec 14 '22 at 11:36
@arvin Can you share the PDF for reference? – jainam shah Dec 15 '22 at 06:41
https://i.postimg.cc/pL73r3GS/1.png – arvin Dec 15 '22 at 07:56
