Python: Parse through extracted lines

Question

So i am trying to work on scrapping using PDFplumber and want to extract the text from this PDF and covert it into an excel (with each value - like the Expense apart from the numbers- in its own cell).

I started a bit of the code and was successfully able to break it down by line!! so each line prints on its own but now im having trouble splitting the numbers into their own "column" EX: Retail Rent 444,335.40 75.12 444,335.40 75.12

should become Retail Rent | 444,335.40 | 75.12 | 444,335.40 | 75.12

import pdfplumber

def extracted_lines_from_pdf(pdf_path): 
    extracted_lines = []

    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            lines = page.extract_text().split('\n')
            extracted_lines.extend(lines)

    return extracted_lines

pdf_path = 'Sample_Numbers.pdf'
lines = extracted_lines_from_pdf(pdf_path)

for line in lines:
    print(line +'\n')

I have it in an list right now and want to essentially have a list i guess? ( i was hoping i could do it in a way where it would recognize "if there is a letter, this is the expense and if a number follows, the first number goes with "period to date", second is "%", etc.... if there is no number, go to the next line" (OUTPUT SO FAR IS BELOW)

I have the same structure: string, number, number, etc which are separated by spaces the end goal is to convert this to excel and have each entity in its own cell essentially

haha @SimonUnderwood I meant list, just changed it and u caught it lol — megarocker241, May 23 '23 at 21:31
For each line, you can use `words = line.split()` to split a line by space. Then you can use a [match statement](https://www.geeksforgeeks.org/python-match-case-statement/) on the resulting words to parse the line. — SimonUnderwood, May 23 '23 at 21:31
@CaptainCaveman so right now i have the all the text in a list so it shows up like ['Sample for UC Berkeley', 'Period = Jan 2022-Dec 2022', 'Book = EXAMPLE', 'Month % Year %', 'REVENUE', 'RENTINCOME', 'Rent 444.40 75.12 444.40 75.12', 'Other Income 500.00 0.08 500.00 0.08'] — megarocker241, May 23 '23 at 21:32
@CaptainCaveman So when i print(lines) i get ['Sample for UC Berkeley', 'Period = Jan 2022-Dec 2022', 'Book = EXAMPLE', 'Month % Year %', 'REVENUE', 'RENTINCOME', 'Rent 444.40 75.12 444.40 75.12', 'Other Income 500.00 0.08 500.00 0.08'] so hoping it comes out ['Sample for UC Berkeley'] [Period , Jan 2022, Dec 2022], [Month % Year %], [REVENUE], [RENTINCOME, 'Rent, 444.40 ,75.12 ,444.40 ,75.12] something like that — megarocker241, May 23 '23 at 21:38
@CaptainCaveman i eventually will use df_path.to_excel('file_path/file.xlsx') — megarocker241, May 23 '23 at 21:41
@CaptainCaveman oops... i guess yes (sorry very bad with terminology... im learning) — megarocker241, May 23 '23 at 21:43
Basically ['Sample for UC Berkeley'] will be its own cell, and then 'RENTINCOME', 'Rent 444.40 75.12 444.40 75.12' will be [RENTINCOME, 'Rent] *in its own cell [444.40] in a new cell ,[ 75.12 ] in a new cell [444.40] in a new cell — megarocker241, May 23 '23 at 21:49
You're aware that there are lots of online tools and Python libraries (e.g. PDFplumber which you use, Tabula, etc.) that claim they can parse tables in PDFs? Perhaps you don't need to do all the heavy lifting yourself. — Joooeey, May 29 '23 at 19:17

Captain Caveman · Accepted Answer · 2023-05-24T13:28:37.137

If you want to convert the list to a dataframe and export it as an xlsx file, this is one approach.

import pdfplumber
import pandas as pd

def extracted_lines_from_pdf(pdf_path): 
    extracted_lines = []

    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            lines = page.extract_text().split('\n')
            extracted_lines.extend(lines)

    return extracted_lines

pdf_path = 'Sample_Numbers.pdf'
lines = extracted_lines_from_pdf(pdf_path)

data = [[item] for item in lines]

df = pd.DataFrame(data)

df.to_excel('output.xlsx', index=False, header=False)

Output:

Python: Parse through extracted lines

1 Answers1