0

So i am trying to work on scrapping using PDFplumber and want to extract the text from this PDF and covert it into an excel (with each value - like the Expense apart from the numbers- in its own cell).

I started a bit of the code and was successfully able to break it down by line!! so each line prints on its own but now im having trouble splitting the numbers into their own "column" EX: Retail Rent 444,335.40 75.12 444,335.40 75.12

should become Retail Rent | 444,335.40 | 75.12 | 444,335.40 | 75.12

import pdfplumber

def extracted_lines_from_pdf(pdf_path): 
    extracted_lines = []

    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            lines = page.extract_text().split('\n')
            extracted_lines.extend(lines)

    return extracted_lines

pdf_path = 'Sample_Numbers.pdf'
lines = extracted_lines_from_pdf(pdf_path)

for line in lines:
    print(line +'\n')

I have it in an list right now and want to essentially have a list i guess? ( i was hoping i could do it in a way where it would recognize "if there is a letter, this is the expense and if a number follows, the first number goes with "period to date", second is "%", etc.... if there is no number, go to the next line" (OUTPUT SO FAR IS BELOW)

I have the same structure: string, number, number, etc which are separated by spaces the end goal is to convert this to excel and have each entity in its own cell essentially


  • Arrays don't exist in python. `[]` is a list. – SimonUnderwood May 23 '23 at 21:29
  • haha @SimonUnderwood I meant list, just changed it and u caught it lol – megarocker241 May 23 '23 at 21:31
  • For each line, you can use `words = line.split()` to split a line by space. Then you can use a [match statement](https://www.geeksforgeeks.org/python-match-case-statement/) on the resulting words to parse the line. – SimonUnderwood May 23 '23 at 21:31
  • @CaptainCaveman so right now i have the all the text in a list so it shows up like ['Sample for UC Berkeley', 'Period = Jan 2022-Dec 2022', 'Book = EXAMPLE', 'Month % Year %', 'REVENUE', 'RENTINCOME', 'Rent 444.40 75.12 444.40 75.12', 'Other Income 500.00 0.08 500.00 0.08'] – megarocker241 May 23 '23 at 21:32
  • @CaptainCaveman So when i print(lines) i get ['Sample for UC Berkeley', 'Period = Jan 2022-Dec 2022', 'Book = EXAMPLE', 'Month % Year %', 'REVENUE', 'RENTINCOME', 'Rent 444.40 75.12 444.40 75.12', 'Other Income 500.00 0.08 500.00 0.08'] so hoping it comes out ['Sample for UC Berkeley'] [Period , Jan 2022, Dec 2022], [Month % Year %], [REVENUE], [RENTINCOME, 'Rent, 444.40 ,75.12 ,444.40 ,75.12] something like that – megarocker241 May 23 '23 at 21:38
  • @CaptainCaveman i eventually will use df_path.to_excel('file_path/file.xlsx') – megarocker241 May 23 '23 at 21:41
  • @CaptainCaveman oops... i guess yes (sorry very bad with terminology... im learning) – megarocker241 May 23 '23 at 21:43
  • Basically ['Sample for UC Berkeley'] will be its own cell, and then 'RENTINCOME', 'Rent 444.40 75.12 444.40 75.12' will be [RENTINCOME, 'Rent] *in its own cell [444.40] in a new cell ,[ 75.12 ] in a new cell [444.40] in a new cell – megarocker241 May 23 '23 at 21:49
  • cleaned up comments. – Captain Caveman May 23 '23 at 22:12
  • You're aware that there are lots of online tools and Python libraries (e.g. PDFplumber which you use, Tabula, etc.) that claim they can parse tables in PDFs? Perhaps you don't need to do all the heavy lifting yourself. – Joooeey May 29 '23 at 19:17

1 Answers1

0

If you want to convert the list to a dataframe and export it as an xlsx file, this is one approach.

import pdfplumber
import pandas as pd

def extracted_lines_from_pdf(pdf_path): 
    extracted_lines = []

    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            lines = page.extract_text().split('\n')
            extracted_lines.extend(lines)

    return extracted_lines

pdf_path = 'Sample_Numbers.pdf'
lines = extracted_lines_from_pdf(pdf_path)

data = [[item] for item in lines]

df = pd.DataFrame(data)

df.to_excel('output.xlsx', index=False, header=False)

Output:

enter image description here

Captain Caveman
  • 1,448
  • 1
  • 11
  • 24