Python & Pandas: combining multiple rows into single cell

Question

I'm writing a script that extracts text from a PDF file and inserts it as a string into a single CSV row. Using pdfplumber I can successfully extract the text, with each page's text inserted into the CSV as an individual row. However, I'm struggling to figure out how to combine those rows into a single cell. I'm attempting to use pandas' pd.concat function to combine them, but so far without success.

Here's my code:

import pdfplumber
import pandas as pd
import csv

file1 = open("pdf_texts.csv", "w", newline="")
file2 = open("pdf_text_pgs.csv", "w", newline="")
writer2 = csv.writer(file2)
headers = ['text']

with pdfplumber.open('target.pdf') as pdf:
    pdf_length = len(pdf.pages)

    writer2.writerow(headers)

    for page_number in range(0, pdf_length):
        pdf_output = pdf.pages[page_number]
        pdf_txt = pdf_output.extract_text().encode('UTF-8')
        writer2.writerow([pdf_txt])

    # this is my attempt for pd.concat
    df  = pd.read_csv("pdf_text_pgs.csv", 'r')
    df_txts = df['text']
    pdf_txt_df = pd.concat([df_txts], axis=0, ignore_index=True)
    pdf_txt_df.to_csv('pdf_texts.csv', header=False, index=False)

However, the final output fails to combine the rows, and worse yet seems to lose the final row. Any suggestions on how to approach this? All help gratefully appreciated.

Can you provide a link to the PDF file so the script can be tested? — Martin Evans, Nov 10 '21 at 13:41
Here you go - [link](http://danielhutchinson.org/research/files/original/14639cd8a5271d38989ead748a8b7141b05acfc3.pdf) - many thanks — Daniel Hutchinson, Nov 10 '21 at 14:18
That PDF does not appear to have any text, only images of text? (`.extract_text()` returns `None` for each page) — Martin Evans, Nov 10 '21 at 15:35
Apologies - try this [pdf](http://danielhutchinson.org/research/files/original/8b242903afe391294f889abd4182496185f99af0.pdf) instead. — Daniel Hutchinson, Nov 10 '21 at 15:47

score 1 · Accepted Answer · answered Nov 10 '21 at 15:54

You would just need to store the text from each page in a list and combine it all at the end. For example:

import pdfplumber
import csv

with pdfplumber.open('target.pdf') as pdf, \
     open("pdf_text_pgs.csv", "w", newline="", encoding="utf-8") as f_output:

    csv_output = csv.writer(f_output)
    csv_output.writerow(['text'])

    text = []
    
    for page in pdf.pages:
        extracted_text = page.extract_text()
        
        if extracted_text:  # skip pages with no extractable text (blank or image-only)
            text.append(extracted_text)
        
    csv_output.writerow([' '.join(text)])
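
If you do want to stay in pandas, note that pd.concat stacks the objects in the list it is given; passing it a single Series just hands back the same rows, so nothing gets merged. The missing final row in the question is probably because file2 is never closed or flushed before pd.read_csv reads it, though I haven't verified that against the original PDF. Below is a minimal sketch of joining a column into one cell. The sample strings and the names pages, combined and single_cell are made up for illustration; in practice the column would come from the per-page CSV.

```python
import pandas as pd

# Hypothetical per-page text standing in for the rows of pdf_text_pgs.csv.
# None represents a page where no text could be extracted.
pages = pd.DataFrame({"text": ["First page text.", None, "Third page text."]})

# Drop the missing pages, then join the remaining strings into one value.
combined = " ".join(pages["text"].dropna())

# A one-row, one-column DataFrame: all of the text sits in a single cell.
single_cell = pd.DataFrame({"text": [combined]})

# Writing it out would then be, for example:
# single_cell.to_csv("pdf_texts.csv", index=False)
```

For a one-off script the plain list plus ' '.join approach above is simpler; the pandas version mainly helps if the per-page rows are already in a DataFrame.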

Many thanks, tremendously helpful! — Daniel Hutchinson, Nov 10 '21 at 16:35
