How to avoid text spreading across multiple rows when storing PDF text in a CSV file

Question

I am storing PDF text (extracted with pypdf) in a CSV file. The problem is that a few PDF files are very long, and for those files the text spreads across multiple rows instead of staying in a single row. How can I keep the text in a single row? My output looks like this:

column1    column2            
 long pdf    hello my
             name is jhone
 short pdf   hello my name is jhone. I haven't any problem for short pdf file

my code:

pdf_url ='https://www.snb.ch/en/mmr/speeches/id/ref_20230330_amrtmo/source/ref_20230330_amrtmo.en.pdf'
print("pdf_url: ",pdf_url)
   
# Download the PDF file from the URL
response = requests.get(pdf_url)

# Create an in-memory buffer from the PDF content
pdf_buffer = io.BytesIO(response.content)

# Read the PDF file from the in-memory buffer
pdf = PdfReader(pdf_buffer)
pdf_content = []
# Access the contents of the PDF file
for page in pdf.pages:
    pdf_content.append(str(page.extract_text()))

with open(filename, "a", newline="", encoding="utf8") as f:
    writer = csv.writer(f)
    writer.writerow([first_author, new_date_str, speech_title, pdf_url, pdf_content])

pdf_content.clear()

Hi! Have you tried removing the newlines from the PDF text? Also, could you please provide runnable code? As it stands, your code can't be run, so it's very difficult to help. – Hoodlum, May 09 '23 at 09:11
Your code contains no imports, so I have no idea which modules you're using (I can guess, but that's not ideal, since I may be wrong). Also, `filename`, `first_author`, `new_date_str` and `speech_title` are not defined. – Hoodlum, May 09 '23 at 09:17
@Hoodlum They are defined. This is only part of my code, and I don't want to paste a thousand lines here. – boyenec, May 09 '23 at 11:58
The reason I ask for runnable code is so that I can run it and try to help you. If I have to guess how to make your example runnable, that's a bit of a waste of time and can introduce new bugs or misunderstandings. You don't need to paste thousands of lines of code; a [minimal reproducible example](https://stackoverflow.com/help/minimal-reproducible-example) is enough. – Hoodlum, May 09 '23 at 12:11
You can just remove the undefined variables [first_author, new_date_str, speech_title] and the code will work. My partial code is also 100% runnable. – boyenec, May 09 '23 at 16:01

Hoodlum · Answer 1 · 2023-05-09T09:48:16.700

It looks like this might be a limitation of the program you use to open the CSV, rather than a problem in your script: if you're using MS Excel (like I am), you'll find that a cell can hold at most 32,767 characters (see Excel's specifications and limits).

When I check the length of the last string in the line, I find it to be just under this limit. Why this results in a new line is unclear to me though.
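One way to check whether the file itself is fine is to read it back with Python's `csv` module and count the records. The sketch below is only an illustration: `inspect_csv` and the sample text are hypothetical and not part of the question's code. It builds one record with a long, multi-line field and reports how many logical records the CSV contains and whether any field exceeds Excel's per-cell limit.

```python
import csv
import io

EXCEL_CELL_LIMIT = 32_767  # Excel's documented maximum characters per cell


def inspect_csv(csv_text):
    """Parse CSV text and return (record_count, longest_field_length)."""
    rows = list(csv.reader(io.StringIO(csv_text, newline="")))
    longest = max((len(field) for row in rows for field in row), default=0)
    return len(rows), longest


# Build a sample record whose last field is long and contains line breaks,
# similar to text extracted from a multi-page PDF (made-up data).
buffer = io.StringIO(newline="")
long_text = "hello my\nname is jhone\n" * 2000
csv.writer(buffer).writerow(["long pdf", long_text])

record_count, longest = inspect_csv(buffer.getvalue())
print("records:", record_count)
print("exceeds Excel cell limit:", longest > EXCEL_CELL_LIMIT)
```

If the record count matches the number of `writerow` calls, the CSV is well formed and the extra rows come from the viewer, not from the script.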

Workaround

To get around this limitation (in Excel), you can use the 'From CSV' option to explicitly tell Excel to import the data as a table. The text should then display correctly.
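Separately from the Excel limit, note that the question's code passes a list of page strings to `writerow`. The `csv` module converts non-string values with `str()`, so the cell ends up holding the list's representation rather than plain text. A minimal sketch of one way to tidy this, assuming you want a single plain string per PDF with line breaks collapsed (as suggested in the comments): `pages_to_cell` and the sample pages are hypothetical names and data. This does not lift Excel's per-cell limit; it only makes the stored value cleaner.

```python
import csv
import io
import re


def pages_to_cell(pages):
    """Join page texts into one string and collapse all whitespace runs."""
    joined = " ".join(pages)
    return re.sub(r"\s+", " ", joined).strip()


# Hypothetical page texts standing in for page.extract_text() results.
pages = ["hello my\nname is jhone", "second page\n\ntext"]
cell = pages_to_cell(pages)

# Write to an in-memory buffer here; a real script would use open(...) instead.
buffer = io.StringIO(newline="")
csv.writer(buffer).writerow(["long pdf", cell])
csv_text = buffer.getvalue()
print(csv_text)
```

In the question's script, this would mean writing `pages_to_cell(pdf_content)` in place of `pdf_content` in the `writerow` call.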

edited May 09 '23 at 09:48

answered May 09 '23 at 09:37


Can you show me how to use From CSV? – boyenec May 09 '23 at 11:56
The easiest way is to type '*From CSV*' in the search bar in Excel. Otherwise, you'll find it under *Data > Get & Transform Data > From Text/CSV*. This opens a wizard in which you select your CSV file. – Hoodlum May 09 '23 at 12:02
