
I have files in multiple formats (pdf, doc, rtf, odt, png) in my AWS S3 bucket and I need to extract text from them. I have managed to get the list of objects with their paths. Depending on the file type, I will use a different library to extract the text. Since there can be thousands of files, I need to extract the text directly from S3 instead of downloading each one.

filespath=['https://abc.s3.ap-south-1.amazonaws.com/DocumentOnPATest', 'https://abc.s3.ap-south-1.amazonaws.com/IndustryReport2019.pdf', 'https://abc.s3.ap-south-1.amazonaws.com/receipt.png', 'https://abc.s3.ap-south-1.amazonaws.com/sample.rtf', 'https://abc.s3.ap-south-1.amazonaws.com/sample1.odt']

bucketname = 'abc'

I tried something like this:

import pathlib
import PyPDF2

for path in filespath:
    ext = pathlib.Path(path).suffix
    if ext == '.pdf':
        pdf_file = PyPDF2.PdfFileReader(path)
        print(pdf_file.extractText())

but I am getting this error:

  File "F:\Projects\FileExtractor\fileextracts3.py", line 28, in <module>
    pdf_file=PyPDF2.PdfFileReader(path)

  File "C:\ProgramData\Anaconda3\lib\site-packages\PyPDF2\pdf.py", line 1081, in __init__
    fileobj = open(stream, 'rb')

OSError: [Errno 22] Invalid argument: 'https://abc.s3.ap-south-1.amazonaws.com/IndustryReport2019.pdf'

Please help me with a lead. Thank you.

  • "since files can be in thousands i need to extract text directly from s3 instead of downloading" -- you cannot operate on data locally unless you download that data. This doesn't mean you have to write it to a file, and it doesn't mean you have to keep it after finishing processing, either. – Ulrich Eckhardt Jan 19 '21 at 08:15

2 Answers


PyPDF2 does not support reading from S3 directly. You'll need to download the files locally first.

Or you can try using AWS Lambda functions to process files directly from S3 buckets.
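As a minimal sketch of the download-first approach: since the question's `filespath` entries are full object URLs, you first need to derive the bucket key from each URL, then fetch the object into a temporary file for PyPDF2 to read. The helper names are my own, and this assumes the pre-2.0 PyPDF2 API (`PdfFileReader`, `getPage`, `extractText`) that the question uses.

```python
import tempfile
from urllib.parse import urlparse


def key_from_url(url):
    """Derive the S3 object key from a full https object URL."""
    return urlparse(url).path.lstrip("/")


def extract_pdf_from_s3(s3_client, bucket, key):
    """Download the object to a temporary file, then read it with PyPDF2."""
    import PyPDF2  # imported lazily so key_from_url works without PyPDF2 installed
    with tempfile.NamedTemporaryFile(suffix=".pdf") as tmp:
        s3_client.download_fileobj(bucket, key, tmp)
        tmp.seek(0)
        reader = PyPDF2.PdfFileReader(tmp)
        return "".join(reader.getPage(i).extractText()
                       for i in range(reader.getNumPages()))


# Usage sketch (bucket name "abc" taken from the question):
# import boto3
# s3 = boto3.client("s3")
# key = key_from_url("https://abc.s3.ap-south-1.amazonaws.com/IndustryReport2019.pdf")
# print(extract_pdf_from_s3(s3, "abc", key))
```

The temporary file is deleted automatically when the `with` block exits, so nothing persists on disk after processing.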
Krishna Chaurasia
  • Sorry, but I didn't understand. Can you explain it in simpler terms? Do I need to create a Lambda function in AWS Lambda, and if so, how can it be applied to multiple files? @Krishna Chaurasia –  Jan 19 '21 at 09:41
  • Actually, based on the link in the answer, I think even with Lambda functions you'll have to download the files from S3 before processing them. I'll edit the answer to remove the second part. – Krishna Chaurasia Jan 19 '21 at 11:38

You could try the boto3 solution here, provided by Justin Leto. You would still need a way of reading/converting the file stream for each file type, but the PDF answer is there.

import boto3

s3 = boto3.resource('s3')
obj = s3.Object(bucket_name, itemname)  # bucket_name and itemname are placeholders
fs = obj.get()['Body'].read()           # raw bytes of the object, held in memory
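The bytes returned by `obj.get()['Body'].read()` can be wrapped in `io.BytesIO` so each per-type library reads from memory instead of disk. A sketch of the dispatch-by-suffix idea from the question (the `extract_text` helper is my own, and the PDF branch assumes the pre-2.0 PyPDF2 API):

```python
import io
import pathlib


def extract_text(key, data):
    """Dispatch raw S3 object bytes to a reader based on the file suffix.

    Returns None for types not handled in this sketch.
    """
    ext = pathlib.Path(key).suffix.lower()
    if ext == ".pdf":
        import PyPDF2  # imported lazily; pre-2.0 PyPDF2 API assumed
        reader = PyPDF2.PdfFileReader(io.BytesIO(data))
        return "".join(reader.getPage(i).extractText()
                       for i in range(reader.getNumPages()))
    # .png would need OCR (e.g. pytesseract); .rtf/.odt/.doc each need their own parser.
    return None


# Usage sketch with the boto3 snippet above (bucket/key are placeholders):
# obj = s3.Object("abc", "IndustryReport2019.pdf")
# print(extract_text("IndustryReport2019.pdf", obj.get()["Body"].read()))
```

This keeps everything in memory, which matches the comment above: the bytes are still transferred from S3, but nothing is written to disk.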
RodP