Is there a way to extract the text and documentInfo from a PDF file uploaded via Google App Engine? I want to use PyPDF2, and this is my code:

pdf_file = self.request.POST['file'].file
pdf_reader = pypdf.PdfFileReader(pdf_file)

This gives me the following error:

Traceback (most recent call last):
....
  File "/myrepo/myproj/main.py", line 154, in post
    pdf_text = pypdf.PdfFileReader(pdf_file)
  File "lib/PyPDF2/pdf.py", line 649, in __init__
    self.read(stream)
  File "lib/PyPDF2/pdf.py", line 1100, in read
    raise utils.PdfReadError, "EOF marker not found"
PdfReadError: EOF marker not found

I get this error for every file, even for files that can be read successfully from disk via open(filename, 'r').

Am I missing something? Thanks in advance!

Martin Thoma
funkifunki

1 Answer

The solution is to use get_uploads from blobstore_handlers.BlobstoreUploadHandler:

from google.appengine.ext import blobstore
from google.appengine.ext.webapp import blobstore_handlers
from cStringIO import StringIO
import PyPDF2

class UploadHandler(blobstore_handlers.BlobstoreUploadHandler):
    def post(self):
        # 'file' is the name of the upload field in the form
        upload_files = self.get_uploads('file')
        blob_info = upload_files[0]
        # Read the whole blob into a seekable in-memory buffer
        blob_reader = blobstore.BlobReader(blob_info)
        blob_content = StringIO(blob_reader.read())
        pdf_info = PyPDF2.PdfFileReader(blob_content)
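
The handler above copies the whole blob into a seekable in-memory buffer before parsing it. As a rough sketch of how one might sanity-check such a buffer before handing it to PyPDF2, the snippet below looks for the trailing %%EOF marker that the traceback complains about. The helper names buffer_upload and has_eof_marker and the 1024-byte tail size are hypothetical, chosen only for illustration; io.BytesIO is used here instead of cStringIO so the snippet also runs on Python 3.

```python
import io


def buffer_upload(stream):
    """Copy a file-like upload into a seekable in-memory buffer.

    Hypothetical helper: 'stream' is any object with a read() method
    that returns bytes.
    """
    return io.BytesIO(stream.read())


def has_eof_marker(buf, tail_size=1024):
    """Return True if b'%%EOF' appears in the last tail_size bytes of buf.

    tail_size is an assumed value for this sketch, not a PyPDF2 setting.
    """
    buf.seek(0, io.SEEK_END)
    size = buf.tell()
    buf.seek(max(0, size - tail_size))
    tail = buf.read()
    buf.seek(0)  # rewind so a PDF parser can start from the beginning
    return b"%%EOF" in tail
```

If has_eof_marker returns False for a file that parses fine from disk, the stream being passed in is probably empty or truncated rather than the PDF itself being broken.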
funkifunki