
I am trying to extract the text from 3000+ PDFs into a single txt file, removing the header from each page along the way:

for x in range(len(files)-len(files)+15):
    pdfFileObj=open(files[x],'rb')
    pdfReader=PyPDF2.PdfFileReader(pdfFileObj)
    for pageNum in range(1,pdfReader.numPages):
        pageObj=pdfReader.getPage(pageNum)
        content=pageObj.extractText()
        removeIndex = content.find('information.') + len('information.')
        newContent=content[removeIndex:]
        file.write(newContent)
file.close()

However, I get the following error:

return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\ufb02' in position 5217: character maps to <undefined>
Yuna Luzi
  • Where is the variable `file` initialized? Also, print the error instead of printing "oops"; refer to https://wiki.python.org/moin/HandlingExceptions – Saicharan S M May 06 '16 at 21:39
  • Your except block is essentially silencing any useful error information from ever being shown. Remove it, run your code, then tell us what errors actually occur. – James Scholes May 06 '16 at 21:40
  • @SaicharanSM the variable file is initialized before the loop begins: file=open('allText.txt', 'w') – Yuna Luzi May 09 '16 at 13:35
  • The error has nothing to do with your title. (A dirty fix would be to `contents.replace('\ufb02', 'fl')` - the better fix is to use an encoding that supports this character). – Jongware May 09 '16 at 14:04
  • Could you try to open your output file using [`codecs.open()`](https://docs.python.org/3/library/codecs.html#codecs.open) instead of the `open()` you're using now, passing appropriate encoding information to the function (btw, have a look at the linked documentation). – gboffi May 09 '16 at 14:18
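
The two suggestions in the comments above can be illustrated with a short sketch. It assumes the 'charmap' codec in the traceback is cp1252 (a common Windows default, but not confirmed by the post), and the helper names `can_encode` and `expand_ligatures` are made up for illustration. NFKC normalization is swapped in here as a more general form of the single `replace()` call; note that it also rewrites other compatibility characters, which may or may not be wanted.

```python
import unicodedata

LIGATURE_FL = "\ufb02"  # LATIN SMALL LIGATURE FL, the character from the traceback


def can_encode(text, encoding):
    """Return True if `text` can be encoded with `encoding` (hypothetical helper)."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def expand_ligatures(text):
    """Replace compatibility characters such as ligatures with plain letters.

    NFKC is an assumption about what is acceptable here: it also changes
    other compatibility characters (full-width forms, superscripts, ...).
    """
    return unicodedata.normalize("NFKC", text)


# cp1252 has no slot for the ligature, while utf-8 can encode any character.
print(can_encode(LIGATURE_FL, "cp1252"))
print(can_encode(LIGATURE_FL, "utf-8"))
print(expand_ligatures("in" + LIGATURE_FL + "ation"))
```

Either route avoids the error: normalizing removes the offending character, while choosing utf-8 for the output file keeps the text unchanged.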

1 Answer

I was not able to check the encoding of each PDF so I just used replace(). Below is the working code:

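import os
import PyPDF2

# `files` and `filepath` are assumed to be defined earlier.
# utf-8 can represent every character, so writing no longer fails.
file=open('allText.txt','w',encoding='utf-8')
for x in range(len(files)):
    pdfFileObj=open(os.path.join(filepath,files[x]),'rb')
    pdfReader=PyPDF2.PdfFileReader(pdfFileObj)
    # Page index 0 is skipped, as in the original loop.
    for pageNum in range(1,pdfReader.numPages):
        pageObj=pdfReader.getPage(pageNum)
        content=pageObj.extractText()
        # Drop everything up to and including the header's closing 'information.'
        removeIndex = content.find('information.') + len('information.')
        newContent=content[removeIndex:]
        newContent=newContent.replace('\n',' ')
        # Expand the 'fl' ligature into two plain letters.
        newContent=newContent.replace('\ufb02','fl')
        file.write(newContent)
    pdfFileObj.close()
file.close()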
Yuna Luzi
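
As a hedged alternative to replacing individual characters, the writing step can be sketched on its own with an explicitly utf-8 output file, along the lines suggested in the comments. The functions `strip_header` and `write_pages` are hypothetical names, not part of PyPDF2; the sketch takes already-extracted page strings so it runs without any PDFs. It also assumes that a page lacking the 'information.' marker should be kept whole, whereas `find()` returning -1 would otherwise cut off the first 11 characters.

```python
import os
import tempfile

HEADER_END = "information."  # marker that ends each page header in the post


def strip_header(content, marker=HEADER_END):
    """Return the text after the header marker, or the whole text if absent."""
    idx = content.find(marker)
    if idx == -1:
        return content  # assumption: pages without a header are kept as-is
    return content[idx + len(marker):]


def write_pages(pages, out_path):
    """Write header-stripped pages to `out_path` as utf-8 text."""
    # An explicit encoding avoids the platform default (e.g. a 'charmap' codec).
    with open(out_path, "w", encoding="utf-8") as out:
        for content in pages:
            out.write(strip_header(content).replace("\n", " "))


# Small demonstration with made-up page text containing the 'fl' ligature.
demo_path = os.path.join(tempfile.mkdtemp(), "allText.txt")
write_pages(["Header information.in\ufb02ation\nrate", "no header here"], demo_path)
with open(demo_path, encoding="utf-8") as f:
    print(f.read())
```

In the real loop, each `content` would come from `pageObj.extractText()`; the `with` block also closes the file even if a page raises an error partway through.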