I am currently working with PDFMiner.six to extract text from multiple PDFs. Looking at my output, I can see some weird conversions of special characters such as brackets:

Opening and closing brackets:

Finally, I delete all paragraphs 共defined as two lines containing text with a blank line before and after兲 with more than 50 percent

Other brackets:

具TEXT典

Plus:

Words+Tables

WORDS⫹TABLES

Minus:

(-0.141)

共⫺1.41兲

Test of (SML * COMPLEX-LRG * COMPLEX) < 0

Test of 共SML ⴱ COMPLEX⫺LRG ⴱCOMPLEX兲 ⬍ 0

I am using the following code:

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO
import os
import re
# from datetime import datetime
# a = datetime.now()
i = 0
path = r"C:\Users\1_T_Python"
save_to = r"C:\Users\1_T\txt files"

for filename in os.listdir(path):
    if filename.endswith(".pdf"):
        rsrcmgr = PDFResourceManager()
        retstr = StringIO()
        codec = 'utf-8'
        laparams = LAParams()
        device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
        fp = open(path+"\\"+filename,'rb')
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        password = ""
        caching = True
        pagenos=set()
        for page in PDFPage.get_pages(fp, pagenos, password=password,caching=caching, check_extractable=True):
            interpreter.process_page(page)
        fp.close()
        device.close()
        string = retstr.getvalue()
        retstr.close()
        #print(string)
        with open(save_to+"\\"+filename+".txt", "w", encoding="utf-8") as text_file:
            text_file.write(string)
        i = i+1
        print(i)

I think this is an encode/decode issue, but I could not find a solution on SO so far. I thought that using utf-8 as the encoding would handle the problem, but it did not.

Any help appreciated!
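
To check whether this really is an encode/decode problem, it may help to list the unexpected characters together with their code points. The sketch below uses only the standard library; `report_non_ascii` is a hypothetical helper name, and the sample string is copied from the output above. If the garbled characters turn out to be ordinary, valid Unicode characters, they round-trip through UTF-8 unchanged, which would suggest they are produced before the file is written (possibly by the character mapping of the fonts embedded in the PDF, though that is an assumption and not confirmed here).

```python
import unicodedata
from collections import Counter


def report_non_ascii(text):
    """Return (char, code point, Unicode name, count) for each non-ASCII character."""
    counts = Counter(ch for ch in text if ord(ch) > 127)
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"), n)
        for ch, n in counts.most_common()
    ]


# Sample line copied from the extracted output shown in the question.
sample = "Test of 共SML ⴱ COMPLEX⫺LRG ⴱCOMPLEX兲 ⬍ 0"

for row in report_non_ascii(sample):
    print(row)

# Valid Unicode text survives a UTF-8 round trip unchanged.
print(sample.encode("utf-8").decode("utf-8") == sample)
```

Running the same report on the text of a correctly converted PDF and on an affected one should show whether the odd characters are limited to specific files.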

snakecharmerb
Florian Schramm
  • When you use copy-paste in the PDF viewer, what do you get? – lenz May 06 '19 at 11:51
  • You should find out whether this is a problem with the PDF docs (e.g. it's the same for all docs from the same source) or with the pdfminer library. – lenz May 06 '19 at 11:53
  • @lenz: it seems to be a problem with certain PDF files. Some were converted correctly, some have the issue described. If I use copy+paste in the PDF viewer, I just get a blank space for all the "weird" characters. – Florian Schramm May 06 '19 at 13:01
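
Since only certain PDF files are affected, one possible workaround (not a fix for the underlying cause) is to post-process the extracted text with a replacement table. The sketch below is built only from the pairs visible in the question; `REPLACEMENTS` and `repair_text` are hypothetical names. The 具 and 典 characters are left out because the intended characters were only shown in an image.

```python
# Replacement pairs taken from the examples in the question; extend as needed.
REPLACEMENTS = str.maketrans({
    "共": "(",
    "兲": ")",
    "⫹": "+",
    "⫺": "-",
    "ⴱ": "*",
    "⬍": "<",
})


def repair_text(text):
    """Replace the known garbled characters with their intended ASCII forms."""
    return text.translate(REPLACEMENTS)


print(repair_text("共⫺1.41兲"))
print(repair_text("Test of 共SML ⴱ COMPLEX⫺LRG ⴱCOMPLEX兲 ⬍ 0"))
```

In the script from the question, this would be applied to `string` before it is written to the text file. Note that the table also rewrites legitimate occurrences of these characters, so it is safer to apply it only to files known to be affected.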

0 Answers