I have downloaded a bunch of PDFs from this source: http://ec.europa.eu/growth/tools-databases/cosing/index.cfm?fuseaction=search.detailsPDF_v2&id=28157

Now I want to scrape the PDFs using PyPDF2, but no text is returned.

I tested the code with another PDF and it worked without a problem.

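import os
import PyPDF2

# Note: PdfFileReader, numPages, getPage and extractText are the PyPDF2 < 3.0 names
folder = 'C:/Users/NAME.NAME/Downloads/Eu/T/'
for filename in os.listdir(folder):
    file_path = os.path.join(folder, filename)
    # The with block closes the file once the text has been read
    with open(file_path, 'rb') as pdf_obj:
        pdf_reader = PyPDF2.PdfFileReader(pdf_obj)
        num_pages = pdf_reader.numPages
        # Collect the text of every page, not only the first one
        text2 = ""
        for current_page in range(num_pages):
            pageObj = pdf_reader.getPage(current_page)
            text2 += pageObj.extractText()
    print(text2)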

1 Answer

This is because PyPDF2 is an inconsistent scraper. Not all PDFs are built the same way, so depending on how a given PDF was constructed, PyPDF2 may or may not be able to extract its text.

When I scrape PDFs, I usually have to switch between PyPDF2, pdfminer, and slate3k, depending on whether PyPDF2 returns any text. I start with PyPDF2 because it is the easiest to use, in my opinion.

My ranking by robustness (how well each package can scrape PDFs), best first:

1.) pdfminer

2.) slate3k

3.) PyPDF2
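
The switching between packages can be automated. Here is a minimal sketch of that idea, not part of any of the three libraries: `extract_with_fallback` and the `(name, function)` pairs it takes are hypothetical names made up for illustration. It assumes each function takes a file path and returns the extracted text as a string.

```python
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical type: any function that maps a PDF path to its extracted text
Extractor = Callable[[str], str]


def extract_with_fallback(
    path: str, extractors: Iterable[Tuple[str, Extractor]]
) -> Tuple[Optional[str], str]:
    """Try each extractor in order and return (name, text) for the first
    one that yields non-blank text, or (None, "") if none of them does."""
    for name, extract in extractors:
        try:
            text = extract(path)
        except Exception:
            # This extractor could not handle the file, so try the next one
            continue
        if text and text.strip():
            return name, text
    return None, ""
```

You would pass it thin wrappers around the PyPDF2, slate3k, and pdfminer code in this answer, in whatever order you prefer, for example easiest first or most robust first.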

Using slate3k:

import glob
import slate3k as slate

all_files = r'C:/Users/NAME.NAME/Downloads/Eu/T/*.pdf'
for filename in glob.glob(all_files):
    with open(filename, 'rb') as f:
        # slate.PDF parses the open file and holds the extracted text
        pdf_text = slate.PDF(f)
        print(pdf_text)

Using pdfminer:

import glob
import io
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage


def convert_pdf_to_txt(path):
    """Return the text of every page in the PDF at `path` as one string."""
    rsrcmgr = PDFResourceManager()
    retstr = io.StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0      # 0 means no page limit
    caching = True
    pagenos = set()   # an empty set means all pages

    # The with block closes the file even if a page fails to parse
    with open(path, 'rb') as fp:
        for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages,
                                      password=password,
                                      caching=caching,
                                      check_extractable=True):
            interpreter.process_page(page)

    text = retstr.getvalue()

    device.close()
    retstr.close()
    return text
    
all_files = r'C:/Users/NAME.NAME/Downloads/Eu/T/*.pdf'

for filename in glob.glob(all_files):
    # Print the extracted text of each PDF
    print(convert_pdf_to_txt(filename))

 

You may need to adapt these functions to get the text in the format you want. As I said, PDFs can be built in many different ways, so the extracted text can come out in many different layouts. But this should point you in the right direction.
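
As one example of that kind of adaptation, here is a small sketch that tidies raw extracted text. `normalize_text` is a hypothetical helper, not part of any of the packages above, and the cleanup rules (collapse runs of spaces and tabs, drop blank lines) are only an assumption about what "the format you want" might be.

```python
import re


def normalize_text(raw: str) -> str:
    """Collapse runs of spaces/tabs and drop blank lines from extracted text."""
    # Turn any run of spaces or tabs into a single space
    text = re.sub(r"[ \t]+", " ", raw)
    # splitlines() also splits on form feeds, which some extractors
    # insert between pages
    lines = [line.strip() for line in text.splitlines()]
    # Keep only lines that still contain something after stripping
    return "\n".join(line for line in lines if line)
```

For instance, `normalize_text(convert_pdf_to_txt(path))` would give you one cleaned line per non-empty line of pdfminer output.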

Edeki Okoh