Python - How to convert many separate PDFs to text?

Question

Question: How can I read in many PDFs from the same folder using the Python package "slate"?

I have a folder with over 600 PDFs.

I know how to use the slate package to convert single PDFs to text, using this code:

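import os
import re

import slate

# names of all files in `path` that end in .pdf
migFiles = [filename for filename in os.listdir(path)
            if re.search(r'\.pdf$', filename) is not None]
with open(os.path.join(path, migFiles[0]), 'rb') as f:
    doc = slate.PDF(f)

len(doc)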

However, this limits you to one PDF at a time, specified by "migFiles[0]", where 0 is the index of the first PDF in my folder.

How can I read in many PDFs to text at once, retaining them as separate strings or txt files? Should I use another package? How could I create a "for loop" to read in all the PDFs in the path?

Whoever voted down, give him a reason too... — Sterling Archer, May 17 '13 at 02:28

David Ding · Answer 1 · 2013-05-19T16:23:30.633

What you can do is use a simple loop:

docs = []
for filename in migFiles:
    with open(os.path.join(path, filename), 'rb') as f:
        docs.append(slate.PDF(f))
        # or, instead of keeping every file in memory, process it right here

Then, docs[i] will hold the pages of the (i+1)-th PDF file, and you can work with them whenever you want. Alternatively, you can process each file inside the for loop.
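
A minimal sketch of the second option, processing each file inside the loop instead of keeping all 600 in memory. The helper name `iter_pdf_texts` and the `extract_pages` parameter are hypothetical: `extract_pages` stands in for whatever turns an open PDF into a list of page strings, which with slate would presumably be `slate.PDF`, as in the question.

```python
import os


def iter_pdf_texts(folder, extract_pages, separator="\n"):
    """Yield (filename, text) for each PDF in folder, one file at a time."""
    for filename in sorted(os.listdir(folder)):
        # skip anything that is not a PDF (the check ignores case)
        if not filename.lower().endswith(".pdf"):
            continue
        with open(os.path.join(folder, filename), "rb") as f:
            # extract_pages is assumed to return a list of page strings
            pages = extract_pages(f)
            yield filename, separator.join(pages)
```

Because this is a generator, only one PDF is open and converted at any moment. Something like `for name, text in iter_pdf_texts(path, slate.PDF): ...` would then handle each document in turn.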

If you want to convert to text, you can do:

docs = []
separator = ' '  # string placed between the contents of consecutive pages;
# use separator = '\n' to start each page's contents on a new line
for filename in migFiles:
    with open(os.path.join(path, filename), 'rb') as f:
        # join the pages into one plain-text string
        docs.append(separator.join(slate.PDF(f)))

or, to write each PDF to its own .txt file:

separator = ' '
for filename in migFiles:
    pdf_path = os.path.join(path, filename)
    with open(pdf_path, 'rb') as f:
        # if pdf_path == "abc.pdf", then pdf_path[:-4] == "abc"
        with open(pdf_path[:-4] + ".txt", 'w') as txtfile:
            txtfile.write(separator.join(slate.PDF(f)))

Thank you. Once I've appended the PDFs to "docs", do you know how I can convert all of the PDFs to text, or write them to .txt, so that I can search and parse them? — EJS, May 19 '13 at 01:34

Burhan Khalid · Answer 2 · 2013-05-19T14:23:09.243

Try this version:

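import glob
import os

import slate

for pdf_file in glob.glob(os.path.join(path, "*.pdf")):
    with open(pdf_file, 'rb') as pdf:
        # same name and directory as the PDF, with a .txt extension
        txt_file = "{}.txt".format(os.path.splitext(pdf_file)[0])
        with open(txt_file, 'w') as txt:
            # slate.PDF gives one string per page; join them before writing
            txt.write("\n".join(slate.PDF(pdf)))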

This will create a text file with the same name as the pdf in the same directory as the pdf file with the converted contents.
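
With several hundred files, one damaged or encrypted PDF could stop the whole run. A sketch of the same idea that records failures and carries on, assuming Python 3. The names `convert_folder` and `extract_pages` are hypothetical, and `extract_pages` is assumed to take an open binary file and return a list of page strings (with slate, presumably `slate.PDF`).

```python
import glob
import os


def convert_folder(folder, extract_pages, separator="\n"):
    """Write a .txt next to every PDF in folder; return (written, failed)."""
    written, failed = [], []
    for pdf_path in sorted(glob.glob(os.path.join(folder, "*.pdf"))):
        txt_path = os.path.splitext(pdf_path)[0] + ".txt"
        try:
            with open(pdf_path, "rb") as pdf:
                text = separator.join(extract_pages(pdf))
        except Exception as exc:
            # remember the file and the reason, then move on to the next PDF
            failed.append((pdf_path, str(exc)))
            continue
        with open(txt_path, "w", encoding="utf-8") as txt:
            txt.write(text)
        written.append(txt_path)
    return written, failed
```

The `failed` list can be inspected afterwards to retry or examine the problem files by hand.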

Or, if you want to keep the contents in memory, try this version; but keep in mind that if the converted content is large you may exhaust your available memory:

import glob
import os

import slate

pdf_as_text = {}

for pdf_file in glob.glob(os.path.join(path, "*.pdf")):
    with open(pdf_file, 'rb') as pdf:
        # key on the bare file name, e.g. "somefile" for ".../somefile.pdf"
        name = os.path.splitext(os.path.basename(pdf_file))[0]
        pdf_as_text[name] = "\n".join(slate.PDF(pdf))

Now you can use pdf_as_text['somefile'] to get the text contents.
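
Since the goal is to search and parse the converted text, here is a rough sketch of a regex search over such a dictionary. It assumes each value is a single string (for example, pages joined with a newline); the function name `search_texts` is made up for illustration.

```python
import re


def search_texts(pdf_as_text, pattern, flags=re.IGNORECASE):
    """Return {name: [matching lines]} for every document containing pattern."""
    regex = re.compile(pattern, flags)
    hits = {}
    for name, text in pdf_as_text.items():
        # keep only the lines in which the pattern occurs
        lines = [line.strip() for line in text.splitlines() if regex.search(line)]
        if lines:
            hits[name] = lines
    return hits
```

For example, `search_texts(pdf_as_text, r"migration")` would map each matching file name to the lines that mention the word, ignoring case by default.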

Thank you. I tried this code, however I received the error message: `Traceback (most recent call last):` `File "", line 1, in ` `TypeError: 'module' object is not callable` Do you know how to solve the "'module' object is not callable" issue for glob here? — EJS, May 19 '13 at 01:08
