Question: How can I read in many PDFs from the same folder using the Python package "slate"?

I have a folder with over 600 PDFs.

I know how to use the slate package to convert single PDFs to text, using this code:

import os
import re

import slate

migFiles = [filename for filename in os.listdir(path)
            if re.search(r'\.pdf$', filename) is not None]
with open(os.path.join(path, migFiles[0]), 'rb') as f:
    doc = slate.PDF(f)

len(doc)  # number of pages

However, this handles only one PDF at a time, the one selected by "migFiles[0]", where index 0 is the first PDF in my folder.

How can I convert many PDFs to text at once, keeping them as separate strings or .txt files? Should I use another package? How could I write a "for loop" to read in all the PDFs in the folder?

EJS

2 Answers

What you can do is use a simple loop:

docs = []
for filename in migFiles:
    with open(os.path.join(path, filename), 'rb') as f:
        docs.append(slate.PDF(f))  # slate.PDF gives one string per page
        # or, instead of keeping every file in memory, process it right here

Then docs[i] holds the pages of the (i+1)-th PDF file as a list of strings, and you can work with them whenever you want. Alternatively, you can process each file inside the for loop.

If you want to convert to text, you can do:

docs = []
# String placed between the contents of consecutive pages;
# use separator = '\n' to put each page on its own line.
separator = ' '
for filename in migFiles:
    with open(os.path.join(path, filename), 'rb') as f:
        docs.append(separator.join(slate.PDF(f)))  # join the pages into plain text

or

separator = ' '
for filename in migFiles:
    with open(os.path.join(path, filename), 'rb') as f:
        # if filename == "abc.pdf", filename[:-4] == "abc"
        with open(os.path.join(path, filename[:-4] + ".txt"), 'w') as txtfile:
            txtfile.write(separator.join(slate.PDF(f)))
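
The loops above can also be wrapped in a small reusable function. This is only a sketch: `convert_folder` and its `extract_pages` parameter are names made up for illustration, not part of slate. The extractor is assumed to take a file opened in binary mode and return the page texts; with slate installed, passing `slate.PDF` should fit that shape.

```python
import os


def convert_folder(folder, extract_pages, separator="\n"):
    """Write a .txt file next to every .pdf in folder; return the .txt paths.

    extract_pages is a hypothetical hook: any callable that takes an open
    binary file and returns an iterable of page strings.
    """
    written = []
    for name in sorted(os.listdir(folder)):
        if not name.lower().endswith(".pdf"):
            continue  # skip anything that is not a PDF
        pdf_path = os.path.join(folder, name)
        txt_path = os.path.splitext(pdf_path)[0] + ".txt"
        with open(pdf_path, "rb") as pdf:
            # Join the per-page strings into one text for this file.
            text = separator.join(extract_pages(pdf))
        with open(txt_path, "w") as txt:
            txt.write(text)
        written.append(txt_path)
    return written
```

Assuming slate is installed, a call would look like `convert_folder(path, slate.PDF)`. Keeping the extractor as a parameter means the loop can be tried out with a stand-in function before running it on 600 real files.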
David Ding
  • Thank you. Once I've appended the PDFs to "docs", do you know how I can convert all of the PDFs to text, or write them to .txt, so that I can search and parse them? – EJS May 19 '13 at 01:34

Try this version:

import glob
import os

import slate

for pdf_file in glob.glob("{}/{}".format(path, "*.pdf")):
    with open(pdf_file, 'rb') as pdf:
        txt_file = "{}.txt".format(os.path.splitext(pdf_file)[0])
        with open(txt_file, 'w') as txt:
            # slate.PDF gives one string per page; join them before writing
            txt.write("\n".join(slate.PDF(pdf)))

This creates, next to each PDF, a text file with the same base name that holds the converted contents.

Or, if you want to keep the contents in memory, try this version. Keep in mind that if the converted content is large, you may exhaust your available memory:

import glob
import os

import slate

pdf_as_text = {}

for pdf_file in glob.glob("{}/{}".format(path, "*.pdf")):
    with open(pdf_file, 'rb') as pdf:
        file_without_extension = os.path.splitext(pdf_file)[0]
        # join the per-page strings into one text per file
        pdf_as_text[file_without_extension] = "\n".join(slate.PDF(pdf))

Now you can use pdf_as_text[key] to get the text contents, where key is the PDF's path without the ".pdf" extension (for example, path + '/somefile').
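
Once the texts sit in a dictionary like this, searching them takes only a few lines. A minimal sketch, where `find_keyword` is a hypothetical helper and the `sample` dictionary is made-up data standing in for real extracted text:

```python
def find_keyword(texts, keyword):
    """Return {name: [matching lines]} for every text containing keyword.

    texts maps a name to its full text; matching is case-insensitive.
    """
    needle = keyword.lower()
    hits = {}
    for name, text in texts.items():
        lines = [line.strip() for line in text.splitlines()
                 if needle in line.lower()]
        if lines:
            hits[name] = lines
    return hits


# Made-up sample data in place of real extracted PDF text.
sample = {
    "report_a": "Migration rose in 2012.\nNothing else to note.",
    "report_b": "No relevant content here.",
}
print(find_keyword(sample, "migration"))
```

Splitting on lines keeps each hit short enough to read; for more involved parsing, the same loop could apply a regular expression instead of a substring test.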

Burhan Khalid
  • Thank you. I tried this code, but I received the error message: `Traceback (most recent call last):` `File "", line 1, in ` `TypeError: 'module' object is not callable` Do you know how to solve the "'module' object is not callable" issue for glob here? – EJS May 19 '13 at 01:08
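
For reference, Python raises this message whenever a module itself is called as if it were a function. Pinning down the cause in the session above would need the full traceback, so the sketch below only reproduces the general pattern, using `glob` as the example:

```python
import glob

try:
    glob("*.pdf")  # wrong: glob here is the module, not the function
except TypeError as exc:
    message = str(exc)
    print(message)

# Correct: call the function that lives inside the module.
matches = glob.glob("*.pdf")
```

The same applies to any name that refers to a module: the callable is the attribute inside it (here `glob.glob`), not the module object.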