I have a question about limiting the running time of a function in Python. I am using pdfminer to convert .pdf files to .txt.

The problem is that some files cannot be decoded and take an extremely long time to process. So I want to use threading.Timer() to limit the conversion time for each file to 5 seconds. I am running on Windows, so I cannot use the signal module for this (signal.alarm() is not available there).

I succeeded in running the conversion with pdfminer.convert_pdf_to_txt() (in my code it is "c"), but I am not sure that threading.Timer() works in the code below. (I don't think it actually limits the processing time for each file.)

In summary, I want to:

  1. Convert the pdf to txt

  2. Limit each conversion to 5 seconds; if it runs out of time, raise an exception and save an empty file

  3. Save all the txt files under the same folder

  4. If there are any exceptions/errors, still save the file but with empty content.

Here is the current code:

import converter as c
import os
import timeit
import time
import threading
import thread

yourpath = 'D:/hh/'

def iftimesout():
    print("no")

    with open("D:/f/"+g+"&"+t+"&"+name+".txt", mode="w") as newfile:
        newfile.write("")


for root, dirs, files in os.walk(yourpath, topdown=False):
    for name in files:
        try:
            timer = threading.Timer(5.0, iftimesout)
            timer.start()
            t = os.path.split(os.path.dirname(os.path.join(root, name)))[1]
            a = str(os.path.split(os.path.dirname(os.path.join(root, name)))[0])
            g = str(a.split("\\")[1])

            with open("D:/f/"+g+"&"+t+"&"+name+".txt", mode="w") as newfile:
                newfile.write(c.convert_pdf_to_txt(os.path.join(root, name)))
                print("yes")

            timer.cancel()

        except KeyboardInterrupt:
            raise

        except:
            for name in files:
                t = os.path.split(os.path.dirname(os.path.join(root, name)))[1]
                a = str(os.path.split(os.path.dirname(os.path.join(root, name)))[0])

                g = str(a.split("\\")[1])
                with open("D:/f/"+g+"&"+t+"&"+name+".txt", mode="w") as newfile:
                    newfile.write("")
SXC88
  • Will think on it one more time :) – linusg Nov 22 '16 at 18:04
  • @linusg that's so nice! Thx :)) – SXC88 Nov 22 '16 at 18:05
  • This should do it, finally :) – linusg Nov 22 '16 at 18:52
  • @SXC88, I have no experience with `pdfminer`, but I've checked that it contains no `convert_pdf_to_txt()` method neither `converter.convert_pdf_to_txt()`... Do you mean `pdfminer.PDFConverter`? – Andersson Nov 23 '16 at 09:17
  • Hi I just posted the converter.convert_pdf_to_txt() function below if you want to have a look, but I can actually convert all those files without problem but once I try to add time constraints to it, the code doesn't work properly... @Andersson – SXC88 Nov 23 '16 at 15:34
  • @SXC88 - I finally got it. See my totally updated answer! – linusg Nov 23 '16 at 17:26

2 Answers

I finally figured it out!

First of all, define a function to call another function with a limited timeout:

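import multiprocessing

def call_timeout(timeout, func, args=(), kwargs=None):
    # Returns True if func finished within timeout seconds, False if it had
    # to be terminated (and None if the arguments were invalid).
    if type(timeout) not in [int, float] or timeout <= 0.0:
        print("Invalid timeout!")

    elif not callable(func):
        print("{} is not callable!".format(type(func)))

    else:
        # kwargs defaults to None to avoid a shared mutable default argument
        p = multiprocessing.Process(target=func, args=args, kwargs=kwargs or {})
        p.start()
        p.join(timeout)  # wait at most timeout seconds

        if p.is_alive():
            p.terminate()  # still running: kill the child process
            p.join()       # wait until it has really exited
            return False
        else:
            return True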

What does the function do?

  • Check that the timeout and the function are valid
  • Start the given function in a new process; unlike a thread, a process can be terminated from outside
  • Block the program for up to timeout seconds (p.join(timeout)) while the function runs
  • When join() returns, check whether the function is still running

    • Yes: terminate it and return False
    • No: fine, no timeout! Return True
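
To illustrate why a process is used here instead of threading.Timer(): a timer only runs its callback in a second thread after the delay, and it does not interrupt the code that is already running. The sketch below is a minimal demonstration; slow_task and on_timeout are made-up names, and time.sleep() stands in for a slow conversion.

```python
import threading
import time

events = []


def on_timeout():
    # Runs in a separate thread once the timer fires; it cannot stop slow_task().
    events.append("timer fired")


def slow_task():
    # Stand-in for a slow PDF conversion (assumption: 0.3 s of work).
    time.sleep(0.3)
    events.append("task finished")


timer = threading.Timer(0.1, on_timeout)
timer.start()
slow_task()     # keeps running although the timer fires after 0.1 s
timer.cancel()  # has no effect here: the timer has already fired
timer.join()

print(events)
```

Both entries end up in the list: the timer fires first, and the task still runs to completion afterwards, so nothing has been limited.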

We can test it with time.sleep():

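import time

# The guard is required on Windows, where child processes re-import this module
if __name__ == "__main__":
    finished = call_timeout(2, time.sleep, args=(1, ))
    if finished:
        print("No timeout")
    else:
        print("Timeout")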

Here the function needs one second to finish and the timeout is set to two seconds, so the output is:

No timeout

If we run time.sleep(10) and set the timeout to two seconds:

finished = call_timeout(2, time.sleep, args=(10, ))

Result:

Timeout

Notice that the call returns after two seconds, without the called function having finished.

Your final code will look like this:

import os
import multiprocessing

import converter as c

yourpath = 'D:/hh/'

def call_timeout(timeout, func, args=(), kwargs=None):
    if type(timeout) not in [int, float] or timeout <= 0.0:
        print("Invalid timeout!")

    elif not callable(func):
        print("{} is not callable!".format(type(func)))

    else:
        p = multiprocessing.Process(target=func, args=args, kwargs=kwargs or {})
        p.start()
        p.join(timeout)

        if p.is_alive():
            p.terminate()
            p.join()
            return False
        else:
            return True

def convert(root, name, g, t):
    # Runs in the child process
    with open("D:/f/"+g+"&"+t+"&"+name+".txt", mode="w") as newfile:
        newfile.write(c.convert_pdf_to_txt(os.path.join(root, name)))

# The guard is required on Windows: each child process re-imports this module,
# so the loop must not run at import time.
if __name__ == "__main__":
    for root, dirs, files in os.walk(yourpath, topdown=False):
        for name in files:
            try:
                t = os.path.split(os.path.dirname(os.path.join(root, name)))[1]
                a = str(os.path.split(os.path.dirname(os.path.join(root, name)))[0])
                g = str(a.split("\\")[1])
                finished = call_timeout(5, convert, args=(root, name, g, t))

                if finished:
                    print("yes")
                else:
                    print("no")

                    # Timed out: replace whatever was written with an empty file
                    with open("D:/f/"+g+"&"+t+"&"+name+".txt", mode="w") as newfile:
                        newfile.write("")

            except KeyboardInterrupt:
                raise

            except Exception:
                # Any other error: save an empty file for the current file only
                t = os.path.split(os.path.dirname(os.path.join(root, name)))[1]
                a = str(os.path.split(os.path.dirname(os.path.join(root, name)))[0])

                g = str(a.split("\\")[1])
                with open("D:/f/"+g+"&"+t+"&"+name+".txt", mode="w") as newfile:
                    newfile.write("")

The code should be easy to understand; if not, feel free to ask.

I really hope this helps (as it took some time for us to get it right ;))!
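
One caveat: an exception raised inside the child process never reaches the except clause in the parent, and call_timeout() still returns True because the process has ended. If the two cases need to be told apart, Process.exitcode can be inspected after join(). The following is only a rough sketch of that idea, not part of the code above; classify, run_with_status and broken are hypothetical names introduced for illustration.

```python
import multiprocessing
import time


def classify(still_alive, exitcode):
    """Map the child's state after join(timeout) to 'timeout', 'ok' or 'error'."""
    if still_alive:
        return "timeout"
    # exitcode is 0 after a clean return and non-zero after an uncaught exception
    return "ok" if exitcode == 0 else "error"


def run_with_status(timeout, func, args=()):
    """Hypothetical variant of call_timeout() that also reports child failures."""
    p = multiprocessing.Process(target=func, args=args)
    p.start()
    p.join(timeout)
    still_alive = p.is_alive()
    if still_alive:
        p.terminate()
        p.join()  # wait until the terminated child has really exited
    return classify(still_alive, p.exitcode)


def broken():
    # Stand-in for a conversion that fails with an exception.
    raise ValueError("simulated conversion failure")


if __name__ == "__main__":  # needed on Windows, as in the script above
    print(run_with_status(2, time.sleep, (0.1,)))   # finishes in time
    print(run_with_status(0.5, time.sleep, (5,)))   # has to be terminated
    print(run_with_status(2, broken))               # child raises
```

With something like this, the main loop could write the empty file for both the "timeout" and the "error" status instead of relying on try/except in the parent.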

linusg

Check the following code and let me know if there are any issues. Also let me know whether you still want the forced-termination feature (KeyboardInterrupt).

import os
import time
from multiprocessing import Process

from converter import convert_pdf_to_txt  # the asker's converter module

path_to_pdf = "C:\\Path\\To\\Main\\PDFs" # No "\\" at the end of path!
path_to_text = "C:\\Path\\To\\Save\\Text\\" # There is "\\" at the end of path!
TIMEOUT = 5  # seconds
TIME_TO_CHECK = 1  # seconds


# Save PDF content into text file or save empty file in case of conversion timeout
def convert(path_to, my_pdf):
    my_txt = text_file_name(my_pdf)
    with open(my_txt, "w") as my_text_file:
        try:
            my_text_file.write(convert_pdf_to_txt(os.path.join(path_to, my_pdf)))
        except Exception:
            print("Error. %s file wasn't converted" % my_pdf)


# Convert file_name.pdf from PDF folder to file_name.txt in Text folder
def text_file_name(pdf_file):
    # splitext() strips only the last extension, so names containing dots stay intact
    return path_to_text + os.path.splitext(pdf_file)[0] + ".txt"


if __name__ == "__main__":
    # for each pdf file in PDF folder
    for root, dirs, files in os.walk(path_to_pdf, topdown=False):
        for my_file in files:
            count = 0
            p = Process(target=convert, args=(root, my_file,))
            p.start()
            # some delay to be sure that text file created
            while not os.path.isfile(text_file_name(my_file)):
                time.sleep(0.001)
            while True:
                # if not run out of $TIMEOUT and file still empty: wait for $TIME_TO_CHECK,
                # else: stop the process and start new iteration
                if count < TIMEOUT and os.stat(text_file_name(my_file)).st_size == 0:
                    count += TIME_TO_CHECK
                    time.sleep(TIME_TO_CHECK)
                else:
                    p.terminate()
                    p.join()  # wait until the process has really exited
                    break
Andersson
  • Hi I have a new post if you want to have a look ;)) http://stackoverflow.com/questions/40828450/python-copy-folder-structure-under-another-directory @Andersson – SXC88 Nov 27 '16 at 11:34