
I'm creating a Python script that parses a large (but simple) CSV.

It'll take some time to process. I would like the ability to interrupt the parsing of the CSV so I can continue at a later stage.

Currently I have this, which lives in a larger class (unfinished):

Edit:

I have some changed code, but bear in mind the system will parse over 3 million rows.

def parseData(self):
    reader = csv.reader(open(self.file))
    for id, title, disc in reader:
        print "%-5s %-50s %s" % (id, title, disc)
        l = LegacyData()          # Django-style model instance
        l.old_id = int(id)
        l.name = title
        l.disc_number = disc
        l.parsed = False          # flag used later to find unparsed rows
        l.save()

This is the old code:

def parseData(self):
    # the first line holds the field names
    fields = self.data.next()
    for row in self.data:
        items = zip(fields, row)
        item = {}
        for (name, value) in items:
            item[name] = value.strip()
        self.save(item)

Thanks guys.

Glycerine
  • Define interrupt. Do you want to interrupt and restart the process? Or interrupt this script from another part of your python code? What is it that you're trying to accomplish? – Falmarri Jan 05 '11 at 00:16
  • Windows or Linux? If Linux-only, you can [stop/cont the process](http://tombuntu.com/index.php/2007/11/23/how-to-pause-a-linux-process/). – moinudin Jan 05 '11 at 00:25
  • I'm sorry I didn't explain myself correctly. Primarily I would like to just use this in the command line. Therefore when I interrupt the process - the script will understand and perform an action - such as call a def before death. Make sense? – Glycerine Jan 05 '11 at 00:28
  • @marcog - Windows only I'm afraid – Glycerine Jan 05 '11 at 00:28
  • @Glycerine In future, please tag as Windows. I've retagged it for you. – moinudin Jan 05 '11 at 00:30
  • I'm not a windows guy, but there's probably no easy way of doing this. You'll have to keep the process around (you can't kill it) between runs. Otherwise, you'll have to store the data you've already parsed to a file, which completely defeats the purpose. – Falmarri Jan 05 '11 at 00:56
  • In both of your code samples you call a `save()` member that does something with the processed data. In order to restart the process, its state with regard to both the input and output will have to be restored. For the input file it may be possible to do this with a `seek()` to a stored position (see the sketch after these comments). However, it's unclear what to do with the processed data, since we have no idea what happened to it in `save()`. – martineau Jan 05 '11 at 18:49
  • @Falmarri - Plus one dude for hitting the nail. I've been playing around with it. But yes - I don't want to just murder the process mid-flow. But as long as I can capture an event taking place whilst this is running, I may deal with killing the process once it's finished. – Glycerine Jan 05 '11 at 19:31
  • @martineau - Focusing on the updated (Edit:) code. This is a data model; the save() is the method on the model. So: looping millions of records, saving each one, whilst allowing human interaction. – Glycerine Jan 05 '11 at 19:34
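
A minimal sketch of the `seek()` idea from the comments, assuming we are free to pick a checkpoint file (the CHECKPOINT name and the parse_resumable helper below are illustrative, not from the question):

import csv, os

CHECKPOINT = "parse.pos"  # hypothetical file holding the last byte offset

def parse_resumable(path):
    f = open(path, "rb")
    # resume from the stored byte offset, if a previous run left one
    if os.path.exists(CHECKPOINT):
        f.seek(int(open(CHECKPOINT).read()))
    # feed csv.reader via readline so f.tell() stays accurate; plain file
    # iteration uses a read-ahead buffer that makes tell() unreliable
    reader = csv.reader(iter(f.readline, ""))
    for row in reader:
        # ... process the row, e.g. save a LegacyData record ...
        open(CHECKPOINT, "w").write(str(f.tell()))  # checkpoint after each row

Writing the offset on every row is wasteful over 3 million rows; checkpointing every few thousand rows would be the obvious tweak.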

4 Answers


If under Linux, hit Ctrl-Z to stop the running process. Type `fg` to bring it back and resume where you stopped it.
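
For reference, the same stop/continue can also be sent from Python on a POSIX system, assuming you know the parser's PID (12345 below is a placeholder):

import os, signal

pid = 12345  # placeholder: PID of the running parser

os.kill(pid, signal.SIGSTOP)  # suspend it, same effect as Ctrl-Z
# ... later, from the same or another script ...
os.kill(pid, signal.SIGCONT)  # resume it, same effect as fg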

user318904

You can use the signal module to catch the event. This is a mockup of a parser that can catch CTRL-C on Windows and stop parsing:

import signal, time, sys

class Interrupted(Exception): pass

def onInterrupt(signum, frame):
    raise Interrupted()

# CTRL-C arrives as SIGINT on both Windows and Linux; the CTRL_C_EVENT
# constant is only meant for os.kill, not for signal.signal
try:
    signal.signal(signal.SIGINT, onInterrupt)
except ValueError:
    pass

class InterruptableParser(object):

    def __init__(self, previous_parsed_lines=0):
        self.parsed_lines = previous_parsed_lines

    def _parse(self, line):
        # do stuff
        time.sleep(1)  # mock up
        self.parsed_lines += 1
        print 'parsed %d' % self.parsed_lines

    def parse(self, filelike):
        for line in filelike:
            try:
                self._parse(line)
            except Interrupted:
                print 'caught interrupt'
                self.save()
                print 'exiting ...'
                sys.exit(0)

    def save(self):
        # do what you need to save state
        # like write the parsed_lines count to a file, maybe
        pass

parser = InterruptableParser()
parser.parse([1, 2, 3])

Can't test it though as I'm on Linux at the moment.

nate c
  • Looks basically like the right idea, although it's a little thin about restarting the actual csv parsing in the middle. A nice feature would be to have the `save()` method remember where it left off and have it automatically restart itself at the point it was interrupted. – martineau Jan 05 '11 at 02:23
  • @martineau: My point is not to make production code - just to show the idea :-). Save state like the number of lines parsed, and then take that as an optional argument in `__init__`? – nate c Jan 05 '11 at 04:22
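
A sketch of that resume-by-line-count idea, reusing the InterruptableParser from the answer above (the STATE_FILE name, and the assumption that save() wrote the count into it, are illustrative):

import itertools

STATE_FILE = "parsed_lines.txt"  # hypothetical file written by save()

def resume(filelike):
    try:
        done = int(open(STATE_FILE).read())
    except IOError:
        done = 0  # no previous run; start from the top
    parser = InterruptableParser(previous_parsed_lines=done)
    # skip the lines the previous run already handled
    parser.parse(itertools.islice(filelike, done, None))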

The way I'd do it:

Put the actual processing code in a class, and on that class I'd implement the pickle protocol (http://docs.python.org/library/pickle.html) (basically, write proper `__getstate__` and `__setstate__` functions).

This class would accept the filename, and keep the open file and the CSV reader instance as instance members. The `__getstate__` method would save the current file position, and `__setstate__` would reopen the file, seek it to the proper position, and create a new reader.

I'd perform the actual work in an `__iter__` method that would yield to an external function after each line was processed.

This external function would run a "main loop" monitoring input for interrupts (sockets, keyboard, the state of a specific file on the filesystem, etc.) - with everything quiet, it would just call for the next iteration of the processor. If an interrupt happens, it would pickle the processor state to a specific file on disk.

When starting, the program just has to check if there is a saved execution; if so, use pickle to retrieve the executor object and resume the main loop.

Here goes some (untested) code - the idea is simple enough:

from cPickle import load, dump
import csv
import os, sys

SAVEFILE = "running.pkl"
STOPNOWFILE = "stop.now"

class Processor(object):
    def __init__(self, filename):
        self.file = open(filename, "rt")
        self.reader = csv.reader(self.file)
    def __iter__(self):
        for line in self.reader:  # the reader itself is the iterable
            # do stuff
            yield None
    def __getstate__(self):
        # pickle just enough to recreate the reader: file name and position
        return (self.file.name, self.file.tell())
    def __setstate__(self, state):
        self.file = open(state[0], "rt")
        self.file.seek(state[1])
        self.reader = csv.reader(self.file)

def check_for_interrupts():
    # Use your imagination here!
    # One simple thing would be to check for the existence of a specific
    # file on disk.
    # But you could go all the way up to instantiating a tcp server and
    # listening for interruptions on the network.
    if os.path.exists(STOPNOWFILE):
        return True
    return False

def main():
    if os.path.exists(SAVEFILE):
        with open(SAVEFILE, "rb") as savefile:
            processor = load(savefile)
        os.unlink(SAVEFILE)
    else:
        # Assumes the name of the .csv file to be passed on the command line
        processor = Processor(sys.argv[1])
    for line in processor:
        if check_for_interrupts():
            with open(SAVEFILE, "wb") as savefile:
                dump(processor, savefile)
            break

if __name__ == "__main__":
    main()
jsbueno
  • The use of the `pickle` protocol to save Processor state and making its instances iterable are very clever (and feasible) ideas. However the idea of polling for interrupts isn't going to work for asynchronous interrupts such as those that result from typing CTRL+C in Windows. To handle them and save the state it'll be necessary to have a `try/except` block in effect at some level. – martineau Jan 05 '11 at 19:15
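
A sketch of what that combination could look like - a drop-in variant of the main() above, reusing the answer's Processor, check_for_interrupts, SAVEFILE, and imports:

def save_state(processor):
    with open(SAVEFILE, "wb") as savefile:
        dump(processor, savefile)

def main():
    if os.path.exists(SAVEFILE):
        with open(SAVEFILE, "rb") as savefile:
            processor = load(savefile)
        os.unlink(SAVEFILE)
    else:
        processor = Processor(sys.argv[1])
    try:
        for line in processor:
            if check_for_interrupts():  # cooperative, polled interrupt
                save_state(processor)
                break
    except KeyboardInterrupt:           # asynchronous CTRL+C
        save_state(processor)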

My Complete Code

I followed the advice of @jsbueno with a flag - but instead of another file, I kept it within the class as a variable:

I create a class - when called, it asks for any input and then begins another process doing my work. As it's looped, if I press a key the flag is set, but it's only checked when the loop comes around for the next parse - thus I don't kill the current action. Adding a parsed flag in the database for each object from the data means I can start this at any time and resume where I left off.

from threading import Thread  # a thread, not multiprocessing.Process: the
                              # worker must share process_flag with __init__,
                              # and a separate process would only see its own
                              # copy of the flag
from time import sleep

class MultithreadParsing(object):

    process = None
    process_flag = True

    def f(self):
        print "\nMultithreadParsing has started\n"
        while self.process_flag:
            ''' get my object from database (LegacyData is the Django model
                from the question) '''
            legacy = LegacyData.objects.filter(parsed=False)[0:1]

            if legacy:
                print "Processing: %s %s" % (legacy[0].name, legacy[0].disc_number)
                for l in legacy:
                    ''' ... Do what I want it to do ...'''
                sleep(1)
            else:
                self.process_flag = False
                print "Nothing to parse"

    def __init__(self):
        self.process = Thread(target=self.f)
        self.process.start()
        print self.process
        a = raw_input("Press any key to stop \n")
        print "\nKILL FLAG HAS BEEN SENT\n"

        if a:
            print "\nKILL\n"
            self.process_flag = False

Thanks for all your help guys (especially yours @jsbueno) - if it wasn't for you I wouldn't have got this class idea.

Glycerine