
I have been trying to read a large file and, after processing the data, write the results to another file at the same time. Each file is pretty huge, around 4-8 GB. Is there a way to parallelise the process to save time?

The original program is:

# requestp, IP and MACp are precompiled regular expressions
# (their definitions are not shown here)
with open(infile, "r") as filein:
    with open(writefile, "w") as filewrite:
        with open(errorfile, "w") as fileerror:
            count = 0
            filewrite.write("Time,Request,IP,MAC\n")
            # iterating over the file directly avoids the readline() at the
            # top of the old loop, which skipped the first line of input
            for line in filein:
                count += 1
                # print("{}: {}".format(count, line.strip()))  # debug output
                if requestp.search(line):
                    filewrite.write(line.strip()[:15] + ",")
                    filewrite.write(requestp.search(line).group() + ",")
                    if IP.search(line):
                        filewrite.write(IP.search(line).group())
                    filewrite.write(",")
                    if MACp.search(line):
                        filewrite.write(MACp.search(line).group())
                    filewrite.write("\n")
                else:
                    fileerror.write(line)

But this takes too much time to process a single file, and I have hundreds of such files. I have tried using ipyparallel to parallelise the code, but have not had any success yet. Is there a way to do this?

Harsh Sharma
  • split your input file in chunks, send each chunk to a distinct process, and merge the results. Basically, use the map/reduce pattern (see the first sketch after these comments). – bruno desthuilliers Jun 08 '18 at 11:08
  • IMHO you should not try to parallelize sequential io. At most you could try to split in three: reading, processing, writing to use the time where io operations are blocking to do the processing. – Serge Ballesta Jun 08 '18 at 11:44
  • @brunodesthuilliers can the files be split inside Python itself? – Harsh Sharma Jun 08 '18 at 12:17
  • @SergeBallesta how do I do that? I couldn't find how to split reading, processing and writing. – Harsh Sharma Jun 08 '18 at 12:21
  • At the simplest level, 3 processes connected by pipes (see the second sketch below). – Serge Ballesta Jun 08 '18 at 12:25
  • It might help if you actually said what you are trying to do, if you showed sample lines of input and corresponding output, if you stated your OS... I suspect it would go considerably faster with `awk`, if your OS has that, and with GNU Parallel if your OS has that, and if you used a different disk for input and output, if you have that. – Mark Setchell Jun 09 '18 at 15:56
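
Two sketches of the suggestions above. First, bruno desthuilliers' map/reduce idea: since there are hundreds of independent files, the simplest map step is to hand whole files to worker processes with multiprocessing.Pool rather than splitting one file into chunks. This is only a minimal sketch; process_file, the placeholder patterns, and the jobs list are assumptions, not code from the question.

import re
from multiprocessing import Pool

# Placeholder patterns -- the real requestp, IP and MACp are not
# shown in the question.
requestp = re.compile(r"GET|POST")
IP = re.compile(r"\d{1,3}(?:\.\d{1,3}){3}")
MACp = re.compile(r"(?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}")

def process_file(paths):
    """The sequential loop from the question, wrapped so that one
    call handles one (input, output, error) file triple."""
    infile, writefile, errorfile = paths
    with open(infile) as filein, \
         open(writefile, "w") as filewrite, \
         open(errorfile, "w") as fileerror:
        filewrite.write("Time,Request,IP,MAC\n")
        for line in filein:
            m = requestp.search(line)
            if m:
                ip = IP.search(line)
                mac = MACp.search(line)
                filewrite.write("{},{},{},{}\n".format(
                    line.strip()[:15], m.group(),
                    ip.group() if ip else "",
                    mac.group() if mac else ""))
            else:
                fileerror.write(line)
    return infile

if __name__ == "__main__":
    # jobs would be built from the real file names
    jobs = [("in1.log", "out1.csv", "err1.log"),
            ("in2.log", "out2.csv", "err2.log")]
    with Pool() as pool:    # one worker per CPU core by default
        for done in pool.imap_unordered(process_file, jobs):
            print("finished", done)

Pool() defaults to one worker per CPU core; if the disk rather than the CPU is the bottleneck, fewer workers may actually be faster.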

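Second, Serge Ballesta's read/process/write split: a minimal sketch using multiprocessing.Queue objects as the "pipes" between three processes. parse_line is a hypothetical stand-in for the regex work above, and the file names are placeholders.

from multiprocessing import Process, Queue

def parse_line(line):
    # hypothetical stand-in for the requestp/IP/MACp matching
    return line.upper()

def reader(path, q):
    with open(path) as f:
        for line in f:
            q.put(line)
    q.put(None)                  # sentinel: end of input

def worker(q_in, q_out):
    while True:
        line = q_in.get()
        if line is None:
            q_out.put(None)      # pass the sentinel downstream
            break
        q_out.put(parse_line(line))

def writer(path, q):
    with open(path, "w") as f:
        while True:
            item = q.get()
            if item is None:
                break
            f.write(item)

if __name__ == "__main__":
    # bounded queues stop the reader racing ahead and filling memory
    raw, parsed = Queue(maxsize=10000), Queue(maxsize=10000)
    stages = [Process(target=reader, args=("big.log", raw)),
              Process(target=worker, args=(raw, parsed)),
              Process(target=writer, args=("out.csv", parsed))]
    for p in stages:
        p.start()
    for p in stages:
        p.join()

This overlaps the blocking I/O in the reader and writer with the regex work in the middle process, which is the most that can be gained without chunking the file itself.
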
0 Answers