
I have a file of around 1.5 GB and I want to divide the file into chunks so that I can use multiprocessing to process each chunk with the pp (parallel python) module. So far I have used f.seek in Python, but it takes a lot of time; maybe seek advances the position byte by byte? What would be an alternative way? Can I do this through mrjob (the map-reduce package for Python)?

Sample code: I am doing something like this

def multi(i, slots, file_name, date):
    f1 = open(date + '/' + file_name, "rb")
    f1.seek(i * slots * 69)
    data = f1.read(69)
    counter = 0
    print 'process', i
    while counter < slots:
        ## do some processing
        counter += 1
        data = f1.read(69)

Each row of my file is a 69-byte tuple, and the multi function is called in parallel n times (here n is equal to slots) to do the job.

  • If you have several 1 GB files, you could have each thread process one such file. Maybe there is no need to divide these files into chunks? – Colin Bernet Mar 03 '14 at 10:17
  • No, if I have to process only 1 file, I have to divide it into chunks – Aman Jagga Mar 03 '14 at 10:24
  • Ok, then the fact that you have several files is not relevant in your question. I would suggest to edit your post, so that it starts with: "I have a file of around 1 GB..." etc. – Colin Bernet Mar 03 '14 at 10:30
  • Your question makes no sense without some sample code. If you're seeking byte-by-byte then you're obviously doing something wrong. You need to explain what your code is trying to do (i.e. what problem you're trying to solve) and how you're trying to do it. Without some sample code, there's no way that we can help you. – Jim Mischel Mar 03 '14 at 11:07
  • I have added the sample code to the question, @Jim Mischel – Aman Jagga Mar 03 '14 at 14:47

2 Answers


Why not open multiple handles to the file? That way, you only need to 'seek' once per handle.

f1 = open('file')

f2 = open('file')
f2.seek(100) # in practice the number would be <file size>/<no of threads>

f3 = open('file')
f3.seek(200)
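
A rough sketch (not part of this answer) of how the per-handle seek could be combined with multiprocessing for the 69-byte records from the question; the chunk arithmetic, worker count and function names are illustrative assumptions. Note that seek jumps straight to the requested offset, so each worker needs only one seek.

import os
from multiprocessing import Pool

RECORD_SIZE = 69  # fixed record size taken from the question

def process_chunk(args):
    # each worker opens its own handle and seeks once to its chunk start
    path, start, n_records = args
    f = open(path, 'rb')
    f.seek(start)
    for _ in range(n_records):
        record = f.read(RECORD_SIZE)
        if not record:
            break
        # ... process the 69-byte record here ...
    f.close()

def run(path, workers=4):
    total_records = os.path.getsize(path) // RECORD_SIZE
    per_worker = total_records // workers   # remainder records left out for brevity
    jobs = [(path, i * per_worker * RECORD_SIZE, per_worker)
            for i in range(workers)]
    pool = Pool(workers)
    pool.map(process_chunk, jobs)
    pool.close()
    pool.join()

As the comments below point out, if the disk has a single I/O channel this may not beat a single reader, since the workers still contend for the same device.
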
  • I am doing it the same way, but as I mentioned it takes time. In fact, due to this, processing a single file with multiprocessing takes about 40 seconds, while with no multiprocessing it takes only 25 seconds. – Aman Jagga Mar 03 '14 at 10:36
  • Having multiple threads contend to read the same file is usually a recipe for decreased performance, unless you have specialised hardware and file-system support including multiple I/O channels. If your file sits on a hard disk attached to a computer with a single I/O channel (which is typical of many desktops) you would probably be better off having one thread read the file, then parcel it up into chunks for other processes to deal with. And read as much into memory as you can at each opportunity. – High Performance Mark Mar 03 '14 at 12:57
  • That makes a lot of sense. – Jayanth Koushik Mar 03 '14 at 12:58
  • But then this code is comparatively taking less time when used with multiprocessing: def multi(i, slots, file_name, date): f1=open(date+'/'+file_name,"rb"); data=f1.read(69); while data: ##do some processing; data=f1.read(69) – Aman Jagga Mar 03 '14 at 15:03

The simplest way to do that is to have a common function that reads a record and returns it. But that function is protected with a lock. Something like the below. Note that I'm not a Python programmer, so you'll have to interpret my pseudo code.

f = open file
l = new lock

function read
    acquire lock
        read record
    release lock
    return record

Now, start a few threads, but no more than you have processor cores, each one of which does this:

while not end of file
    record = read();
    process record

So instead of starting a new thread for every record, you have a handful of persistent threads.
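
A rough Python rendering of that pseudocode might look like the following (a sketch only; the 69-byte record size comes from the question, while the file name and thread count are placeholders):

import threading

RECORD_SIZE = 69

f = open('data.bin', 'rb')          # hypothetical file name
lock = threading.Lock()

def read_record():
    # the lock serializes access to the shared handle
    with lock:
        return f.read(RECORD_SIZE)

def worker():
    while True:
        record = read_record()
        if not record:              # end of file
            break
        # ... process record ...

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
f.close()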

Another way to do this is to dedicate a thread to reading. It reads records and places them into a thread-safe queue. The queue is limited to some size (100 records, 10,000 records, whatever). The processing threads read from that queue. The advantage of this method is that the reading thread can fill the queue while the other threads are processing. The processing threads can then very quickly get the next record.
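
A minimal sketch of that reader-plus-queue arrangement, again assuming 69-byte records; the queue size, file name and worker count are illustrative:

import threading
try:
    import queue            # Python 3
except ImportError:
    import Queue as queue   # Python 2

RECORD_SIZE = 69
SENTINEL = None             # marker telling workers to stop

def reader(path, q, n_workers):
    f = open(path, 'rb')
    while True:
        record = f.read(RECORD_SIZE)
        if not record:
            break
        q.put(record)       # blocks when the queue is full (backpressure)
    f.close()
    for _ in range(n_workers):
        q.put(SENTINEL)

def worker(q):
    while True:
        record = q.get()
        if record is SENTINEL:
            break
        # ... process record ...

def run(path, n_workers=4):
    q = queue.Queue(maxsize=10000)  # bounded to, say, 10,000 records
    threads = [threading.Thread(target=reader, args=(path, q, n_workers))]
    threads += [threading.Thread(target=worker, args=(q,))
                for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

The bounded queue keeps the reader from running arbitrarily far ahead of the workers, while still letting reading overlap with processing.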

  • No, the file is not opened for every record but for every process. So if I have 5 processes, the file will be opened 5 times. Your method is good, but in my case processing takes much less time than reading, so in any case I have to parallelize the reading. – Aman Jagga Mar 03 '14 at 15:23
  • @AmanJagga: My apologies. I read your code wrong. I've updated my answer. – Jim Mischel Mar 03 '14 at 17:09
  • @AmanJagga: You cannot parallelize the reading. The disk can only do one thing at a time. If processing takes much less time than reading, then your program is I/O bound and adding more threads will not help you. The *best* you can do is have one reading thread and one processing thread so that some of the reading time overlaps the processing time. – Jim Mischel Mar 03 '14 at 17:11