
I am trying to simulate some DNA-sequencing reads and, in order to speed up the code, I need to run it in parallel.

Basically, what I am trying to do is the following: I am sampling reads from the human genome, and I think that when two processes from the multiprocessing module try to get data from the same file (the human genome), the processes get corrupted and cannot retrieve the desired DNA sequence. I have tried different things, but I am very new to parallel programming and I cannot solve my problem.

When I run the script with one core it works fine.

This is the way I am calling the function:

if __name__ == '__main__':
    jobs = []
    # init the processes
    for i in range(number_of_cores):
        length = 100
        lock = mp.Manager().Lock()
        p = mp.Process(target=simulations.sim_reads,
                       args=(lock, FastaFile, "/home/inigo/msc_thesis/genome_data/hg38.fa",
                             length, paired, results_dir, spawn_reads[i], temp_file_names[i]))
        jobs.append(p)
        p.start()
    for p in jobs:
        p.join()

And this is the function I am using to get the reads, where each process writes the data to a different file:

from io import StringIO

from Bio import SeqIO


def sim_single_end(lc, fastafile, chr, chr_pos_start, chr_pos_end, read_length, unique_id):
    lc.acquire()
    # fetch the two halves of the split read from the genome
    left_split_read = fastafile.fetch(chr, chr_pos_end - (read_length / 2), chr_pos_end)
    right_split_read = fastafile.fetch(chr, chr_pos_start, chr_pos_start + (read_length / 2))
    reversed_left_split_read = left_split_read[::-1]
    total_read = reversed_left_split_read + right_split_read
    seq_id = "id:%s-%s|left_pos:%s-%s|right:%s-%s " % (unique_id, chr, int(chr_pos_end - (read_length / 2)), int(chr_pos_end), int(chr_pos_start), int(chr_pos_start + (read_length / 2)))
    quality = "I" * read_length
    fastq_string = "@%s\n%s\n+\n%s\n" % (seq_id, total_read, quality)
    lc.release()
    # build a SeqRecord from the FASTQ-formatted string
    new_record = SeqIO.read(StringIO(fastq_string), "fastq")
    return new_record

Here is the traceback:

Traceback (most recent call last):
  File "/usr/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/usr/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/inigo/Dropbox/PycharmProjects/circ_dna/simulations.py", line 107, in sim_ecc_reads
    new_read = sim_single_end(lc, fastafile, chr, chr_pos_start, chr_pos_end, read_length, read_id)
  File "/home/inigo/Dropbox/PycharmProjects/circ_dna/simulations.py", line 132, in sim_single_end
    new_record = SeqIO.read(StringIO(fastq_string), "fastq")
  File "/usr/local/lib/python3.5/dist-packages/Bio/SeqIO/__init__.py", line 664, in read
    first = next(iterator)
  File "/usr/local/lib/python3.5/dist-packages/Bio/SeqIO/__init__.py", line 600, in parse
    for r in i:
  File "/usr/local/lib/python3.5/dist-packages/Bio/SeqIO/QualityIO.py", line 1031, in FastqPhredIterator
    for title_line, seq_string, quality_string in FastqGeneralIterator(handle):
  File "/usr/local/lib/python3.5/dist-packages/Bio/SeqIO/QualityIO.py", line 951, in FastqGeneralIterator
    % (title_line, seq_len, len(quality_string)))
ValueError: Lengths of sequence and quality values differs for id:6-chr1_KI270707v1_random|left_pos:50511537-50511587|right:50511214-50511264 (0 and 100).
  • You haven't shown the traceback of the error. If the processes write to a different file each time, you could randomise all their read lines beforehand (ensuring no overlap), chunk that list and pass a chunk to each process. As long as you prevent them from reading the file at the same time, each process would only look at their pre-allocated chunks – roganjosh Feb 20 '17 at 22:22
  • 1
    You've included the traceback except the error itself. – roganjosh Feb 20 '17 at 22:31
  • @roganjosh Hi! I have added the traceback. I have tried to prevent them from reading the file at the same time with `Lock()`, but it has not worked. How should I prevent them from reading the file at the same time? – Praderas Feb 20 '17 at 22:33
  • @roganjosh I had cut the traceback before the point where Biopython and the other packages begin to complain that I am trying to write an empty sequence. I considered that part not relevant; however, just in case, I have now added it. – Praderas Feb 20 '17 at 22:35
  • I have added the complete traceback, thanks for your help – Praderas Feb 20 '17 at 22:40
  • Each process could open the same file for reading, which should prevent them from interfering with each other in that respect. – martineau Feb 20 '17 at 22:48
  • @martineau what do you think is going on here? Do you think it's a concurrent read issue? – roganjosh Feb 20 '17 at 22:51
  • @roganjosh: It's not possible to determine from the code provided. My suggestion would eliminate that from being a possible cause of any issues. – martineau Feb 20 '17 at 22:57
  • @martineau I'm just trying to narrow down for OP as the stack trace wasn't as illuminating as I hoped :( I would personally try and segregate the data from the file between processes but I have a feeling I underestimate the size of a genome in this regard. – roganjosh Feb 20 '17 at 23:00
  • @martineau I have tried to open the genome inside the process, and it has not worked – Praderas Feb 20 '17 at 23:04
  • I have added the biopython tag and upvoted. This seems like a specialist problem to me, I don't think I can help from the error message you were given, sorry. Best of luck. – roganjosh Feb 20 '17 at 23:06
  • @roganjosh: Your approach might work, too, but sounds like it would involve a non-trivial amount of additional code as well as modifications to existing code. I think/thought my suggestion would be relatively easy to implement—but apparently not... – martineau Feb 20 '17 at 23:09
  • @martineau I'm not sure whether your last comment means that the suggestion was overlooked but as far as I can tell, it's passed as an argument to the MP function. If that wasn't what you intended, I'm also stuck on where to implement it tbh; each process is getting a different reference to the file this way? – roganjosh Feb 20 '17 at 23:14
  • @roganjosh: As I said, there's not enough code in the question to determine what's going on. To allow each process to open the same file would only require passing the filename to each process. If each one opened the file for reading only, it would be a different reference, which would allow each process to access it in whatever manner it needed without affecting others that might also be reading it [see the sketch after these comments]. – martineau Feb 20 '17 at 23:28
  • can you share the code of `sim_reads` and the initialization of `FastaFile`? (in case it's an instance and not a class name) – hansaplast Feb 21 '17 at 05:31
  • Hi, I have written to the authors of the package I am using to get the sequences, and they have told me that it could be an issue in the program. Thanks everyone for your help – Praderas Feb 21 '17 at 14:11
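
Below is a minimal sketch of the two suggestions from the comments: pre-generating non-overlapping sampling coordinates and chunking them across processes (roganjosh), and passing only the filename so that each worker opens its own read-only handle to the genome (martineau). The `worker` function, the coordinate list, and the output paths are hypothetical, and pysam is assumed to be available as in the question:

import multiprocessing as mp

import pysam  # assumed available, as in the question


def worker(genome_path, coords, out_path):
    # each worker opens its own read-only handle to the genome,
    # so no file object is ever shared between processes
    fastafile = pysam.FastaFile(genome_path)
    with open(out_path, "w") as out:
        for chrom, start, end in coords:
            out.write(fastafile.fetch(chrom, start, end) + "\n")
    fastafile.close()


if __name__ == '__main__':
    genome_path = "/home/inigo/msc_thesis/genome_data/hg38.fa"
    # hypothetical, pre-generated, non-overlapping sampling coordinates
    coords = [("chr1", i * 1000, i * 1000 + 100) for i in range(400)]
    number_of_cores = 4
    # one chunk of coordinates per process
    chunks = [coords[i::number_of_cores] for i in range(number_of_cores)]
    jobs = []
    for i, chunk in enumerate(chunks):
        # each process gets its own chunk and its own output file
        p = mp.Process(target=worker, args=(genome_path, chunk, "reads_%d.fa" % i))
        jobs.append(p)
        p.start()
    for p in jobs:
        p.join()

Because every process holds its own handle and writes to its own file, no lock is needed around the fetches.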

1 Answer


I am the OP, answering my own question almost a year later. The problem was that the package I was using for reading the human genome file (pysam) was failing, and the root cause was a typo in the multiprocessing call.

From the authors' response, this should work:

 p = mp.Process(target=get_fasta, args=(genome_fa,))

Note the trailing ',' to ensure you pass a tuple.
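
For illustration, here is a minimal, self-contained sketch of what the trailing comma changes; `get_fasta` and `genome_fa` are placeholders taken from the snippet above:

import multiprocessing as mp

def get_fasta(path):
    print("worker received:", path)

if __name__ == '__main__':
    genome_fa = "hg38.fa"
    # args=(genome_fa) is just the string wrapped in parentheses, so
    # Process would unpack it character by character and the child
    # would fail with a TypeError; args=(genome_fa,) is a one-element
    # tuple, and the worker receives the whole path.
    p = mp.Process(target=get_fasta, args=(genome_fa,))
    p.start()
    p.join()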

See https://github.com/pysam-developers/pysam/issues/409 for more details
