
I asked a similar question about Python DASK earlier and learned that DASK doesn't support mutating a matrix/2D array in place. So I'm interested in whether this is possible in Julia. The scenario is:

  • I'll go through a large text (a genome, actually) and find pairs of words (DNA sequences) that appear in the same window. The window will scan the whole genome, so it will generate quite a lot of word pairs. (There will be approximately a billion windows, and each window will generate 200 pairs, so the total number of pairs is expected to be around 200 billion.)
  • for each pair, I want to update the count (or co-occurrence) matrix, which is expected to be around 65,000 x 65,000.
  • the matrix is not sparse; it will be a dense one
  • when the matrix is complete, it will be used for an SVD calculation
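
To make the scenario concrete, here is a hypothetical in-memory sketch of the windowed pair counting I have in mind. It uses k = 2 so the demo matrix stays 16 x 16; the real 65,000-row matrix would presumably correspond to 8-mers, since 4^8 = 65,536 (an assumption on my part, as are all names below):

```julia
# Hypothetical sketch: count co-occurring k-mers within a sliding window.
# k-mers are encoded as integer IDs (2 bits per base), so an 8-mer
# vocabulary has 4^8 = 65,536 entries. Here k = 2 keeps the demo tiny.

const BASE = Dict('A' => 0, 'C' => 1, 'G' => 2, 'T' => 3)

# Encode the k-mer starting at position i as an integer in 1:4^k.
kmer_id(seq::String, i::Int, k::Int) =
    foldl((acc, c) -> 4acc + BASE[c], seq[i:i+k-1]; init = 0) + 1

function count_pairs!(counts::Matrix{Int32}, seq::String, k::Int, win::Int)
    n = length(seq)
    for i in 1:(n - win + 1)                      # slide the window
        ids = [kmer_id(seq, j, k) for j in i:(i + win - k)]
        for a in 1:length(ids), b in (a+1):length(ids)
            counts[ids[a], ids[b]] += 1           # record the co-occurrence
            counts[ids[b], ids[a]] += 1           # keep the matrix symmetric
        end
    end
    return counts
end

counts = zeros(Int32, 16, 16)                     # 4^2 = 16 possible 2-mers
count_pairs!(counts, "ACGTACGT", 2, 4)
```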

So, can I achieve this in Julia? Is there a feasible way to fill such a matrix with streaming data? I'm guessing it should be memory-mapped, disk-based storage, but I'm not sure whether such approaches will tolerate so many updates. When I checked for possible solutions, I couldn't find packages that fit the bill. The Julia ecosystem is getting better by the day, so I wanted to inquire again. Thanks.
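
For the memory-mapped idea, the closest thing I found is Julia's standard-library Mmap module, which can back a dense Matrix with a file so updates go through the OS page cache rather than keeping the whole array resident. A minimal sketch of what I mean (the file name and sizes here are made up for illustration, not a working pipeline):

```julia
# Hypothetical sketch: a disk-backed dense count matrix via the stdlib
# Mmap module. The file holds the raw Int32 entries; writes land in the
# OS page cache, so RAM usage can stay well below the full matrix size.
using Mmap

n = 1_000                      # would be 65_000 for real (~16.9 GB as Int32)
io = open("counts.bin", "w+")
counts = mmap(io, Matrix{Int32}, (n, n))
fill!(counts, Int32(0))        # a fresh file maps as zeros, but be explicit

# Stream in (i, j) pairs -- a few fake ones stand in for the genome scan.
for (i, j) in [(1, 2), (2, 1), (1, 2), (5, 7)]
    counts[i, j] += 1
end

Mmap.sync!(counts)             # flush dirty pages to disk
close(io)
```

Whether this tolerates ~200 billion scattered updates is exactly my question: random writes across a 17 GB file would thrash the page cache unless pairs are batched or sorted by region first.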

Alper Yilmaz
  • Julia should just work for this. You probably want at least 64 GB of RAM (preferably 128 GB), as the matrix will be 32 GB or so. Also, computing the SVD will take a while (probably more than 1 CPU-day). – Oscar Smith May 08 '21 at 03:12
  • I was more interested in a disk-based solution that would require a modest amount of RAM. With that much RAM, any programming language can store such a large array in memory; NumPy, for instance, can be used for that, but it will be painfully slow. DASK can process the data in parallel, and its workers do not require huge RAM. I was interested in similar solutions. – Alper Yilmaz May 08 '21 at 03:16
  • I think you might be underestimating how slow this would be disk-based. If I'm doing my math right, the SVD would require around 8.7 petabytes of reads and writes, which means most SSDs will die before they finish running this. I might be off in my numbers (if you can do large amounts of work per read/write), but even then, this would take an ungodly amount of time. – Oscar Smith May 08 '21 at 03:44
  • agreed, it's a challenging task... that's why I was asking for help from Julians ;) – Alper Yilmaz May 08 '21 at 08:35
  • I just want to add that if you are working with genome data there is a whole organization of Julia packages for that: https://github.com/BioJulia Please try to search for existing solutions before wasting time reinventing the wheel. Most newcomers don't need to know how to write efficient code from the start, they can just use a package from an experienced programmer. – juliohm May 08 '21 at 11:32

0 Answers