I would like to compute the SVD of a large matrix with Dask. My naive approach was to create an empty 2D array and update it in a loop, but Dask arrays do not support in-place mutation, so I'm looking for a workaround. I also tried saving the large array (around 65,000 x 65,000, or even larger) to HDF5 via h5py, but updating the array in a loop is quite inefficient. Should I be using a memory-mapped NumPy array (mmap) instead?

Below is sample code without any Dask implementation. Should I use dask.bag or dask.delayed for this operation?

The sample code takes long strings and, within a sliding window of size 8, generates combinations of two-letter words. In the actual data, the window size will be 20 and the words will be 8 letters long. The input string can be 3 Gb long.

import itertools
import numpy as np
np.set_printoptions(threshold=np.inf)  # np.Inf was removed in NumPy 2.0

# generate all possible words of length 2 (AA, AC, AG, AT, CA, etc.) 
# then get numerical index (AA -> 0, AC -> 1, etc.)
bases=['A','C','G','T']
all_two = [''.join(p) for p in itertools.product(bases, repeat=2)]
two_index = {x: y for (x,y) in zip(all_two, range(len(all_two)))}

# final array to fill, size is [ 16 possible words x 16 possible words ]
counts = np.zeros(shape=(16,16))  # in actual sample we expect 65000x65000 array

# sample sequences (these will be gigabytes long in actual sample)
seq1 = "AAAAACCATCGACTACGACTAC"
seq2 = "ACGATCACGACTACGACTAGATGCATCACGACTAAAAA"

# accumulate results
all_pairs=[]

def generate_pairs(sequence):
    pairs=[]
    for i in range(len(sequence)-8+1):
        window=sequence[i:i+8]
        # k (not i) avoids shadowing the outer loop variable
        words = [window[k:k+2] for k in range(0, len(window), 2)]
        for pair in itertools.combinations(words,2):
            pairs.append(pair)
    return pairs

# use function for each sequence
all_pairs.extend(generate_pairs(seq1))
all_pairs.extend(generate_pairs(seq2))

# convert 1D array of pairs into 2D counts of pairs
# for each pair, lookup word index and increase corresponding cell
for j in all_pairs:
    counts[ two_index[j[0]], two_index[j[1]] ] += 1

print(counts) 
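
To put the target size in perspective, here is a rough calculation, assuming a dense array with one cell per ordered pair of 8-letter words over A, C, G, T. The helper name `dense_size_gib` is only for illustration and is not part of any library.

```python
import numpy as np

def dense_size_gib(word_length, dtype, alphabet_size=4):
    """Size in GiB of a dense (n_words x n_words) array, n_words = alphabet_size ** word_length."""
    n_words = alphabet_size ** word_length      # 4 ** 8 = 65536 words of length 8
    n_bytes = n_words * n_words * np.dtype(dtype).itemsize
    return n_bytes / 2**30

for dtype in (np.float64, np.uint32):
    print(np.dtype(dtype).name, dense_size_gib(8, dtype), "GiB")
```

With the default float64 that `np.zeros` produces, the full array comes to 32 GiB. A 4-byte integer type would halve that, assuming no single count exceeds the range of that type.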

EDIT: I may have made the question more complicated than necessary, so let me rephrase it. I need to construct a single large 2D array of size ~65000x65000. The array must be filled with the number of occurrences of each (word1, word2) pair. Since Dask arrays do not support item assignment, I cannot fill the array as the pairs are processed. Is there a workaround for generating and filling a large 2D array with Dask?

Here's simpler code to test:

import itertools
import numpy as np
np.set_printoptions(threshold=np.inf)  # np.Inf was removed in NumPy 2.0

bases=['A','C','G','T']
all_two = [''.join(p) for p in itertools.product(bases, repeat=2)]
two_index = {x: y for (x,y) in zip(all_two, range(len(all_two)))}

seq = "AAAAACCATCGACTACGACTAC"

counts = np.zeros(shape=(16,16))

for i in range(len(seq)-8+1):
    window=seq[i:i+8]
    # k (not i) avoids shadowing the outer loop variable
    words = [window[k:k+2] for k in range(0, len(window), 2)]
    for pair in itertools.combinations(words,2):
        counts[two_index[pair[0]], two_index[pair[1]]] += 1  # problematic part!

print(counts)
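
One direction that avoids item assignment altogether is to compute an independent partial count matrix per chunk of the sequence and then add the partials together. The sketch below uses only NumPy and the toy sizes from the code above. The names `count_pairs` and `split_with_overlap` and the chunk size are hypothetical. Chunks overlap by window length minus one character, so windows that straddle a chunk boundary are still counted exactly once.

```python
import itertools
import numpy as np

BASES = "ACGT"
WORD = 2       # word length in the toy example
WINDOW = 8     # window length in the toy example
N_WORDS = len(BASES) ** WORD

word_index = {"".join(p): i
              for i, p in enumerate(itertools.product(BASES, repeat=WORD))}

def count_pairs(seq):
    """Return an N_WORDS x N_WORDS count matrix for one piece of sequence."""
    rows, cols = [], []
    for start in range(len(seq) - WINDOW + 1):
        window = seq[start:start + WINDOW]
        words = [word_index[window[k:k + WORD]] for k in range(0, WINDOW, WORD)]
        for a, b in itertools.combinations(words, 2):
            rows.append(a)
            cols.append(b)
    # Flatten (row, col) to one index and count all occurrences in a single call.
    flat = (np.asarray(rows, dtype=np.int64) * N_WORDS
            + np.asarray(cols, dtype=np.int64))
    return np.bincount(flat, minlength=N_WORDS * N_WORDS).reshape(N_WORDS, N_WORDS)

def split_with_overlap(seq, chunk_size):
    """Yield chunks whose starts are chunk_size apart, overlapping by WINDOW - 1 chars."""
    for start in range(0, len(seq), chunk_size):
        yield seq[start:start + chunk_size + WINDOW - 1]

seq = "AAAAACCATCGACTACGACTAC"
partials = [count_pairs(chunk) for chunk in split_with_overlap(seq, chunk_size=5)]
counts = sum(partials)   # reduction step: no cell is ever mutated in place
print(counts)
```

Because each `count_pairs` call is independent, the list comprehension is the part that could presumably be handed to dask.delayed or dask.bag, with the sum as the final reduction. That wiring is not shown here. At the real size, a dense partial per chunk would not fit in memory, so the partials would have to be sparse or restricted to blocks of the matrix, which this sketch does not address.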
Alper Yilmaz
  • It's not quite clear to me what you are trying to achieve. IIUC you want to generate several arrays and then use dask to calculate SVD? Do you mind elaborating? – rpanai Mar 11 '20 at 14:04
  • Hi @rpanai, thanks for your interest. I edited the question and added a simpler code. I'm trying to generate a **single** large array with Dask. – Alper Yilmaz Apr 21 '20 at 09:27

0 Answers