
I have a function (a method, to be accurate) whose primary purpose is to zip multiple vectors (similar to the zip function in Python).
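
For reference, the behaviour I am after is the same column-wise grouping that Python's `zip` performs. A minimal sequential sketch of that shape (the function name here is made up purely for illustration):

    // [["a1", "a2"], ["b1", "b2"]] -> [["a1", "b1"], ["a2", "b2"]]
    fn zip_rows<'a>(rows: &[Vec<&'a str>]) -> Vec<Vec<&'a str>> {
        // Stop at the shortest row, like Python's zip.
        let num_cols = rows.iter().map(|r| r.len()).min().unwrap_or(0);
        (0..num_cols)
            .map(|col| rows.iter().map(|row| row[col]).collect())
            .collect()
    }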

Both the outer and inner vectors of the windows variable are extremely large, but since these are vectors of references, I believed memory usage wouldn't be an issue. I was wrong: this code constantly runs into memory allocation issues and crashes due to lack of memory.

Any performance improvements are wiped out as the code immediately starts paging, at which point it is limited by storage I/O.

I have tried a few variations of the code, but all of them seem to run into the same issue.

In my understanding, the only large memory allocation occurs at the time of creation of the windows variable (which happens in a different method). Once that initial allocation is complete, no additional large allocations should occur in the method create_zipped_kmers, since it only deals with references to the original data. Is this the correct understanding?

Is there anything I am doing wrong in the code, or is there a gap in my knowledge? How do I go about reducing memory usage while still maintaining performance?

Variation 1

    // Assumes this lives in an `impl<'a> ...` block with `use rayon::prelude::*;` in scope for `into_par_iter`.
    fn create_zipped_kmers(&'a self, windows: &'a Vec<Vec<&'a str>>) -> Vec<Vec<&'a str>> {
        if windows.is_empty() {
            panic!("Sequence k-mer vector cannot be empty");
        }

        let num_cols = windows.iter().map(|v| v.len()).min().unwrap_or(0);
        let num_rows = windows.len();

        let mut zipped = Vec::with_capacity(num_cols);
        for col_index in 0..num_cols {
            let column: Vec<&str> = (0..num_rows)
                .into_par_iter()
                .map(|row_index| windows[row_index][col_index])
                .collect();
            zipped.push(column);
        }

        zipped
    }

Variation 2

    fn create_zipped_kmers(&'a self, windows: &'a Vec<Vec<&'a str>>) -> Vec<Vec<&'a str>> {
        if windows.is_empty() {
            panic!("Sequence k-mer vector cannot be empty");
        }

        let num_cols = windows.iter().map(|v| v.len()).min().unwrap_or(0);
        let num_rows = windows.len();

        (0..num_cols)
            .into_par_iter()
            .map(|col_index| {
                (0..num_rows)
                    .map(|row_index| windows[row_index][col_index])
                    .collect::<Vec<&str>>()
            })
            .collect()
    }

Variation 3

    fn create_zipped_kmers(&'a self, windows: &'a Vec<Vec<&'a str>>) -> Vec<Vec<&'a str>> {
        if windows.is_empty() {
            panic!("Sequence k-mer vector cannot be empty");
        }

        let num_seqs = windows[0].len();
        let mut iters: Vec<_> = windows.par_iter().map(|n| n.into_iter()).collect();

        (0..num_seqs)
            .map(|_| {
                iters
                    .par_iter_mut()
                    .map(|n| *n.next().unwrap())
                    .collect()
            })
            .collect()
    }

Thanks.

Shane
  • Can we get some numbers? How many vectors are in `windows` and how many elements are in those vectors? How much memory do you have and how much is already in use? – kmdreko Jul 08 '23 at 03:38
  • @kmdreko the outer vector contains 2293158 vectors where each vector contains 1308 &str. I've got 30GB of system memory. – Shane Jul 08 '23 at 03:51
  • `&str` takes up 2 words of memory, so even ignoring the overhead of `Vec`, `2293158*1308*16 bytes = 47GB`. – Dogbert Jul 08 '23 at 03:58
  • Hmm, a `&str` is 16 bytes (two pointers big) assuming you're on a 64-bit system and 2,293,158 x 1,308 x 16 is about 48GB... so you should be paging memory before it even reaches this function (unless you mean you had 30GB *to spare*). – kmdreko Jul 08 '23 at 03:58
  • @kmdreko no, you are right, not sure how I missed that. It's paging memory before reaching the above function. That explains a lot. Of the three variations above, which variation do you think would be the best to use as far as ideal memory usage goes? – Shane Jul 08 '23 at 04:04
  • Avoid #3 since that involves collecting 2 million iterators upfront. The other two should be roughly the same in terms of memory usage. #2 looks the best to me just based on where `.into_par_iter()` is, but as always, measure! – kmdreko Jul 08 '23 at 04:45
  • @kmdreko tried #2 with a machine with 90GB of system memory and once again the process got killed by the kernel for memory usage once execution reached the above method. Even with considering the overheads, 90GB should be plenty to run this code. Any ideas? – Shane Jul 08 '23 at 06:51
  • @Shane your input is at least 48GB and output is also at least 48GB, plus the memory occupied by the strings' underlying data, plus the overhead of millions of Vec. You need much more than 90GB, probably at least 100GB + the sum of bytes occupied by all strings. – Dogbert Jul 08 '23 at 06:57
  • @Dogbert thanks. Is the exact overhead per vector documented somewhere? – Shane Jul 08 '23 at 07:15
  • @Shane a Vec takes 3 words (24 bytes on 64-bit), plus the amount of memory allocated (`vec.capacity() * size of each element`), which may be more than `.len()` depending on how you create the vector. I would recommend loading less data first and seeing how much it consumes on your machine. – Dogbert Jul 08 '23 at 10:37
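
To put the figures from the comments above into a concrete form, here is a quick back-of-the-envelope sketch (assuming a 64-bit target and the 2,293,158 × 1,308 dimensions mentioned; it ignores capacity over-allocation and the string data itself):

    use std::mem::size_of;

    fn main() {
        let rows: usize = 2_293_158; // outer vectors in `windows`
        let cols: usize = 1_308;     // &str per inner vector

        // A &str is a fat pointer (data pointer + length): 16 bytes on 64-bit targets.
        let str_ref = size_of::<&str>();
        // A Vec header is pointer + length + capacity: 24 bytes on 64-bit targets.
        let vec_header = size_of::<Vec<&str>>();

        // Input `windows`: `rows` inner Vecs of `cols` references each.
        let input = rows * cols * str_ref + rows * vec_header;
        // Zipped output: `cols` inner Vecs of `rows` references each.
        let output = cols * rows * str_ref + cols * vec_header;

        println!("input  ~= {:.1} GB", input as f64 / 1e9);  // ~48 GB
        println!("output ~= {:.1} GB", output as f64 / 1e9); // ~48 GB
    }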

0 Answers