
I have a large genetic dataset (X, Y coordinates), of which I can easily determine one dimension (X) at runtime.

I drafted the following code for a matrix class that allows specifying the size of one dimension while leaving the other one dynamic, using std::vector for the dynamic dimension. Each vector is allocated with new and owned by a unique_ptr, and those unique_ptrs are stored in a C-style array, itself allocated with new and owned by a unique_ptr.

#include <memory>
#include <vector>

class Matrix
{
private:

    typedef std::vector<Genotype> GenVec;
    typedef std::unique_ptr<GenVec> upGenVec;

    std::unique_ptr<upGenVec[]> m;   // fixed-size array of row vectors
    unsigned long size_;             // the fixed dimension (X)

public:

    // ...

    // construct: allocate the array for the fixed dimension,
    // then give each row its own dynamically growing vector
    Matrix(unsigned long _size): m(new upGenVec[_size]), size_(_size)
    {
        for (unsigned long i = 0; i < this->size_; ++i)
            this->m[i] = upGenVec(new GenVec);
    }
};
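
For reference, the same construction can be written with std::make_unique in C++14, which avoids naked new. This is only a sketch of an equivalent structure; the Genotype stub here is a placeholder for the real type.

#include <memory>
#include <vector>

struct Genotype { };  // placeholder for the real type

class Matrix
{
    typedef std::vector<Genotype> GenVec;
    typedef std::unique_ptr<GenVec> upGenVec;

    std::unique_ptr<upGenVec[]> m;
    unsigned long size_;

public:
    explicit Matrix(unsigned long size)
        : m(std::make_unique<upGenVec[]>(size)), size_(size)  // value-initialized array
    {
        for (unsigned long i = 0; i < size_; ++i)
            m[i] = std::make_unique<GenVec>();  // one dynamic row vector each
    }
};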

My question:

Does it make sense to use this instead of std::vector<std::vector<Genotype>>?

My reasoning behind this implementation is that only one dimension needs to be dynamic, while the other should stay fixed. Nested std::vectors could allocate more memory than needed. As I am working with data that would fill an estimated ~50 GB of RAM, I would like to control memory allocation as tightly as I can.

Or, are there better solutions?

Stingery
  • I have to add that I am new to `unique_ptr`. Hence, I am curious to explore it, but cannot say if this would work as intended. – Stingery Sep 29 '14 at 16:44
  • You could use std::array if you want to be sure that your first dimension won't be changed. – danadam Sep 29 '14 at 16:55
  • I do not know the size of that dimension at compile time, only at runtime. The Matrix is in a sense dynamic in both dimensions, but one of them gets fixed at runtime. – Stingery Sep 29 '14 at 16:58
  • I think this is a process problem, not a coding problem. Can you explain your start-to-finish process? Where are you getting the data from (user input, files, db, etc.)? Why is it in a massive matrix? How is the program being run (command line, ongoing, etc.)? And what do you hope to do with the matrix data (operations, analysis, etc.)? – user1269942 Sep 29 '14 at 17:21
  • The process is as follows: (1) read data from files, line by line; line elements are parsed and put into vectors, whose size is always the same at runtime (which is why I can learn one dimension from reading the first line). (2) The matrix is filled with line elements, but each line is a column, and there is an unknown number of lines; in the matrix, columns become rows (a rough sketch of this fill step follows below). (3) Then each matrix row is iterated over, which is the bulk of the processing; the matrix is not iterated over by column. – Stingery Sep 29 '14 at 17:32
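
A rough sketch of that fill step. The parseLine() helper and the integer-based Genotype are assumptions for illustration only; the real parsing depends on the file format.

#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

struct Genotype { int value; };  // placeholder for the real type

// hypothetical helper: parse one file line into X genotypes
std::vector<Genotype> parseLine(const std::string& line)
{
    std::vector<Genotype> out;
    std::istringstream ss(line);
    int v;
    while (ss >> v)              // assumes whitespace-separated integers
        out.push_back(Genotype{v});
    return out;
}

// step (2): each file line is a column; element i of every line
// is appended to row i, so columns become rows
void fill(std::vector<std::vector<Genotype>>& m, std::istream& in)
{
    std::string line;
    while (std::getline(in, line))   // unknown number of lines (Y)
    {
        const std::vector<Genotype> column = parseLine(line);
        for (std::size_t i = 0; i < column.size() && i < m.size(); ++i)
            m[i].push_back(column[i]);
    }
}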

1 Answer


I won't cite any paragraphs from the specification, but I'm pretty sure that std::vector's memory overhead is fixed, i.e. it doesn't depend on the number of elements it contains (the sizeof check below illustrates these fixed sizes). So I'd say your solution with a C-style array is actually worse memory-wise, because what you allocate, excluding the actual data, is:

  • N * pointer_size (first dimension array)
  • N * vector_fixed_size (second dimension vectors)

In the vector<vector<...>> solution, what you allocate is:

  • 1 * vector_fixed_size (first dimension vector)
  • N * vector_fixed_size (second dimension vectors)
danadam
  • I thought the `vector<vector>` solution would be worse memory-wise, as the first-dimension capacity can be larger than required. I know the first-dimension size at runtime and want to fix it, leaving only the second-dimension capacity dynamic; i.e. there will always be X rows, but the number of columns, Y, is unknown. – Stingery Oct 02 '14 at 10:40
  • If you `push_back()` N times into the first-dimension vector and let it auto-grow, then yes, it may reserve more memory than needed. But if you call `reserve(N)` before those N `push_back()` calls, or call `resize(N)` instead of N `push_back()` calls, then the vector should use just enough memory to hold N elements and no more (a minimal sketch follows below). – danadam Oct 02 '14 at 11:02
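
A minimal sketch of that approach, assuming X becomes known at runtime after reading the first line and using a placeholder Genotype:

#include <cstddef>
#include <vector>

struct Genotype { };  // placeholder for the real type

int main()
{
    const std::size_t X = 1000;  // known at runtime after reading the first line

    // X default-constructed inner vectors, no surplus outer capacity
    // from repeated push_back growth
    std::vector<std::vector<Genotype>> m(X);   // same effect as m.resize(X)

    // each inner vector (the Y dimension) still grows on demand
    m[0].push_back(Genotype{});
}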