
I need to read in a few input files (each containing a 2D matrix of integers) and store them in a vector of 2D vectors. Below is the code I wrote:

#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

int main(int argc, char *argv[]) {
  /*
    int my_rank;
    int p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
  */

  std::vector<std::vector<std::vector<int> > > matrices;
  matrices.reserve(argc - 1);
  for(int i=1; i<argc; ++i){
      std::string line;
      std::ifstream fp(argv[i]);
      std::vector<std::vector<int> > matrix;

      if (fp.is_open()) {
          while (getline(fp, line)) {
              if(line!=""){
                  //add a new row to the matrix
                  std::vector<int> newRow;
                  //parse each row put the values in the file buffer
                  std::stringstream buff(line);
                  //buffValue is each number in a row
                  int buffValue;
                  while (buff >> buffValue) {
                      newRow.push_back(buffValue);
                  }
                  matrix.push_back(newRow);
              }
          }
      }
      else {
          std::cout << "Failed to read files" << std::endl;
      }
      matrices.push_back(matrix);
  }
  //MPI_Finalize();
  return 0;
}

I have two questions here:

  1. When I read in a single 175 MB file, the program ends up taking 900 MB of resident memory. This is a problem because I usually need to read in 4 files of a few hundred MB each, which will eventually take multiple GB of memory. Is this because of the way I read/store the integers?

  2. If I uncomment the lines involving MPI, the resident memory usage goes up to 1.7 GB. Is this normal, or am I doing something wrong here? I'm using MPICH.

Ian Li
  • The digit `1` in a plain text file takes up one byte, but `int a = 1;` probably takes up four bytes of memory. Just one reason. – Jonathan Potter Mar 16 '16 at 21:41
  • @JonathanPotter Hi Jonathan, does that mean a four digit number 1234 in the file takes same amount of memory as int a=1234; ? – Ian Li Mar 16 '16 at 21:43
  • The code may be fragmenting the heap. Try pre-allocating space when you create a `vector` object: `std::vector<int> newRow(best_guess);` etc., where `best_guess` is a guess at the number of elements that are going to go in. – Pete Becker Mar 16 '16 at 21:43
  • @IanLi , assuming typical ASCII encoding and a 32 bit `int`, yes. 4 characters will be 4 bytes and an `int` will be 4 bytes. You might also have a 64 bit (8 byte) `int` depending on your platform and compiler options. – user4581301 Mar 16 '16 at 22:02
  • Off topic: since you are using MPI I assume you're aiming for speed. Because a `vector` of `vector`s is not contiguous storage (each `vector` points to its own allocated block of memory) you may be taking a performance penalty when you traverse from one vector to the next. If every row in your data is the same length you can allocate a 1D array and fake the 2D indexing with `row * numberColumns + column`. This keeps all of your data in one big block and makes it easy for the CPU to predict and cache. – user4581301 Mar 16 '16 at 22:17
  • @user4581301 Interesting thought on storing them in a single line! Do you by chance know if calculating `row * numberColumns + column` will be slow since it's doing a calculation? Or is it fast enough that we can ignore it? – Ian Li Mar 16 '16 at 22:26
  • @PeteBecker I added this: `if(matrix.size()==0){ std::vector<int> newRow; } else{ std::vector<int> newRow(matrix[0].size()); }` But I don't see an improvement in memory used. I'm assuming it's because the vector is too small? – Ian Li Mar 16 '16 at 22:29
  • It's one of those depends things. `vec[x][y]` winds up looking something like `(vec.array + x)->array + y`, so you are trading a pointer dereference for a multiplication. If you have lots of really short rows the calculation is insignificant compared to the increased cache misses. If you have long rows, you won't see as much of an improvement and maybe you will lose versus taking a reference to the row and accessing it as a 1D vector. Have to profile it to find out for sure. – user4581301 Mar 16 '16 at 22:48
  • @IanLi Have a look at [this answer](http://stackoverflow.com/a/15799557/1553090) I wrote a while ago for someone dealing with vector-of-vector matrices. I provided an alternative simple class that stores the data contiguously. It does require that each row in your matrix is the same length, however. If you used the pooling approach suggested in my answer here, you could easily check that each row is the same length when reading, and then construct my `SimpleMatrix` class accordingly. – paddy Mar 16 '16 at 22:51

2 Answers


A vector-of-vector-of-vectors is not an efficient structure to use. You pay the memory overhead of the vector objects themselves, plus the standard growth behaviour of push_back.

A vector grows its memory geometrically when it needs to resize after push_back, in order to meet its amortized time-complexity requirements. If your vector's capacity is currently 10 values and you add an 11th, it will most likely resize its capacity to 20 values.

A side-effect of this growth is potential memory fragmentation. Vector memory is required to be contiguous, and the standard allocators have no realloc ability as in C. So they must allocate more memory elsewhere, move the data, and free the old storage. This can leave holes in memory that your program can't use for anything else, and it also reduces the cache locality of your data, leading to poor performance.

You would be better off creating a more memory-efficient 2D structure for your matrices, and then pushing them onto a deque instead of a vector. Here's one I prepared earlier ;). At the very least, if you must use vector-of-vector for the matrix, then pre-allocate it using vector::reserve.

If memory is more important to you than I/O, then it's not out of the question to read the file twice. The first time around, you obtain information about matrix sizes and row lengths. Then you pre-allocate all your structures, and read the file again.
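A two-pass version might look like this (the helper name is mine, not from the question; it assumes one matrix per file, skipping blank lines as the original code does):

```cpp
#include <cstddef>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// Pass 1: count the values in each row without storing them.
// Pass 2: reserve exact sizes, then fill.
std::vector<std::vector<int>> readMatrixTwoPass(const char* path) {
    std::vector<std::size_t> row_lengths;
    std::string line;
    {
        std::ifstream fp(path);
        while (std::getline(fp, line)) {
            if (line.empty()) continue;
            std::istringstream iss(line);
            std::size_t n = 0;
            for (int v; iss >> v; ) ++n;
            row_lengths.push_back(n);
        }
    }

    std::vector<std::vector<int>> matrix;
    matrix.reserve(row_lengths.size());

    std::ifstream fp(path);
    std::size_t r = 0;
    while (std::getline(fp, line)) {
        if (line.empty()) continue;
        std::vector<int> row;
        row.reserve(row_lengths[r++]);   // exact allocation, no growth waste
        std::istringstream iss(line);
        for (int v; iss >> v; ) row.push_back(v);
        matrix.push_back(std::move(row));
    }
    return matrix;
}
```

Every vector ends up with capacity equal to its size, so no memory is wasted on growth slack, at the cost of reading the file twice.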

Otherwise, using some kind of temporary pool to store your values for a matrix would be acceptable:

std::deque< std::vector< std::vector< int > > > matrices;
std::vector< size_t > columns;  // number of columns, indexed by row
std::vector< int > values;      // all values in matrix

columns.reserve( 1000 );   // Guess a reasonable row count to begin with
values.reserve( 1000000 ); // Guess reasonable value count to begin with

while( getline(fp, line) ) {
    if( line.empty() ) {
        AddMatrix( matrices, columns, values );
    } else {
        std::istringstream iss( line );
        size_t count = 0;
        for( int val; iss >> val; ) {
            values.push_back( val );
            count++;
        }
        columns.push_back( count );
    }
}

// In case last line in file was not empty, add the last matrix.
AddMatrix( matrices, columns, values );

And add the matrix something like this:

void AddMatrix( std::deque< std::vector< std::vector< int > > > & matrices,
                std::vector< size_t > & columns,
                std::vector< int > & values )
{
    if( columns.empty() ) return;

    // Reserve matrix rows
    size_t num_rows = columns.size();
    std::vector< std::vector< int > > matrix;
    matrix.reserve( num_rows );

    // Copy rows into matrix
    auto val_it = values.begin();
    for( size_t num_cols : columns )
    {
        std::vector< int > row;
        row.reserve( num_cols );
        std::copy_n( val_it, num_cols, std::back_inserter( row ) );
        matrix.emplace_back( std::move( row ) );
        val_it += num_cols;
    }

    matrices.emplace_back( std::move( matrix ) );

    // Clear the column and value pools for re-use.
    columns.clear();
    values.clear();
}

Finally, I recommend you choose an appropriate integer type from <cstdint> rather than leaving it up to the compiler. If you need only 32-bit integers, use int_least32_t. If your data range fits in 16-bit integers, you'll save a lot of memory by using int_least16_t.

paddy
  • What do you mean by "creating a more memory-efficient 2D structure"? Should I just use a 3D deque in this case for faster insertion? And thanks for the reading-the-file-twice tip! I'm not sure how long it's going to take to go through a file with 10 million lines; I'll have to test it out! – Ian Li Mar 16 '16 at 22:20
  • I meant some kind of structure where you can preallocate. Matrices do not lend themselves to vector-of-vector because they are rectangular. Whereas vector-of-vector can have arbitrary row lengths for each row. – paddy Mar 16 '16 at 22:31
  • I've added some example code for a way that might be more efficient for reading vector-of-vector data. I used `deque` structures as a simple memory pool - holding temporary information about a matrix before adding it, then being recycled. Completely untested. Haven't tried compiling or anything. – paddy Mar 16 '16 at 22:32
  • Oh, and I just realised the pool containers should be `vector`, since `vector::clear` is guaranteed not to release the memory. I'm not sure the same is true of `deque`. – paddy Mar 16 '16 at 22:38
  • Thanks a lot, this and the template really helped me to figure things out, I'll play around it and see! – Ian Li Mar 17 '16 at 15:11
  • You're welcome. Drop a comment in here with your final result. Curious to know how you get on. – paddy Mar 17 '16 at 20:53

I guess you are seeing the combination of two effects: the different size of an int, plus the extra memory a vector keeps.

I'm not sure you are actually seeing the first effect, but an int typically takes 4 bytes of memory (implementations are allowed to make it 8 bytes, though I have not yet seen one). A character, on the other hand, only takes 1 byte per digit, plus 1 byte for the space. So if you have a lot of small integers in the file, the in-memory representation will be larger than the file; if you have a lot of large numbers, it will be smaller. Also check whether you are comparing against the file size or the size on disk, as some filesystems support compression!

The next effect you will most likely notice is the capacity of the vectors; since you most likely have many of them, this can add quite some overhead. To avoid reallocating on every insertion, std::vector has a capacity: the amount of memory it has actually allocated, which it fills up with the objects you add.

Depending on the implementation, the capacity can grow by, for example, doubling every time you exceed it: if you start with a capacity of 10 and reach a size of 11, the capacity can go to 20; if you reach a size of 21, the capacity can go to 40, and so on. (Note: this is also the reason that reserving is important, as it gives you the right size up front.)

So if you check the capacity and the size of each individual vector, they can differ. If this is really dramatic for you, you can call shrink_to_fit on the vector to reallocate down to the actual stored size.

Finally, the size of your program is also influenced by the application itself. I don't think it is a factor here, but if you link a lot of shared objects that are all loaded during startup, some memory measurements include the size of those shared objects as part of your program's memory.

JVApen
  • Just wondering, how do I check whether I'm referring to the actual file size? I think I'm referring to the size on disk (ls -lh). And I just tried shrink_to_fit; it didn't affect the memory usage for some reason... – Ian Li Mar 16 '16 at 22:36
  • I know that on Windows you can see it in the properties of that file. Haven't checked on Linux yet. – JVApen Mar 17 '16 at 06:36