
I have a program that currently generates large arrays and matrices that can be upwards of 10GB in size. The program uses MPI to parallelize workloads, but is limited by the fact that each process needs its own copy of the array or matrix in order to perform its portion of the computation. The memory requirements make this problem infeasible with a large number of MPI processes, so I have been looking into Boost.Interprocess as a means of sharing data between MPI processes.

So far, I have come up with the following which creates a large vector and parallelizes the summation of its elements:

#include <cstdlib>
#include <ctime>
#include <functional>
#include <iostream>
#include <string>
#include <utility>

#include <boost/interprocess/managed_shared_memory.hpp>
#include <boost/interprocess/containers/vector.hpp>
#include <boost/interprocess/allocators/allocator.hpp>
#include <unistd.h> // for sleep()
#include <mpi.h>

typedef boost::interprocess::allocator<double, boost::interprocess::managed_shared_memory::segment_manager> ShmemAllocator;
typedef boost::interprocess::vector<double, ShmemAllocator> MyVector;

const std::size_t vector_size = 1000000000;
const std::string shared_memory_name = "vector_shared_test.cpp";

int main(int argc, char **argv) {
    int numprocs, rank;

    MPI::Init();
    numprocs = MPI::COMM_WORLD.Get_size();
    rank = MPI::COMM_WORLD.Get_rank();

    if(numprocs >= 2) {
        if(rank == 0) {
            std::cout << "On process rank " << rank << "." << std::endl;
            std::time_t creation_start = std::time(NULL);

            // Remove any leftover segment from a previous run, then create a
            // 12 GB segment (room for 1e9 doubles = 8 GB, plus container overhead).
            boost::interprocess::shared_memory_object::remove(shared_memory_name.c_str());
            boost::interprocess::managed_shared_memory segment(boost::interprocess::create_only, shared_memory_name.c_str(), size_t(12000000000));

            std::cout << "Size of double: " << sizeof(double) << std::endl;
            std::cout << "Allocated shared memory: " << segment.get_size() << std::endl;

            const ShmemAllocator alloc_inst(segment.get_segment_manager());

            MyVector *myvector = segment.construct<MyVector>("MyVector")(alloc_inst);

            std::cout << "myvector max size: " << myvector->max_size() << std::endl;

            // Fill the shared vector with 0, 1, 2, ..., vector_size - 1.
            for(std::size_t i = 0; i < vector_size; i++) {
                myvector->push_back(double(i));
            }

            std::cout << "Vector capacity: " << myvector->capacity() << " | Memory Free: " << segment.get_free_memory() << std::endl;

            std::cout << "Vector creation successful and took " << std::difftime(std::time(NULL), creation_start) << " seconds." << std::endl;
        }

        std::flush(std::cout);
        MPI::COMM_WORLD.Barrier();

        std::time_t summing_start = std::time(NULL);

        std::cout << "On process rank " << rank << "." << std::endl;
        // Every rank (including rank 0) opens the existing segment and looks up
        // the vector by name; this maps the whole segment into the process's
        // address space.
        boost::interprocess::managed_shared_memory segment(boost::interprocess::open_only, shared_memory_name.c_str());

        MyVector *myvector = segment.find<MyVector>("MyVector").first;
        double result = 0;

        // Strided partial sum: rank r handles elements r, r + numprocs, r + 2*numprocs, ...
        for(std::size_t i = rank; i < myvector->size(); i = i + numprocs) {
            result = result + (*myvector)[i];
        }
        double total = 0;
        MPI::COMM_WORLD.Reduce(&result, &total, 1, MPI::DOUBLE, MPI::SUM, 0);

        std::flush(std::cout);
        MPI::COMM_WORLD.Barrier();

        if(rank == 0) {
            std::cout << "On process rank " << rank << "." << std::endl;
            std::cout << "Vector summing successful and took " << std::difftime(std::time(NULL), summing_start) << " seconds." << std::endl;

            std::cout << "The arithmetic sum of the elements in the vector is " << total << std::endl;
            segment.destroy<MyVector>("MyVector");
        }

        std::flush(std::cout);
        MPI::COMM_WORLD.Barrier();

        boost::interprocess::shared_memory_object::remove(shared_memory_name.c_str());
    }

    sleep(300);
    MPI::Finalize();

    return 0;
}

I noticed that this causes the entire shared object to be mapped into each process's virtual address space - which is an issue on our computing cluster, as it limits virtual memory to be the same as physical memory. Is there a way to share this data structure without having to map the entire shared memory region into every process - perhaps in the form of sharing a pointer of some kind? Would trying to access unmapped shared memory even be defined behavior? Unfortunately, the operations we are performing on the array mean that each process eventually needs to access every element in it (although not concurrently - I suppose it's possible to break up the shared array into pieces and trade portions of the array for those you need, but this is not ideal).
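
For reference, the kind of partial mapping I have in mind would look roughly like the fragment below. This is only a sketch: it assumes rank 0 had written the raw doubles into a plain shared_memory_object (given the hypothetical name "raw_doubles" here) rather than into a managed segment, since as far as I can tell the offset/length arguments of mapped_region only apply to such raw mappings, and the offset has to be a multiple of the page size.

// Sketch only: each rank maps just its own slice of a plain shared_memory_object
// that rank 0 has already filled with vector_size raw doubles ("raw_doubles" is
// a hypothetical name). Uses rank, numprocs and vector_size from the code above,
// and assumes vector_size is divisible by numprocs for simplicity.
#include <boost/interprocess/shared_memory_object.hpp>
#include <boost/interprocess/mapped_region.hpp>

namespace bip = boost::interprocess;

bip::shared_memory_object shm(bip::open_only, "raw_doubles", bip::read_only);

const std::size_t page           = bip::mapped_region::get_page_size();
const std::size_t bytes_per_rank = (vector_size / numprocs) * sizeof(double);
const std::size_t raw_offset     = std::size_t(rank) * bytes_per_rank;
const std::size_t aligned_offset = (raw_offset / page) * page;   // offset must be page-aligned
const std::size_t padding        = raw_offset - aligned_offset;

// Only bytes_per_rank (plus alignment padding) of the object is mapped into
// this process's address space, instead of the whole thing.
bip::mapped_region region(shm, bip::read_only, aligned_offset, bytes_per_rank + padding);
const double *slice = reinterpret_cast<const double *>(
    static_cast<const char *>(region.get_address()) + padding);
// slice[0 .. vector_size/numprocs - 1] is this rank's portion.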

Andrew O
  • "which is an issue with our computing cluster as it limits virtual memory to be the same as physical memory" - why does that matter? Per-process virtual address space (the addresses at which a program may map physical memory) is distinct from virtual memory (which is about simulating extra physical memory using disk space as swap). Are you saying your individual processes are running out of virtual address space? – Tony Delroy Jul 10 '13 at 08:03
  • Hmm, I don't think so: it is a 64-bit machine, so it'd be difficult to run out of virtual address space. We have no disk drives to act as swap on the cluster, so there is no paging area, and thus once we run out of physical memory, that's it. I might be wrong on this, but I think accessing a shared memory object using this library actually maps the entirety of the object into each process's address space - and this costs memory, causing the memory usage to go up? – Andrew O Jul 10 '13 at 08:16
  • "and this costs memory causing the memory usage to go up?" - not unless you've explicitly used shared memory in a make-a-private-copy-on-write mode and then written to it. Otherwise, across the whole host there'll still only be one physical memory page used for any given part of the data. Perhaps you can calculate when it would fail if memory were duplicated, then try to overload it? – Tony Delroy Jul 10 '13 at 08:22
  • It seems to me that you might be trying to solve a high level algorithmic problem with low level technical details. If your algorithm is using huge data and every process needs to access everything then it will inherently not scale well. Are you sure there is no way for a better decomposition of the global data and a different distribution of work based on locality? – Zulan Jul 10 '13 at 09:06
  • It would be helpful to know more about the access patterns to the data, especially regarding updates. – Zulan Jul 10 '13 at 09:07
  • That would have been the behavior I was expecting too. In the provided example the first process creates a vector of doubles totaling 8 GB in shared memory. However, an issue arises when the second part of the code is executed, where each process picks up a pointer to the vector in shared memory and attempts to read from it - I would expect only one physical copy of the vector to be present at any one time, but it seems the system at the very least reserves memory for each process reading it (memory usage goes up at this point), causing memory issues. – Andrew O Jul 10 '13 at 09:09
  • Thanks for your insight, Zulan. Potential operations like row reduction of matrices would need to grab rows and replace them with new values obtained by subtracting other rows of the matrix - so the algorithm minimally requires write access to one row at a time, but needs read access to multiple (if not all) rows. – Andrew O Jul 10 '13 at 09:13
  • In Boost.Interprocess information can be shared using a file, kernel or memory. Since you are using a cluster, the only option would be file, which means Boost.Interprocess is just an abstraction of the Answer suggested by jxh. Therefore the performance issues I mention there apply. General shared memory abstractions for message passing systems are always very leaky with respect to performance. For an efficient solution you need to consider the high level characteristics of the problem/algorithm. I really think you actually want to *distribute* the data and not *share* it. – Zulan Jul 10 '13 at 09:39
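
To illustrate Zulan's "distribute, don't share" suggestion, one possible decomposition of the summation example might look like the sketch below, with each rank owning one contiguous block rather than reading from shared memory (it assumes, for simplicity, that vector_size divides evenly by the number of ranks). For the row-reduction pattern mentioned in the comments, a rank that needs a block owned by another rank would fetch just that block with point-to-point messages or a broadcast, so only a few blocks ever need to be resident in one process at a time.

#include <cstddef>
#include <iostream>
#include <vector>

#include <mpi.h>

int main(int argc, char **argv) {
    MPI::Init(argc, argv);
    const int numprocs = MPI::COMM_WORLD.Get_size();
    const int rank     = MPI::COMM_WORLD.Get_rank();

    const std::size_t vector_size = 1000000000;
    const std::size_t block       = vector_size / numprocs;   // elements owned by this rank
    const std::size_t begin       = std::size_t(rank) * block;

    // Each rank generates (or reads) only its own block; nothing is duplicated.
    std::vector<double> mine(block);
    for(std::size_t i = 0; i < block; i++) {
        mine[i] = double(begin + i);
    }

    // Local partial sum over the owned block.
    double partial = 0;
    for(std::size_t i = 0; i < block; i++) {
        partial += mine[i];
    }

    // Combine the partial sums on rank 0.
    double total = 0;
    MPI::COMM_WORLD.Reduce(&partial, &total, 1, MPI::DOUBLE, MPI::SUM, 0);

    if(rank == 0) {
        std::cout << "The arithmetic sum of the elements is " << total << std::endl;
    }

    MPI::Finalize();
    return 0;
}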

1 Answer


Since the data you want to share is so large, it may be more practical to treat the data as a true file and use file operations to read the portion you want. Then you do not need shared memory at all; just let each process read directly from the file system.

std::ifstream file("data.dat", std::ios::in | std::ios::binary);
file.seekg(someOffset, std::ios::beg);                       // someOffset: where this process's portion starts
file.read(reinterpret_cast<char*>(array), sizeof(array));    // array: a local buffer for that portion
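
For example, in the poster's summation code each rank could compute its own offset from its rank and read just its portion of the file. A sketch, assuming the doubles were written to "data.dat" in native binary format and reusing rank, numprocs and vector_size from the question:

#include <fstream>
#include <vector>

// Each rank reads only its own contiguous chunk of doubles
// (assumes vector_size is divisible by numprocs).
const std::size_t block = vector_size / numprocs;
std::vector<double> chunk(block);

std::ifstream file("data.dat", std::ios::in | std::ios::binary);
const std::streamoff offset = static_cast<std::streamoff>(rank) *
                              static_cast<std::streamoff>(block * sizeof(double));
file.seekg(offset, std::ios::beg);
file.read(reinterpret_cast<char*>(&chunk[0]), block * sizeof(double));

// Local partial sum, then combine on rank 0 as in the question.
double partial = 0;
for(std::size_t i = 0; i < block; i++)
    partial += chunk[i];

double total = 0;
MPI::COMM_WORLD.Reduce(&partial, &total, 1, MPI::DOUBLE, MPI::SUM, 0);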
jxh
  • Thanks for the suggestion. My impression is that disk speeds will quickly become a serious bottleneck when you try to scale up the number of processors. It warrants some testing nevertheless. – Andrew O Jul 10 '13 at 09:02
  • This is not a good idea. MPI has a much better chance to solve this issue with reasonable performance than the parallel file system. Also, as soon as there are updates to the data, the file system will kill you. – Zulan Jul 10 '13 at 09:03
  • @Zulan: Say you need a solution for a 2 GB system. Then, mapping everything into virtual memory will incur paging anyway. The only way to make the shared memory solution efficient is to get more RAM. If you replace your disk with an SSD on your 2 GB system, you may achieve an acceptable level of performance. – jxh Jul 10 '13 at 09:11
  • @jxh: this is not true for a parallel (cluster) system, especially one with no local disk. It is vastly cheaper to copy data between the RAM of different nodes than between RAM and the parallel file system. Furthermore, accessing the parallel file system with low-level calls from each node often results in very bad performance compared to reading it from one node and distributing it between the nodes. This is the reason why high-level parallel I/O libraries exist. Your idea may work well for an SMP system with a local ext3, but for a cluster with e.g. Lustre it will not. – Zulan Jul 10 '13 at 09:20
  • @Zulan: I am assuming a local disk for each node. I am assuming a one time transfer of data. I am assuming messages will be passed to express relevant diffs to the file from other nodes. I am assuming local changes for the final result can be stored locally, and that the final result can be gathered at the end. That's a lot of assumptions, but it's how I typically architect a distributed solution. – jxh Jul 10 '13 at 09:25
  • @user2567440: You will need an SSD, and either asynchronous I/O or multiple threads, to achieve maximal disk throughput. – jxh Jul 10 '13 at 19:05