
I have a static graph (the topology does not change over time and is known at compile time) where each node can be in one of three states. I then simulate dynamics in which a node has a probability of changing its state over time, and this probability depends on the states of its neighbors. As the graph grows larger the simulations get very slow; after some profiling, I found that most of the computation time is spent iterating over each node's list of neighbors.

I was able to improve the speed of the simulations by changing the data structure used to access neighbors in the graph but was wondering if there are better (faster) ways to do it. My current implementation goes like this:

For a graph with N nodes labeled from 0 to N-1 and an average of K neighbors per node, I store each node's state as an integer in an std::vector<int> states and the number of neighbors of each node in an std::vector<int> number_of_neighbors.

To store the neighbor information I created two more vectors: an std::vector<int> neighbor_lists which stores, in order, the nodes that are neighbors of node 0, node 1, ..., node N-1, and an index vector std::vector<int> index which stores, for each node, the index of its first neighbor in neighbor_lists.

So I have four vectors in total:

printf( "%zu\n", states.size()              );    // N
printf( "%zu\n", number_of_neighbors.size() );    // N
printf( "%zu\n", neighbor_lists.size()      );    // N * K
printf( "%zu\n", index.size()               );    // N
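
For a concrete picture, here is roughly how these vectors can be built from an adjacency list (a tiny hard-coded example graph, not the real topology):

#include <vector>

void build_flat_graph() {
    // tiny 4-node example graph as an adjacency list (illustrative only)
    std::vector<std::vector<int>> adjacency = { {1, 2}, {0}, {0, 3}, {2} };

    std::vector<int> states(adjacency.size(), 0);   // all nodes start in state 0
    std::vector<int> number_of_neighbors, neighbor_lists, index;

    for (const auto& nbrs : adjacency) {
        index.push_back(static_cast<int>(neighbor_lists.size()));    // first neighbor of this node
        number_of_neighbors.push_back(static_cast<int>(nbrs.size()));
        neighbor_lists.insert(neighbor_lists.end(), nbrs.begin(), nbrs.end());
    }
}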

When updating node i I access its neighbors like so:

// access neighbors of node i:
for ( int s=0; s<number_of_neighbors[i]; s++ ) {
    int neighbor_node = neighbor_lists[index[i] + s];
    int state_of_neighbor = states[neighbor_node];

    // use neighbor state for stuff...
}

To sum up my question then: is there a faster implementation for accessing neighboring nodes in a fixed graph structure?

Currently, I've gone up to N = 5000 for a decent amount of simulation time, but I was aiming for N ~ 15,000 if at all possible.

Kevin Liu
  • just to know... the order of magnitude of `N` is...? – max66 Oct 15 '17 at 12:34
  • Some iterations run faster on a GPU than on a CPU, but I've never looked at how to do that in C++. I've only seen in a lesson that it's possible with pragmas. – Brighter side Oct 15 '17 at 12:35
  • Updated the question with the magnitude of N (~1.5e4). I have 32GB of ram available so I could do some estimates of how big an array I could declare. Thanks. – Kevin Liu Oct 15 '17 at 14:27
  • How often does a node change states? How many neighbors does a node have (average and max)? If changes are rare, you may be able to store statistics about neighbors and update all neighbors when a node changes, instead of iterating over them to get those statistics. – Kenny Ostrom Oct 15 '17 at 14:40

2 Answers


It's important to know the order of magnitude of N because, if it isn't too high, you can exploit the fact that you know the topology at compile time: put the data in std::arrays of known dimensions (instead of std::vectors), use the smallest possible types to (if necessary) save stack memory, and define some of them (all but states) as constexpr.

So, if N isn't too big (stack limit!), you can define

  • states as an std::array<std::uint_fast8_t, N> (8 bits are more than enough for 3 states)

  • number_of_neighbors as a constexpr std::array<std::uint_fast8_t, N> (if the maximum number of neighbors is less than 256; a bigger type otherwise)

  • neighbor_list as a constexpr std::array<std::uint_fast16_t, M> (where M is the known sum of the numbers of neighbors) if 16 bits are enough to index the N nodes; a bigger type otherwise

  • index as a constexpr std::array<std::uint_fast16_t, N> if 16 bits are enough to hold M; a bigger type otherwise

I think (and hope) that with arrays of known dimensions that are constexpr (where possible), the compiler can generate faster code.
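
For illustration, a minimal sketch of this layout with a tiny hypothetical 5-node topology (the values are placeholders; in practice N, M and the array contents would be generated from the real graph):

#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t N = 5;  // number of nodes (placeholder)
constexpr std::size_t M = 8;  // sum of the numbers of neighbors (placeholder)

// runtime data: one state (0, 1 or 2) per node
std::array<std::uint_fast8_t, N> states{};

// compile-time topology
constexpr std::array<std::uint_fast8_t, N>  number_of_neighbors{ 2, 1, 2, 2, 1 };
constexpr std::array<std::uint_fast16_t, M> neighbor_list{ 1, 2,  0,  0, 3,  2, 4,  3 };
constexpr std::array<std::uint_fast16_t, N> index{ 0, 2, 3, 5, 7 };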

Regarding the updating code... I'm an old C programmer, so I'm used to hand-optimizing code in ways that modern compilers already do better, and I don't know if the following is a good idea; anyway, I would write the code like this

auto first = index[i];
auto top   = first + number_of_neighbors[i];

for ( auto s = first ; s < top ; ++s ) {
   auto neighbor_node = neighbor_lists[s];
   auto state_of_neighbor = states[neighbor_node];

   // use neighbor state for stuff...
}

-- EDIT --

The OP specifies that

Currently, I've gone up to N = 5000 for a decent number of simulation time, but I was aiming for N ~ 15.000 if at all possible.

So 16 bits should be enough for the entries of neighbor_list (node indices stay below 15,000), while the entries of index must hold values up to M and may need 32 bits if M exceeds 65,535. With that:

  • states and number_of_neighbors are about 15 kB each (30 kB each with 16-bit entries)

  • index is about 30 kB with 16-bit entries (60 kB if 32-bit entries are needed).

These seem to me like reasonable sizes for stack variables.

The problem could be neighbor_list: if the average number of neighbors is low, say 10 to fix a number, then M (the sum of neighbors) is about 150,000, so neighbor_list is about 300 kB; not small, but reasonable in some environments.

If the average is high -- say 100, to fix another number -- neighbor_list becomes about 3 MB; that could be too high for some environments.

max66
  • You might also put the arrays in a struct to force them to be adjacent in memory and maybe improve locality. (Though if they all have static storage duration and are defined next to each other, that will probably happen anyway.) – aschepler Oct 15 '17 at 13:27
  • I will try your suggestions and then update this thread, thanks. – Kevin Liu Oct 15 '17 at 14:33
  • Is it possible to declare a constexpr using values stored in a file? Or would I need to copy-paste the constexpr array values into the source code directly? – Kevin Liu Oct 15 '17 at 15:19
  • @KevinLiu - do you mean initializing a constexpr variable at compile time by reading values from a file? No; as far as I know that isn't possible. The best I can imagine is a first C++ (or gawk, or shell) program that reads the file at run time and generates a second C++11 source file with the array values written directly in the code. Nothing that a good makefile can't manage. – max66 Oct 15 '17 at 15:25
  • @KevinLiu - answer improved considering `N == 15000`. – max66 Oct 15 '17 at 15:51

Currently you are accessing sum(K) nodes per iteration. That doesn't sound so bad ... until you consider how the cache is accessed.

For fewer than 2^16 nodes you only need a uint16_t to identify a node, but with K neighbours per node you will need a uint32_t to index into the neighbour list. The 3 states can, as already mentioned, be stored in 2 bits.

So having

// offsets into the neighbour list, N elements, 16K*4 bytes = 64KB;
// entry i is really where node i+1's neighbours start, as node 0 starts at zero.
std::vector<uint32_t> nbOffset;
// states of your nodes, N elements, 16K*1 byte = 16KB
std::vector<uint8_t> states;
// list of all neighbour relations,
// sum(K) can exceed 2^16 (hence the 32-bit offsets); sum(K) elements, sum(K)*2 bytes
// (e.g. for average K=16: 16K*16*2 bytes = 512KB)
std::vector<uint16_t> nbList;

Your code:

// access neighbors of node i:
for ( int s=0; s<number_of_neighbors[i]; s++ ) {
    int neighbor_node = neighbor_lists[index[i] + s];
    int state_of_neighbor = states[neighbor_node];

    // use neighbor state for stuff...
}

rewriting your code to

uint32_t curNb = 0;                      // running index into nbList
for (auto curOffset : nbOffset) {        // one linear pass over all nodes, in order
  for (; curNb < curOffset; curNb++) {   // neighbours of the current node
    int neighbor_node = nbList[curNb];   // done away with one indirection.
    int state_of_neighbor = states[neighbor_node];

    // use neighbor state for stuff...
  }
}

So to update one node you read its current state from states, read its offsets from nbOffset, use those to look up its entries in nbList, and use each index from nbList to look up the neighbour's state in states.

The first two will most likely already be in L1$ if you run linearly through the list. Reading the first value from nbList for each node might also be in L1$ if you process the nodes linearly; otherwise it will most likely cause an L1$ miss and likely an L2$ miss, while the following reads will be hardware-prefetched.

Reading linearly through the nodes has the added advantage that each neighbour list is only read once per iteration over the node set, and therefore the likelihood that states stays in L1$ increases dramatically.

Decreasing the size of states could further improve the chance that it stays in L1$: with a little bit manipulation you can store four 2-bit states in each byte, reducing the size of states to 4KB. So depending on how much "stuff" you do, you could have a very low cache miss rate.
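
A minimal sketch of that packing, assuming four 2-bit states per byte (the helper names are made up for illustration):

#include <cstdint>
#include <vector>

// packed holds (N+3)/4 bytes; node i's state sits in byte i/4, at bit offset 2*(i%4)
inline int get_state(const std::vector<uint8_t>& packed, int i) {
    return (packed[i >> 2] >> ((i & 3) * 2)) & 0x3;
}

inline void set_state(std::vector<uint8_t>& packed, int i, int s) {
    const int shift = (i & 3) * 2;
    packed[i >> 2] = static_cast<uint8_t>((packed[i >> 2] & ~(0x3 << shift)) | ((s & 0x3) << shift));
}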

But if you jump around between nodes and do "stuff", the situation quickly gets worse, inducing a nearly guaranteed L2$ miss for nbList and potential L1$ misses for the current node and the K accesses to states. This could lead to slowdowns by a factor of 10 to 50.

If you're in the latter scenario with random access, you should consider storing an extra copy of each state in the neighbour list, saving the cost of accessing states K times. You have to measure whether this is faster.
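
A sketch of what that could look like (types chosen to match the vectors above; keeping the copies in sync is the extra cost you have to measure):

#include <cstdint>
#include <vector>

struct NeighborEntry {
    uint16_t node;   // index of the neighbour
    uint8_t  state;  // cached copy of states[node]; must be rewritten in every
                     // neighbour's entry whenever that node changes state
};

// would replace std::vector<uint16_t> nbList in the random-access case
std::vector<NeighborEntry> nbList;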

Regarding baking the data into the program: you would gain a little from not having to go through the vector, but in this case I would estimate the gain at less than 1%.

If you in-line and constexpr aggressively enough, your compiler will boil your computer for years and reply "42" as the final result of the program. You have to find a middle ground.

Surt
  • I know it is not advised, but I'll leave a thank you for your answer here for now. When I get to test the suggestions I'll update the question. Thanks! – Kevin Liu Oct 27 '17 at 10:34