
I am trying to serialize a large (geometric) graph structure with the boost serialization library.

I store my graph as an adjacency list, that is, my structure is as follows:

class Node {
  double x,y;
  std::vector<Node*> adjacent_nodes;
  ...
};

class Graph {
  std::vector<Node*> nodes;
  ...
};

With more than 10k nodes, my problem is that when I serialize (save) my graph, serializing the first node recursively triggers the serialization of all the other nodes before it returns, because the graph is connected.

More precisely, serializing the Graph starts with the first node in the "nodes" vector. To do so, it has to serialize the "adjacent_nodes" of that first node, which contains, for example, the second node.

It therefore has to serialize the second node before the serialization of the first node can return, and so on, so the depth of the recursion grows with the number of nodes.
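The nesting described above can be illustrated without Boost. The following is only a sketch under assumptions: `DemoNode`, `visit`, `max_nesting` and `path_nesting` are hypothetical names, and the `tracked` set merely imitates the idea of pointer tracking (an object is written once, later occurrences are references). It is not how the library is implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for the question's Node; only the adjacency matters here.
struct DemoNode {
    std::vector<DemoNode*> adjacent_nodes;
};

// Imitates a pointer-tracking save: the first visit "writes" the node and
// descends into its neighbours; later visits would only write a reference.
void visit(const DemoNode* node, std::unordered_set<const DemoNode*>& tracked,
           std::size_t depth, std::size_t& max_depth) {
    assert(node != nullptr);
    if (!tracked.insert(node).second) return;  // already tracked: no recursion
    if (depth > max_depth) max_depth = depth;
    for (const DemoNode* next : node->adjacent_nodes)
        visit(next, tracked, depth + 1, max_depth);
}

// Deepest nesting level reached when the nodes are saved in vector order.
std::size_t max_nesting(const std::vector<DemoNode*>& nodes) {
    std::unordered_set<const DemoNode*> tracked;
    std::size_t max_depth = 0;
    for (const DemoNode* n : nodes) visit(n, tracked, 1, max_depth);
    return max_depth;
}

// Builds an undirected path 0 - 1 - ... - (n-1), linked in both directions
// like add_edge in the question, and reports the nesting depth of a save.
std::size_t path_nesting(std::size_t n) {
    std::vector<DemoNode> storage(n);
    std::vector<DemoNode*> nodes;
    for (auto& node : storage) nodes.push_back(&node);
    for (std::size_t i = 0; i + 1 < n; ++i) {
        nodes[i]->adjacent_nodes.push_back(nodes[i + 1]);
        nodes[i + 1]->adjacent_nodes.push_back(nodes[i]);
    }
    return max_nesting(nodes);
}
```

For a path of n nodes, `path_nesting(n)` reports a nesting depth equal to n: the save of the first node does not return until the last node has been reached.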

I found this thread from 2010, where someone described the exact same problem. However, the thread did not arrive at a working solution.

Any help would be greatly appreciated.

My structure in more detail:

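#include <vector>
#include <boost/serialization/access.hpp>
#include <boost/serialization/vector.hpp>  // serialization support for std::vector

class Node {

    double x, y;
    std::vector<Node*> adjacent_nodes;

public:

    inline double get_x() const { return x; }
    inline double get_y() const { return y; }
    inline std::vector<Node*> const& get_adjacent_nodes() const { return adjacent_nodes; }

    Node(double x, double y) : x(x), y(y) {}

    void add_adjacent(Node* other) {
        adjacent_nodes.push_back(other);
    }

private:

    // Default constructor, needed for deserialization.
    Node() {}

    friend class boost::serialization::access;
    template <class Archive>
    void serialize(Archive &ar, const unsigned int) {
        ar & x;
        ar & y;
        // Saving the neighbour pointers is what recurses into the neighbouring nodes.
        ar & adjacent_nodes;
    }

};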

class Simple_graph {

    std::vector<Node*> nodes;

    void add_edge(int firstIndex, int secondIndex) {
        nodes[firstIndex]->add_adjacent(nodes[secondIndex]);
        nodes[secondIndex]->add_adjacent(nodes[firstIndex]);
    }

public:

    /* methods to get the distance of points, to read in the nodes, and to generate edges */

    ~Simple_graph() {
        for (auto node : nodes) {
            delete node;
        }
    }

private:

    friend class boost::serialization::access;
    template <class Archive>
    void serialize(Archive &ar, const unsigned int) {
        ar & nodes;
    }

};

Edit: Here are some suggestions made in the above-mentioned thread, quoting Dominique Devienne:

1) save all the nodes without their topology info on a first pass of the vector, thus recording all the "tracked" pointers for them, then write the topology info for each, since then you don't "recurse", you only write a "ref" to an already serialized pointer.

2) have the possibility to write a "weak reference" to a pointer, which only adds the pointer to the "tracking" map with a special flag saying it wasn't "really" written yet, such that writing the topology of a node that wasn't yet written is akin to "forward references" to those neighboring nodes. Either the node will really be written later on, or it never will, and I suppose serialization should handle that gracefully.

#1 doesn't require changes in boost serialization, but puts the onus on the client code. Especially since you have to "externally" save the neighbors, so it's no longer well encapsulated, and writing a subset of the surface's nodes become more complex.

#2 would require seeking ahead to read the actual object when encountering a forward reference, and furthermore a separate map to know where to seek for it. That may be incompatible with boost serialization (I confess to be mostly ignorant about it).

Can any of those proposals be implemented by now?

cero

2 Answers


Since you already have a vector with pointers to all your nodes, you can serialize the adjacent_nodes vector using indexes instead of serializing the actual node data.

You'll need to convert the `this` pointer to an index when serializing a node. This is simplest if you can store the node index in the node itself; otherwise you'll have to search through nodes to find the right pointer (this search can be sped up by building an associative container that maps each pointer to its index).
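A minimal sketch of the save side, assuming a simplified node type and a plain text stream instead of a Boost archive so that it stays self-contained. `PlainNode`, `save_graph`, `save_to_string` and the line format (`x y degree idx ...`) are hypothetical choices made for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <ostream>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical simplified node with public members.
struct PlainNode {
    double x = 0, y = 0;
    std::vector<PlainNode*> adjacent_nodes;
};

// Writes the node count, then one line per node: "x y degree idx idx ...".
// Neighbours are written as indices into `nodes`, never as nested objects,
// so saving one node never descends into another.
void save_graph(std::ostream& out, const std::vector<PlainNode*>& nodes) {
    // Associative container mapping each pointer to its index.
    std::unordered_map<const PlainNode*, std::size_t> index_of;
    for (std::size_t i = 0; i < nodes.size(); ++i) index_of[nodes[i]] = i;

    out << nodes.size() << '\n';
    for (const PlainNode* n : nodes) {
        // Note: default stream precision; real code would raise it for doubles.
        out << n->x << ' ' << n->y << ' ' << n->adjacent_nodes.size();
        for (const PlainNode* adj : n->adjacent_nodes) {
            auto it = index_of.find(adj);
            assert(it != index_of.end());  // every neighbour must be in `nodes`
            out << ' ' << it->second;
        }
        out << '\n';
    }
}

// Convenience wrapper that returns the serialized text.
std::string save_to_string(const std::vector<PlainNode*>& nodes) {
    std::ostringstream out;
    save_graph(out, nodes);
    return out.str();
}
```

The map is built once per save, so looking up each neighbour takes expected constant time instead of a linear search through the vector.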

When you need to read in the data, you can create your initial nodes vector filled with pointers to empty/dummy nodes (which will get populated when they are serialized).

If that's not feasible, you can load the node indexes into a temporary array, then go back and populate the pointers once all the nodes have been read in. Either way, you won't have to seek or re-read any part of your file.
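The load side with a temporary index array could look roughly like this. It is a sketch under assumptions: `LoadedNode`, `load_graph` and `load_from_string` are hypothetical names, the input is a hypothetical text format (node count, then per node `x y degree idx ...`), a plain stream stands in for a Boost archive, and there is no handling of malformed input.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <memory>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical simplified node with public members.
struct LoadedNode {
    double x = 0, y = 0;
    std::vector<LoadedNode*> adjacent_nodes;
};

// Reads "count", then per node "x y degree idx idx ...".
std::vector<std::unique_ptr<LoadedNode>> load_graph(std::istream& in) {
    std::size_t count = 0;
    in >> count;
    std::vector<std::unique_ptr<LoadedNode>> nodes;
    nodes.reserve(count);
    // Temporary storage: neighbour indices of every node.
    std::vector<std::vector<std::size_t>> pending(count);

    // Pass 1: create all nodes and remember their neighbour indices.
    for (std::size_t i = 0; i < count; ++i) {
        auto node = std::make_unique<LoadedNode>();
        std::size_t degree = 0;
        in >> node->x >> node->y >> degree;
        pending[i].resize(degree);
        for (std::size_t& idx : pending[i]) in >> idx;
        nodes.push_back(std::move(node));
    }

    // Pass 2: every node exists now, so indices can be turned into pointers,
    // including references to nodes that appear later in the file.
    for (std::size_t i = 0; i < count; ++i) {
        for (std::size_t idx : pending[i]) {
            assert(idx < count);
            nodes[i]->adjacent_nodes.push_back(nodes[idx].get());
        }
    }
    return nodes;
}

// Convenience wrapper for loading from a string.
std::vector<std::unique_ptr<LoadedNode>> load_from_string(const std::string& text) {
    std::istringstream in(text);
    return load_graph(in);
}
```

The stream is read once, front to back. Here the nodes are owned by `std::unique_ptr`, which differs from the raw owning pointers in the question and is only a choice made for the sketch.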

1201ProgramAlarm
  • Thank you for your answer. You suggested that, when reading back in, I create dummy nodes which get populated afterwards. How would I fit this into the deserialization process? Is there a way to ensure that the new nodes will be created at a predetermined dummy location? Or can I predict where the nodes will be created? – cero Feb 09 '17 at 09:31
  • @cero I'm not familiar with the details of boost serialization, so I don't know if you could populate your vector with `Node *` before deserializing or if boost does that all the time. If that's the case, then you'll have to do something like I suggest in the last paragraph, where you keep all the adjacent node indexes, then go back and convert those to pointers when you're done reading everything in. If the adjacent nodes are always earlier in 'nodes' (so they'll have been deserialized already) then you can just reference the pointers that are created during the process. – 1201ProgramAlarm Feb 09 '17 at 19:53
  • After doing some more research, I suppose this is indeed the only way if I insist on using a data structure based on pointers. In my case I have replaced the vector of pointers by a vector of indices instead of a sophisticated two-way deserialization. Although not completely satisfying, I think your answer is as close as I can get. Thank you. I will mark your answer as accepted. – cero Feb 10 '17 at 00:11

If the graph does not contain any large cycles, you can sort the node vector so that the nodes at the "end" of the graph appear at the beginning of the vector.

Example: let's say we have:

p1->p2->p3->....->p1000

Serializing the vector v = {p1, p2, p3, ... , p1000} will fail, but serializing the vector v = {p1000, p999, p998, ... , p1} will work. However, you have no chance if you have something like

p1->p2->p3->....->p1000->p1 
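One way to compute such an ordering is a reverse topological sort. The sketch below makes several assumptions: the graph is given as index lists rather than `Node*` pointers, and `ends_first_order` and `make_chain` are hypothetical helper names. It returns an order in which every node comes after all the nodes it points to, or an empty vector when a cycle makes such an order impossible.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// successors[i] lists the nodes that node i points to.
// Returns an order in which each node appears after everything it points to.
// Returns an empty vector if the graph contains a cycle (or is empty).
std::vector<std::size_t> ends_first_order(
        const std::vector<std::vector<std::size_t>>& successors) {
    const std::size_t n = successors.size();
    std::vector<std::size_t> remaining(n);  // successors not yet placed
    std::vector<std::vector<std::size_t>> predecessors(n);
    for (std::size_t i = 0; i < n; ++i) {
        remaining[i] = successors[i].size();
        for (std::size_t s : successors[i]) {
            assert(s < n);
            predecessors[s].push_back(i);
        }
    }

    // Start with the "end" nodes, which point to nothing.
    std::vector<std::size_t> order;
    for (std::size_t i = 0; i < n; ++i)
        if (remaining[i] == 0) order.push_back(i);

    // A node becomes ready once all of its successors have been placed.
    for (std::size_t head = 0; head < order.size(); ++head)
        for (std::size_t p : predecessors[order[head]])
            if (--remaining[p] == 0) order.push_back(p);

    if (order.size() != n) order.clear();  // nodes on a cycle never become ready
    return order;
}

// Builds the chain 0 -> 1 -> ... -> n-1, optionally closed by n-1 -> 0.
std::vector<std::vector<std::size_t>> make_chain(std::size_t n, bool close_cycle) {
    std::vector<std::vector<std::size_t>> successors(n);
    for (std::size_t i = 0; i + 1 < n; ++i) successors[i].push_back(i + 1);
    if (close_cycle && n > 0) successors[n - 1].push_back(0);
    return successors;
}
```

For the open chain, the result starts at the last node and ends at the first one, matching the reversed vector above. For the closed chain, no valid order exists. The same holds when every edge is stored in both directions, as `add_edge` in the question does, because each edge then forms a two-node cycle.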
Saurabh Bhandari
Nikita