
I'm working with large graphs that are stored on disk in a binary format. Reading the graph from disk (SSD) and constructing it takes roughly an hour. Once constructed, the graph never changes. It takes roughly 50GB of memory, which isn't a problem for the server. However, we often want to run lots of experiments on the graph, and paying the hour of graph-loading time each run gets expensive. I'm wondering if there is any way to persist the object in memory so that a new JVM can essentially locate the existing object rather than rebuild it.

I know the JVM has memory sharing between processes, but I haven't seen anything that lets you share a whole object without serializing it to bytes (which would likely be slow, given the expensive reconstruction). Database solutions also seem slow because of the sheer size of the object (50GB). Since we aren't modifying the object (it's effectively static), I'm not concerned about concurrency issues between processes.

The best idea I've seen is to use a FileChannel to map the serialized object into memory from an always-running JVM and then have a second JVM read from that mapping to deserialize the object. Any other suggestions would be much appreciated!
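For concreteness, here is a minimal sketch of that mapping approach, assuming the serialized graph lives at a hypothetical path such as /dev/shm/graph.bin (on Linux, /dev/shm is a RAM-backed filesystem, so the mapped pages never touch the SSD):

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class GraphMapper {
    public static void main(String[] args) throws IOException {
        // Hypothetical path; any file works, but /dev/shm keeps it in RAM.
        Path graphFile = Paths.get("/dev/shm/graph.bin");
        try (FileChannel channel = FileChannel.open(graphFile, StandardOpenOption.READ)) {
            // A single MappedByteBuffer is capped at Integer.MAX_VALUE bytes,
            // so a 50GB file has to be mapped as a series of windows.
            long size = channel.size();
            long window = Integer.MAX_VALUE;
            int chunks = (int) ((size + window - 1) / window);
            MappedByteBuffer[] maps = new MappedByteBuffer[chunks];
            for (int i = 0; i < chunks; i++) {
                long offset = i * window;
                maps[i] = channel.map(FileChannel.MapMode.READ_ONLY,
                                      offset, Math.min(window, size - offset));
            }
            // Every JVM that maps the same file shares the same physical
            // pages via the OS page cache; nothing is copied per process.
        }
    }
}

The catch, as the comments below note, is that this shares bytes rather than objects: the graph would have to be read through the buffers (or laid out off-heap in a mapping-friendly format) instead of being reconstructed as an ordinary on-heap structure.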

David Jurgens
  • Any reason why you need 2 JVMs? Are they on the same host? What about regular old java object Serialization (post-reading and constructing)? If you can access the graph in segments, and are able to manage the offsets when you want to do work -- memory mapped files: https://howtodoinjava.com/java-7/nio/java-nio-2-0-memory-mapped-files-mappedbytebuffer-tutorial/ – Matt Pavlovich Jun 27 '18 at 03:41
  • @MattPavlovich I was hoping to use two JVMs so that one could "host" the large graph in memory persistently and then a second JVM can connect to grab the object as needed without paying the graph-construction cost. I'll likely create this second JVM multiple times (different analyses on the same graph). Also, yes, both JVMs would be on the same host. Serialization will work, but then I still have to pay the object construction cost, which is at least 20-30 minutes (it's a big graph). The memory-mapped file idea is what I was suggesting with the FileChannel object. :) – David Jurgens Jun 27 '18 at 04:32
  • Are you sure you really need an hour? Seems like an awful lot of time. Maybe you can speed up that part of it. Considerably. – user207421 Jun 27 '18 at 04:36
  • @EJP The graph is 10GB in a compressed binary format and has ~1B edges. The I/O takes 10-15 minutes alone and then construction of the graph data structure takes the remaining time. I would much rather it not take an hour, but we've already optimized this quite a bit and would just rather keep the object in memory for other JVMs to use. – David Jurgens Jun 27 '18 at 04:39
  • Well I don't see how serialization even to a memory-mapped file will be significantly faster than your existing format, and if it is maybe you should reconsider your existing format. – user207421 Jun 27 '18 at 04:56
  • @DavidJurgens are you saying that reading 10GB of data from SSD storage takes 10 to 15 minutes? Just curious. If you can share the graph structure and a simple sample of the persistence structure, others can provide better help. – gagan singh Jun 27 '18 at 04:57
  • Just keep the JVM running and send queries to it instead? – the8472 Jun 27 '18 at 07:35
  • Construct a graph in off-heap memory. Persist it between JVM restarts by mapping file(s) at `/dev/shm`. See [this question](https://stackoverflow.com/q/35241715/3448419) – apangin Jun 27 '18 at 08:05
  • You could try using ChronicleMap, as this is a Map built on a persisted memory-mapped file: off-heap, shared between processes, and very low cost for the first access. https://github.com/OpenHFT/Chronicle-Map – Peter Lawrey Jun 27 '18 at 09:54
  • @the8472 That might actually work. The ol' map-reduce trick of sending the computation to the data (see the sketch after these comments). – David Jurgens Jun 27 '18 at 14:58
  • “The I/O takes 10-15 minutes alone…”—your SSD has a transfer rate of less than 17MB per second? No, that can’t be “the I/O alone”. There’s obviously some processing involved, so I can only second @EJP’s comment: think about how to accelerate that process. If the original file can’t be processed more efficiently, how about converting it to something that you can process more easily? After all, that’s what your question is about: create something that all JVMs can share, i.e. something that is easier to import. – Holger Jul 03 '18 at 12:00
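A hedged sketch of the8472's suggestion above (the class name, port, and one-line protocol are all hypothetical): a single long-lived JVM pays the construction cost once and answers queries over a local socket, so the analysis JVMs never load the graph at all.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.HashMap;
import java.util.Map;

public class GraphServer {
    // Hypothetical stand-in for the real 50GB adjacency structure; the
    // point is that it is constructed exactly once, in this process.
    static final Map<Long, long[]> adjacency = new HashMap<>();

    public static void main(String[] args) throws IOException {
        adjacency.put(1L, new long[] {2L, 3L}); // real code: the hour-long load
        try (ServerSocket server = new ServerSocket(9999)) {
            while (true) {
                try (Socket client = server.accept();
                     BufferedReader in = new BufferedReader(
                             new InputStreamReader(client.getInputStream()));
                     PrintWriter out = new PrintWriter(client.getOutputStream(), true)) {
                    // Placeholder protocol: one node id per line in, its degree out.
                    String id;
                    while ((id = in.readLine()) != null) {
                        long[] nbrs = adjacency.get(Long.parseLong(id.trim()));
                        out.println(nbrs == null ? 0 : nbrs.length);
                    }
                }
            }
        }
    }
}

Whether this beats sharing memory depends on how chatty the analyses are; for algorithms that touch every edge many times, it may be better to ship the analysis code to the resident JVM rather than individual queries.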

1 Answer


I suggest using ChronicleMap (which I helped design).

It is:

  • persisted
  • shared between processes
  • off-heap
  • able to be larger than main memory
  • equipped with options to minimise the serialization cost.

e.g. https://github.com/OpenHFT/Chronicle-Map/blob/master/docs/CM_Tutorial.adoc

import java.io.File;
import net.openhft.chronicle.map.ChronicleMap;

interface PostalCodeRange {
    int minCode();
    void minCode(int minCode);

    int maxCode();
    void maxCode(int maxCode);
}

// The declared key type must match the of(...) call: keys here are city
// names, so CharSequence rather than Integer.
File cityPostalCodesFile = new File("city-postal-codes.dat"); // hypothetical path
ChronicleMap<CharSequence, PostalCodeRange> cityPostalCodes = ChronicleMap
    .of(CharSequence.class, PostalCodeRange.class)
    .averageKey("Amsterdam")
    .entries(50_000)
    .createOrRecoverPersistedTo(cityPostalCodesFile, false);

NOTE: The value in this case is a flyweight over off-heap memory, which allows you to access fields without deserializing an object.
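For illustration, a hedged usage sketch in the spirit of the linked tutorial (the Amsterdam postal codes are made-up values): Values.newHeapInstance creates an on-heap instance of the value interface for writing, and getUsing fills a reusable flyweight on reads so no new object is deserialized per lookup.

import net.openhft.chronicle.values.Values;

// Write: populate an on-heap instance of the value interface.
PostalCodeRange amsterdamCodes = Values.newHeapInstance(PostalCodeRange.class);
amsterdamCodes.minCode(1011); // illustrative values only
amsterdamCodes.maxCode(1183);
cityPostalCodes.put("Amsterdam", amsterdamCodes);

// Read: reuse a flyweight that points straight into off-heap memory.
PostalCodeRange using = Values.newNativeReference(PostalCodeRange.class);
cityPostalCodes.getUsing("Amsterdam", using);
System.out.println(using.minCode() + ".." + using.maxCode());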

Peter Lawrey
  • I don't think this helps in our case because I want the graph to be in main memory (no need for off-heap) and the program will eventually access everything in the graph, often multiple times, so there's no need for a flyweight design. We looked into something similar with Redis, but the TCP/IP overhead was too much. – David Jurgens Jun 27 '18 at 14:57
  • @DavidJurgens take away the TCP and serialize-to-heap costs and you get an off-heap data structure which can be stored in a memory-mapped file. – Peter Lawrey Jun 27 '18 at 15:32