Best way to get (millions of rows of) data into JanusGraph via TinkerPop, with a specific model

Question

I've just started out with TinkerPop and JanusGraph, and I'm trying to figure this out based on the documentation.

I have three datasets, each containing about 20 million rows (CSV files).
There is a specific model in which the variables and rows need to be connected, e.g. what are vertices, what are labels, what are edges, etc.
Once everything is in a graph, I'd of course like to use some basic Gremlin to see how well the model works.

But first I need a way to get the data into JanusGraph.

Possibly scripts for this already exist. Otherwise, is it something to be written in Python: open a CSV file, get each row of a variable X, and add it as a vertex/edge/etc.? Or am I completely misinterpreting JanusGraph/TinkerPop?

Thanks for any help in advance.

EDIT:

Say I have a few files, each of which contains a few million rows, representing people, and several variables, representing different metrics. A first example could look like this:

             metric_1    metric_2    metric_3    ..

person_1        a           e           i
person_2        b           f           j
person_3        c           g           k
person_4        d           h           l
..

Should I translate this into files of nodes that initially consist of just the values [a, ..., l] (and later perhaps more elaborate sets of properties)?

And are [a,..., l] then indexed?

The 'Modern' graph here seems to have an index (numbers 1, ..., 12 for all the nodes and edges, independent of their overlapping label/category). So should each measurement be indexed separately and then linked to the person_x to which it belongs?

Apologies for these probably straightforward questions, but I'm fairly new to this.

Does each dataset map to a different graph? Have you already configured a storage backend? — Benoit Guigal, Nov 14 '18 at 14:24
In this case there are several datasets (csv files) that should become one graph. (In another case I will use only one dataset.) For the storage backend: I've downloaded ScyllaDB and I performed step 1 & 2 https://www.scylladb.com/download/debian9/ -> since I only want to use this on my desktop, not in a cluster (yet) I have not done step 3. Should I? — nikolai, Nov 14 '18 at 16:13
Ok great. For testing purposes though, I would recommend using the script `bin/janusgraph.sh`, which will start Cassandra, Elasticsearch and a gremlin-server. You will then be free to change the storage backend later. — Benoit Guigal, Nov 14 '18 at 16:56
Thanks, I'll download Cassandra and do as stated here https://docs.janusgraph.org/latest/cassandra.html but do I need to use/download ElasticSearch as well? Also, does this not interfere with ScyllaDB? — nikolai, Nov 14 '18 at 17:25
If you use the script janusgraph.sh you do not have to download anything: Cassandra and Elasticsearch are packaged with JanusGraph. You do have to stop ScyllaDB, though, to avoid port binding conflicts. — Benoit Guigal, Nov 14 '18 at 18:01

score 7 · Answer 1 · answered Nov 16 '18 at 02:42

Well, the truth is that bulk loading real user data into JanusGraph is a real pain. I've been using JanusGraph since its very first version about 2 years ago and it's still a pain to bulk load data. A lot of that is not necessarily down to JanusGraph: different users have very different data, different formats and different graph models (i.e. some mostly need one vertex with one edge, e.g. child-mother, while others deal with one vertex with many edges, e.g. user followers). Last but definitely not least, the very nature of the tool means dealing with large data sets, and the underlying storage and index databases mostly come preconfigured to replicate massively (i.e. you might be thinking of 20m rows, but you actually end up inserting 60m or 80m entries).

All that said, I've had moderate success bulk loading some tens of millions of rows in decent timeframes (again, it will be painful, but here are the general steps).

Provide IDs when creating graph elements. If importing from e.g. MySQL, consider combining the table name with the id value to create unique IDs, e.g. users1, tweets2.
Don't specify a schema up front. Otherwise JanusGraph has to check that the data conforms to it on every insert.
Don't specify indexes up front. This is related to the point above but really deserves its own entry: bulk insert first, index later.
Please, please, please be aware of the underlying database's features for bulk inserts and activate them, i.e. read the Cassandra, ScyllaDB, Bigtable, etc. docs, especially on replication and indexing.
After all of the above, configure JanusGraph for bulk loading, make sure your data is consistent (i.e. no duplicate ids) and consider some way of parallelizing insert requests, e.g. some kind of map-reduce system.
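To illustrate the first and last steps, here is a minimal Python sketch of building IDs from a table name plus a row id and checking them for duplicates before loading. The function names and the sample rows are made up for illustration; nothing here talks to JanusGraph.

```python
from collections import Counter


def make_element_id(table_name, row_id):
    """Combine a source table name with its row id, e.g. ('users', 1) -> 'users1'."""
    return f"{table_name}{row_id}"


def find_duplicate_ids(ids):
    """Return every id that occurs more than once, in first-seen order."""
    counts = Counter(ids)
    return [element_id for element_id, n in counts.items() if n > 1]


# Hypothetical sample rows standing in for two source tables.
rows = [("users", 1), ("tweets", 2), ("users", 2), ("users", 1)]
ids = [make_element_id(table, pk) for table, pk in rows]
duplicates = find_duplicate_ids(ids)  # the repeated ("users", 1) row shows up here
```

Plain concatenation follows the users1 / tweets2 example above. If a table name can end in a digit, a separator such as ':' avoids collisions like ('users1', 1) versus ('users', 11).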

I think I've covered the major points. Again, there's no silver bullet here, and the process normally involves quite some trial and error, for example with the bulk insert rate: too low is bad (e.g. 10 per second), while too high is equally bad (e.g. 10k per second). It almost always depends on your data, so it's a case-by-case basis and I can't recommend where you should start.
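One way to experiment with insert rates is to send rows in fixed-size batches and cap the average rate. This is only a sketch under assumptions: insert_batch is a hypothetical callable that would wrap whatever actually writes to the graph, and batch_size and max_rows_per_second are the knobs to tune by trial and error.

```python
import time


def batches(rows, batch_size):
    """Yield consecutive lists of at most batch_size rows."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def load_throttled(rows, insert_batch, batch_size, max_rows_per_second,
                   sleep=time.sleep):
    """Send rows in batches, pausing so the average rate stays under the cap."""
    sent = 0
    for batch in batches(rows, batch_size):
        started = time.monotonic()
        insert_batch(batch)  # hypothetical: the real write to the graph goes here
        sent += len(batch)
        # Minimum time this batch should take at the target rate.
        min_duration = len(batch) / max_rows_per_second
        elapsed = time.monotonic() - started
        if elapsed < min_duration:
            sleep(min_duration - elapsed)
    return sent
```

For a dry run, something like load_throttled(rows, print, 1000, 2000) shows the batching without touching a database.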

All said and done, give it a real go. Bulk loading is the hardest part in my opinion, and the struggles are well worth the new dimension it gives your application.

All the best!

score 6 · Accepted Answer · answered Nov 14 '18 at 17:27

JanusGraph uses pluggable storage backends and indexes. For testing purposes, a script called bin/janusgraph.sh is packaged with the distribution. It lets you get up and running quickly by starting Cassandra and Elasticsearch (it also starts a gremlin-server, but we won't use it).

cd /path/to/janus
bin/janusgraph.sh start

Then I would recommend loading your data using a Groovy script. Groovy scripts can be executed with the Gremlin console:

bin/gremlin.sh -e scripts/load_data.script

An efficient way to load the data is to split it into two files:

nodes.csv: one line per node with all attributes
links.csv: one line per link with source_id and target_id and all the links attributes

This might require some data preparation steps.
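As a rough idea of what such a preparation step could look like for the wide table in the question, here is a Python sketch. It assumes one possible model that the question does not confirm: every person is a node, every distinct metric value is a node, and a link connects a person to each of its values. The column layout, the "metric:value" id scheme and the has_ link labels are all illustrative choices.

```python
import csv
import io


def split_wide_table(wide_file, nodes_file, links_file):
    """Turn one wide CSV (person column + metric columns) into node and link rows."""
    reader = csv.DictReader(wide_file)
    nodes = csv.writer(nodes_file, lineterminator="\n")
    links = csv.writer(links_file, lineterminator="\n")
    nodes.writerow(["id", "label"])
    links.writerow(["source_id", "target_id", "label"])

    id_column = reader.fieldnames[0]  # assumed: first column identifies the person
    seen_values = set()               # write each distinct value node only once
    for row in reader:
        person_id = row[id_column]
        nodes.writerow([person_id, "person"])
        for metric in reader.fieldnames[1:]:
            value_id = f"{metric}:{row[metric]}"
            if value_id not in seen_values:
                seen_values.add(value_id)
                nodes.writerow([value_id, metric])
            links.writerow([person_id, value_id, "has_" + metric])


# Tiny in-memory example; real use would pass opened files instead.
wide = io.StringIO("person,metric_1,metric_2\nperson_1,a,e\nperson_2,a,f\n")
nodes_out, links_out = io.StringIO(), io.StringIO()
split_wide_table(wide, nodes_out, links_out)
```

The set of seen values lives in memory, which may need rethinking if a metric has tens of millions of distinct values.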

Here is an example script

The trick to speed up the process is to keep a mapping between your id and the id created by JanusGraph during the creation of the nodes.
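The mapping idea can be sketched in Python as follows. This is not the script referred to above. InMemoryGraph is a hypothetical stand-in that only hands out ids, so the example runs without a server; a real loader would replace its two methods with the corresponding Gremlin calls.

```python
class InMemoryGraph:
    """Hypothetical stand-in for a graph connection; it only hands out ids."""

    def __init__(self):
        self._next_id = 1000
        self.vertices = {}
        self.edges = []

    def add_vertex(self, label, properties):
        self._next_id += 1
        self.vertices[self._next_id] = (label, properties)
        return self._next_id  # id assigned by the graph, not by us

    def add_edge(self, source_graph_id, target_graph_id, label):
        self.edges.append((source_graph_id, target_graph_id, label))


def load(graph, node_rows, link_rows):
    """Create vertices first, remembering own id -> graph id, then create edges."""
    graph_id_of = {}
    for own_id, label in node_rows:
        graph_id_of[own_id] = graph.add_vertex(label, {"own_id": own_id})
    for source_id, target_id, label in link_rows:
        # Dictionary lookups replace a per-edge search of the graph by property.
        graph.add_edge(graph_id_of[source_id], graph_id_of[target_id], label)
    return graph_id_of
```

The point of the dictionary is that each link row resolves its two endpoints in memory instead of querying the graph for them.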

Even if it is not mandatory, I strongly recommend creating an explicit schema for your graph before loading any data. Here is an example script

Thanks, this is very helpful! I've updated my question to specify one bit, as I'm also not sure what is exactly meant by 'Id'. — nikolai, Nov 14 '18 at 18:32
You're welcome. I am not sure I understand your graph structure. Are the persons in your files the vertices of the graph? What are the links between the persons? I recommend reading this presentation (Slides 5 to 10) https://www.slideshare.net/ptgoetz/large-scale-graph-analytics-with-janusgraph about JanusGraph and graph structure. — Benoit Guigal, Nov 15 '18 at 08:40
