How to bulk load data into a dgraph/standalone:graphql container?

Question

Assuming I've a db like the quick-start of https://graphql.dgraph.io/docs/quick-start/

i.e.

type Product {
    productID: ID!
    name: String @search(by: [term])
    reviews: [Review] @hasInverse(field: about)
}

type Customer {
    custID: ID!
    name: String @search(by: [hash, regexp])
    reviews: [Review] @hasInverse(field: by)
}

type Review {
    id: ID!
    about: Product! @hasInverse(field: reviews)
    by: Customer! @hasInverse(field: reviews)
    comment: String @search(by: [fulltext])
    rating: Int @search
}

Now I would like to import millions of entries and therefore would like to use the bulk loader. My dataset is a bug folder full of .json files.

To what I've seen, I should be able to run a command like dgraph bulk -f folderOfJsonFiles -s goldendata.schema --map_shards=4 --reduce_shards=2 --http localhost:8000 --zero=localhost:5080

But to run my server, I am using the dgraph/standalone:graphql image ran docker run -v $(pwd):/dgraph -p 9000:9000 -it dgraph/standalone:graphql

Now how to start the bulk import ?

1: Should I run the command within the docker container itself (and share the volume (folder) containing all my .json files ) or install dgraph on my host and run the dgraph bulk command from the host ?

2: What should be the format of the .json files ?

3: Would the bulk loader support blank nodes (id which are not _:0x1234) ?

[edit]

bulk loader seems not to support graphql schema, the schema should be converted to rdf first. To achieve this, I exported the schema and data right after importing the graphql schema curl 'localhost:8080/admin/export?format=json'

score 0 · Answer 1 · answered Nov 16 '19 at 20:47

Here a few things to understand:

the bulk loader is not an offline version of the live loader. It is a tool which purpose is to prepare the data for the Dgraph Alpha(s) server(s).
the bulk loader, seems to be only able to load triples
the bulk loader can load a schema and files but this is not the graphql schema, the graphql schema must be loaded apart later.

So to answer the question:

start the dgraph graphql server using docker run -v $(pwd)/dgraph:/dgraph -p 8000:8000 -p 9000:9000 -p 8080:8080 -p 9080:9080 -p 5080:5080 -it dgraph/standalone:graphql for your information, this image launch the /tmp/run.sh script which will itself run dgraph-ratel & dgraph zero & dgraph alpha --lru_mb $lru_mb & dgraph graphql (where lru_mb is the memory you give to dgraph alpha). Keep the container's id for later find it using docker ps if you lost it.
Unless you have + 5 millions of entries (or no time), try using the live loader. If you have troubles with the live loader like: it became very slow after few hundred of thousands entries (300k in my case), this is very likely because your alpha does not have sufficient memory. In my case, I had to tune docker to provide 16Gb of memory to the engine, the script gives to the $lru_mb variable a third of the host memory.
Once you imported your full set of data using live loader, you can export the data using docker exec -it yourDockerContainerId curl localhost:8080/admin/export?format=json, the export will generate 2 files for instance: g01.json.gzand g01.schema.gz which corresponds to your entries and their schema (which is not the graphql schema).
To import those 2 files g01.json.gzand g01.schema.gz back to your dgraph graphql instance, you need to convert them to group’s "p" directory output. To what I understood, the "p" directory holds all the data for the Dgraph Alpha. If you delete it, you lose your data, if you replace it with another set, you will replace / restore the data with the one you just copied. Bulk loader is not an instance of dgraph, it is only the tool which will generate those "p" directory outputs. I have been successful running it within the container. Just run docker exec -it yourDockerContainerId dgraph bulk -f export/pathTo/g01.json.gz -s export/pathTo/g01.schema.gz --map_shards=1 --reduce_shards=1 --http localhost:8001 --zero=localhost:5080. I will be honest, I do not understand the purpose of the http localhost:8001 argument in this command. If the bulk loader ran successfully, it created an out/0/p folder containing the data you can use in your Dgraph Alpha. Stop your docker container docker stop yourDockerContainerId then Replace your current Dgraph Alpha's p folder with the one generated by bulk loader. (Re)start your docker container and you should have your imported data. (perhaps trash the w and zw folders as well, I have no clue about their use).
The data is imported but you will have an warning saying something like there is no graphql schema. Okay let's import our schema (assuming you have it at path dgraph/schemas/schema.graphql) schema=$(cat dgraph/schemas/schema.graphql | tr '\\n' ' ');jq -n --arg schema \"$schema\" '{ query: \"mutation addSchema($sch: String!) { addSchema(input: { schema: $sch }) { schema { schema } } }\", variables: { sch: $schema }}' | curl -X POST -H \"Content-Type: application/json\" http://localhost:9000/admin -d @- This might take few minutes as graph will likely have to index your data according to your graphql schema's indexing rule (typically related to the @search decorator)

You're done…

Now, I am still not completely answering the question because the data we are importing back is the one we just exported (and the one we actually imported using the live loader). So unfortunately, the bulk loader cannot import nice data like live loader, you have to feed him with triples. Therefore you have to prepare the data to load using bulk loader in that format. To help you in this talk, I suggest to

Run the dgraph graphql server docker run -v $(pwd)/dgraph:/dgraph -p 8000:8000 -p 9000:9000 -p 8080:8080 -p 9080:9080 -p 5080:5080 -it dgraph/standalone:graphql
import a graphql schema (assuming the schema is at path dgraph/schemas/schema.graphql ) schema=$(cat dgraph/schemas/schema.graphql | tr '\\n' ' ');jq -n --arg schema \"$schema\" '{ query: \"mutation addSchema($sch: String!) { addSchema(input: { schema: $sch }) { schema { schema } } }\", variables: { sch: $schema }}' | curl -X POST -H \"Content-Type: application/json\" http://localhost:9000/admin -d @-
create one or two basic / template entries using a graphql client. You can install the Altair chrome extension, connect to http://localhost:9000/graphql then add some data, something like:

mutation {
  addCustomer(input:{name:"Toto"}){
    name
  }
}

You can also using a file and the live loader

Then export your small template data docker exec -it yourDockerContainerId curl localhost:8080/admin/export?format=json
Open the g01.json.gz and you will find an example of the data the bulk loader expects to be fed with.

What about blank ids ? I am not sure but as the bulk loader is doing a 2 levels mapping on ids, I can imagine you can provide your ids and those will be converted to dgraph ids later.

Dgraph Bulk Loader also supports loading JSON data. https://docs.dgraph.io/deploy/#fast-data-loading — Daniel Mai, Dec 13 '19 at 21:59

How to bulk load data into a dgraph/standalone:graphql container?

1 Answers1