
I am trying to insert 624,118,983 records split across 1,000 files; loading them all into Neptune takes 35 hours, which is very slow. I have configured a cluster with two db.r5.large instances. The 1,000 files are stored in an S3 bucket, and I submit a single load request pointing to the S3 folder that contains them. When I check the load status I get the response below.

{
    "status" : "200 OK",
    "payload" : {
        "feedCount" : [
            {
                "LOAD_NOT_STARTED" : 640
            },
            {
                "LOAD_IN_PROGRESS" : 1
            },
            {
                "LOAD_COMPLETED" : 358
            },
            {
                "LOAD_FAILED" : 1
            }
        ],
        "overallStatus" : {
            "fullUri" : "s3://myntriplesfiles/ntriple-folder/",
            "runNumber" : 1,
            "retryNumber" : 0,
            "status" : "LOAD_IN_PROGRESS",
            "totalTimeSpent" : 26870,
            "startTime" : 1639289761,
            "totalRecords" : 224444549,
            "totalDuplicates" : 17295821,
            "parsingErrors" : 1,
            "datatypeMismatchErrors" : 0,
            "insertErrors" : 0
        }
    }
}
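
(For reference, the status above comes from the loader Get-Status API; a request of roughly the following form returns that payload, where <load-id> is a placeholder for the id returned when the load request was submitted.)

curl 'https://neptune-hostname:8182/loader/<load-id>?details=true&errors=true'
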
What I see here is that LOAD_IN_PROGRESS is always 1, which means Neptune is not loading multiple files in parallel. How do I tell Neptune to load the 1,000 files with some degree of parallelism, for example a parallelization factor of 10? Am I missing any configuration?

This is how I use the bulk load API.

curl -X POST -H 'Content-Type: application/json' https://neptune-hostname:8182/loader -d '
{
"source" : "s3://myntriplesfiles/ntriple-folder/",
"format" : "nquads",
"iamRoleArn" : "my aws arn values goes here",
"region" : "us-east-2",
"failOnError" : "FALSE",
"parallelism" : "HIGH",
"updateSingleCardinalityProperties" : "FALSE",
"queueRequest" : "FALSE"
}'

Please advise.

– Jigar Gajjar

2 Answers


The Amazon Neptune bulk loader does not load multiple files in parallel, but it does divide up the contents of each file among the available worker threads on the writer instance (limited by how you have the parallelism property set on the load command). If you have no other writes pending during the load period, you can set that field to OVERSUBSCRIBE, which will use all available worker threads.

Secondly, larger files are better than smaller files, as they give the worker threads more work they can do in parallel. Thirdly, using a larger writer instance just for the duration of the load will provide many more worker threads that can take on load tasks. The number of worker threads available on an instance is approximately twice the number of vCPUs it has. Quite often, people will use something like a db.r5.12xlarge just for the bulk load (for large loads) and then scale back to something much smaller for regular query workloads.
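
As a minimal sketch, the load request from the question would only need the parallelism field changed (assuming no other writes are in flight during the load):

curl -X POST -H 'Content-Type: application/json' https://neptune-hostname:8182/loader -d '
{
"source" : "s3://myntriplesfiles/ntriple-folder/",
"format" : "nquads",
"iamRoleArn" : "my aws arn values goes here",
"region" : "us-east-2",
"failOnError" : "FALSE",
"parallelism" : "OVERSUBSCRIBE",
"updateSingleCardinalityProperties" : "FALSE",
"queueRequest" : "FALSE"
}'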

– Kelvin Lawrence
  • Thanks for your answer. If I submit multiple bulk load requests, each pointing to a single file, instead of one request pointing to the bucket, will it help load the data faster? For example, submit 10 files at a time, wait for completion, and then submit another 10. – Jigar Gajjar Dec 14 '21 at 16:31
  • Neptune will try to load vertices and then edges, but always one file at a time, so it should not matter. If you have files you would like loaded first, you do have the option to queue them in a specific order. – Kelvin Lawrence Dec 14 '21 at 16:33
  • I do not have any specific order, and I am loading ntriples. – Jigar Gajjar Dec 14 '21 at 16:55

In addition to the above, gzip-compressing the files would make the network reads faster; Neptune understands gzip-compressed files by default. You can also set queueRequest: "TRUE" to achieve better results. Neptune can queue up to 64 load requests, so instead of sending a single request you can submit requests for multiple files and let them run through the queue, and you can even configure dependencies among the files if you have to. Ref: https://docs.aws.amazon.com/neptune/latest/userguide/load-api-reference-load.html

You need to move to a bigger writer instance only in cases where CPU usage is consistently higher than 60%.
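
For example, a sketch of queueing one request per file (the file name is a placeholder, and the dependencies field is only needed when one load must wait for others; it takes the load ids returned by the earlier requests):

curl -X POST -H 'Content-Type: application/json' https://neptune-hostname:8182/loader -d '
{
"source" : "s3://myntriplesfiles/ntriple-folder/file-0001.nq.gz",
"format" : "nquads",
"iamRoleArn" : "my aws arn values goes here",
"region" : "us-east-2",
"failOnError" : "FALSE",
"parallelism" : "HIGH",
"queueRequest" : "TRUE",
"dependencies" : ["load-id-returned-by-an-earlier-request"]
}'

Repeat the call for each file; Neptune queues up to 64 such requests and works through them.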

– Abhiram