My goal is to import 25M edges into a graph that has about 50M vertices. Target time:

The current import speed is ~150 edges/sec. Over a remote connection it was about 100 edges/sec.

  • extracted 20,694,336 rows (171 rows/sec) - 20,694,336 rows -> loaded 20,691,830 vertices (171 vertices/sec) Total time: 35989762ms [0 warnings, 4 errors]
  • extracted 20,694,558 rows (156 rows/sec) - 20,694,558 rows -> loaded 20,692,053 vertices (156 vertices/sec) Total time: 35991185ms [0 warnings, 4 errors]
  • extracted 20,694,745 rows (147 rows/sec) - 20,694,746 rows -> loaded 20,692,240 vertices (147 vertices/sec) Total time: 35992453ms [0 warnings, 4 errors]
  • extracted 20,694,973 rows (163 rows/sec) - 20,694,973 rows -> loaded 20,692,467 vertices (162 vertices/sec) Total time: 35993851ms [0 warnings, 4 errors]
  • extracted 20,695,179 rows (145 rows/sec) - 20,695,179 rows -> loaded 20,692,673 vertices (145 vertices/sec) Total time: 35995262ms [0 warnings, 4 errors]
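To make the numbers above easier to compare, here is a minimal Python sketch that pulls the row count and total time out of each log line and computes both the cumulative average and the rate between consecutive lines. The regex and the function names are my own and only assume the line format shown above; the rows/sec figure printed by ETL appears to be the rate over the most recent interval rather than the overall average.

```python
import re

# Log lines copied from the ETL output above.
LOG_LINES = [
    "extracted 20,694,336 rows (171 rows/sec) - 20,694,336 rows -> loaded 20,691,830 vertices (171 vertices/sec) Total time: 35989762ms [0 warnings, 4 errors]",
    "extracted 20,694,558 rows (156 rows/sec) - 20,694,558 rows -> loaded 20,692,053 vertices (156 vertices/sec) Total time: 35991185ms [0 warnings, 4 errors]",
    "extracted 20,694,745 rows (147 rows/sec) - 20,694,746 rows -> loaded 20,692,240 vertices (147 vertices/sec) Total time: 35992453ms [0 warnings, 4 errors]",
    "extracted 20,694,973 rows (163 rows/sec) - 20,694,973 rows -> loaded 20,692,467 vertices (162 vertices/sec) Total time: 35993851ms [0 warnings, 4 errors]",
    "extracted 20,695,179 rows (145 rows/sec) - 20,695,179 rows -> loaded 20,692,673 vertices (145 vertices/sec) Total time: 35995262ms [0 warnings, 4 errors]",
]

# Hypothetical pattern: first row count and the total elapsed milliseconds.
PATTERN = re.compile(r"extracted ([\d,]+) rows .*Total time: (\d+)ms")


def parse(line):
    """Return (rows_extracted, elapsed_ms) for one log line."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError("unrecognized log line: " + line)
    rows = int(match.group(1).replace(",", ""))
    elapsed_ms = int(match.group(2))
    return rows, elapsed_ms


def average_rate(line):
    """Cumulative rows/sec since the start of the import."""
    rows, elapsed_ms = parse(line)
    return rows / (elapsed_ms / 1000.0)


def interval_rates(lines):
    """Rows/sec between each pair of consecutive log lines."""
    parsed = [parse(line) for line in lines]
    rates = []
    for (rows_a, ms_a), (rows_b, ms_b) in zip(parsed, parsed[1:]):
        rates.append((rows_b - rows_a) / ((ms_b - ms_a) / 1000.0))
    return rates


if __name__ == "__main__":
    print("cumulative average: %.1f rows/sec" % average_rate(LOG_LINES[0]))
    for rate in interval_rates(LOG_LINES):
        print("interval rate: %.1f rows/sec" % rate)
```

Run on these lines, the cumulative average comes out to roughly 575 rows/sec, so the import started out noticeably faster than the ~150 rows/sec it is doing now.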

I tried enabling "parallel" in the ETL config, but it looks completely broken in OrientDB 2.2.12 (an inconsistency with the multi-threading changes in 2.1?) and gives me nothing but the 4 errors in the log above. A naive parallel mode (running 2+ ETL processes) is also impossible with a plocal connection.

My config:

{
"config": {
    "log": "info",
    "parallel": true
},
"source": {
    "input": {}
},
"extractor": {
    "row": {
        "multiLine": false
    }
},
"transformers": [
    {
        "code": {
            "language": "Javascript",
            "code": "(new com.orientechnologies.orient.core.record.impl.ODocument()).fromJSON(input);"
        }
    },
    {
        "merge": {
            "joinFieldName": "_ref",
            "lookup": "Company._ref"
        }
    },
    {
        "vertex": {
            "class": "Company",
            "skipDuplicates": true
        }
    },
    {
        "edge": {
            "joinFieldName": "with_id",
            "lookup": "Person._ref",
            "direction": "in",
            "class": "Stakeholder",
            "edgeFields": {
                "_ref": "${input._ref}",
                "value_of_share": "${input.value_of_share}"
            },
            "skipDuplicates": true,
            "unresolvedLinkAction": "ERROR"
        }
    },
    {
        "field": {
            "fieldNames": [
                "with_id",
                "with_to",
                "_type",
                "value_of_share"
            ],
            "operation": "remove"
        }
    }
],
"loader": {
    "orientdb": {
        "dbURL": "plocal:/mnt/disks/orientdb/orientdb-2.2.12/databases/df",
        "dbUser": "admin",
        "dbPassword": "admin",
        "dbAutoDropIfExists": false,
        "dbAutoCreate": false,
        "standardElementConstraints": false,
        "tx": false,
        "wal": false,
        "batchCommit": 1000,
        "dbType": "graph",
        "classes": [
            {
                "name": "Company",
                "extends": "V"
            },
            {
                "name": "Person",
                "extends": "V"
            },
            {
                "name": "Stakeholder",
                "extends": "E"
            }
        ]
    }
}
}

Data sample:

{"_ref":"1072308006473","with_to":"person","with_id":"010703814320","_type":"is.stakeholder","value_of_share":10000.0}
{"_ref":"1075837000095","with_to":"person","with_id":"583600656732","_type":"is.stakeholder","value_of_share":15925.0}
{"_ref":"1075837000095","with_to":"person","with_id":"583600851010","_type":"is.stakeholder","value_of_share":33150.0}
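To show the shape of this data, here is a small Python sketch that reads the sample as newline-delimited JSON (one object per row, which is an assumption consistent with "multiLine": false in the config) and groups the person ids by company "_ref". The helper name is hypothetical. It only illustrates that several rows can point at the same company, so each row costs a Company._ref lookup plus a Person._ref lookup before an edge is created.

```python
import json
from collections import defaultdict

# The three sample rows, one JSON object per line.
SAMPLE = """\
{"_ref":"1072308006473","with_to":"person","with_id":"010703814320","_type":"is.stakeholder","value_of_share":10000.0}
{"_ref":"1075837000095","with_to":"person","with_id":"583600656732","_type":"is.stakeholder","value_of_share":15925.0}
{"_ref":"1075837000095","with_to":"person","with_id":"583600851010","_type":"is.stakeholder","value_of_share":33150.0}
"""


def group_stakeholders(text):
    """Map each company _ref to the list of person ids (with_id) linked to it."""
    grouped = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        row = json.loads(line)
        grouped[row["_ref"]].append(row["with_id"])
    return dict(grouped)


if __name__ == "__main__":
    for company_ref, person_ids in group_stakeholders(SAMPLE).items():
        print(company_ref, "->", person_ids)
```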

Server specs: a Google Cloud instance with PD-SSD, 6 CPUs, 18 GB RAM.

Btw, on the same server I managed to get ~3k/sec when importing vertices over a remote connection (it is still too slow, but acceptable for my current dataset).

And the question: is there any reliable way to increase the import speed to, let's say, 10k inserts per second, or at least 5k? I wouldn't like to turn off indexes; it is still millions of records, not billions.
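For scale, a back-of-the-envelope sketch of what these rates mean for 25M edges, assuming the rate stays constant for the whole import (which the logs suggest it does not). The function name is purely illustrative.

```python
def hours_to_import(total_edges, edges_per_sec):
    """Hours needed to import total_edges at a constant edges_per_sec."""
    return total_edges / edges_per_sec / 3600.0


if __name__ == "__main__":
    total = 25_000_000
    # Current rate, and the two target rates from the question.
    for rate in (150, 5_000, 10_000):
        print("%6d edges/sec -> %.1f hours" % (rate, hours_to_import(total, rate)))
```

At the current ~150 edges/sec the import needs about 46 hours, versus roughly 1.4 hours at 5k/sec and about 0.7 hours at 10k/sec.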

UPDATE

After a few hours, the performance continues to deteriorate.

  • extracted 23,146,912 rows (56 rows/sec) - 23,146,912 rows -> loaded 23,144,406 vertices (56 vertices/sec) Total time: 60886967ms [0 warnings, 4 errors]
  • extracted 23,146,981 rows (69 rows/sec) - 23,146,981 rows -> loaded 23,144,475 vertices (69 vertices/sec) Total time: 60887967ms [0 warnings, 4 errors]
  • extracted 23,147,075 rows (39 rows/sec) - 23,147,075 rows -> loaded 23,144,570 vertices (39 vertices/sec) Total time: 60890356ms [0 warnings, 4 errors]
Eugene