
The NebulaGraph version is 3.4.0, deployed as a distributed cluster, installed via RPM, and already running in production. There are three machines, each with 24 CPU cores and 251 GB of memory.

The problem is that Spark reads the CSV data normally, but the import into NebulaGraph fails for large data volumes, while the same type of file imports without any problem at small data volumes.

There is no error output in the NebulaGraph service logs, and Spark has already read the data successfully:

 2023-06-21 09:15:29 [main] INFO  org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator - Code generated in 10.784938 ms
+-------------+-------------------+------+
|        srcId|              dstId|  name|
+-------------+-------------------+------+
|DLGB200406020|1669305794820182017|关键词|
|NRPJ201924137|1669305794820182018|关键词|
|ZRZK201708018|1669305794820182018|关键词|
|WFXY202005026|1669305794820182018|关键词|
|DDLY201805226|1669305794820182018|关键词|
|NRPJ202004149|1669305794820182018|关键词|
|TXWL202022095|1669305794820182018|关键词|
|SHLG202202017|1669305794820182018|关键词|
|QYZL201951056|1669305794820182018|关键词|
|XDBY201712089|1669305794820182018|关键词|
|ZXQX201807030|1669305794820182018|关键词|
|DYKJ201834091|1669305794820182018|关键词|
|DDLY201912060|1669305794820182018|关键词|
|JMSJ201812256|1669305794820182018|关键词|
|CUYN201912046|1669305794820182018|关键词|
|SJSM201806254|1669305794820182018|关键词|
|WHYK201810064|1669305794820182018|关键词|
|SXZX200405018|1669305794820182019|关键词|
|GSKJ200711116|1669305794820182020|关键词|
|NMGS2004S2040|1669305794820182021|关键词|
+-------------+-------------------+------+
only showing top 20 rows

The error occurs when writing to NebulaGraph (the `name` column above contains the Chinese word 关键词, "keyword"):

2023-06-21 09:25:20 [task-result-getter-0] WARN  org.apache.spark.scheduler.TaskSetManager - Lost task 26.0 in stage 4.0 (TID 30) (10.27.107.33 executor 4): java.lang.NullPointerException: Cannot invoke "Object.toString()" because the return value of "org.apache.spark.sql.catalyst.InternalRow.get(int, org.apache.spark.sql.types.DataType)" is null
        at com.vesoft.nebula.connector.writer.NebulaExecutor$.extraID(NebulaExecutor.scala:57)
        at com.vesoft.nebula.connector.writer.NebulaEdgeWriter.write(NebulaEdgeWriter.scala:57)
        at com.vesoft.nebula.connector.writer.NebulaEdgeWriter.write(NebulaEdgeWriter.scala:17)
        at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.$anonfun$run$1(WriteToDataSourceV2Exec.scala:442)
        at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1538)
        at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2Exec.scala:480)
        at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.$anonfun$writeWithV2$2(WriteToDataSourceV2Exec.scala:381)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:136)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.base/java.lang.Thread.run(Thread.java:833)
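The `NullPointerException` in `NebulaExecutor.extraID` means `InternalRow.get` returned null when the connector extracted the edge ID, i.e. some rows in the large file appear to have a null `srcId` or `dstId` (for example a malformed or blank line that only exists in the bigger file). A minimal sketch of the check I think should isolate this, assuming `df` is the DataFrame shown above (the column names match my schema; the helper name is my own):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Sketch: count and inspect rows with a null srcId/dstId, then drop them
// before handing the DataFrame to the NebulaGraph Spark connector.
def dropNullIds(df: DataFrame): DataFrame = {
  val bad = df.filter(col("srcId").isNull || col("dstId").isNull)
  println(s"rows with null srcId/dstId: ${bad.count()}")
  bad.show(20, truncate = false) // inspect the offending source lines
  df.filter(col("srcId").isNotNull && col("dstId").isNotNull)
}
```

If the count is non-zero, the failure is in the input data rather than in NebulaGraph itself; it may also be worth re-reading the CSV with explicit options (quote/escape handling, `mode`) since large files are more likely to contain a stray malformed record.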