
I'm trying to run my Giraph-based application on a Hadoop cluster through YARN. The command I use is:

yarn jar solver-1.0-SNAPSHOT.jar edu.agh.iga.adi.giraph.IgaSolverTool

First I need to copy that JAR to one of the directories reported by yarn classpath. Just to be sure, I also change the file permissions to 777.

I obviously need to ship that JAR to the workers, so I do:

conf.setYarnLibJars(currentJar());

That call is in my code, where currentJar() is defined as:

  // Returns the file name (not the full path) of the JAR that contains IgaGiraphJobFactory.
  private static String currentJar() {
    return new File(IgaGiraphJobFactory.class.getProtectionDomain()
        .getCodeSource()
        .getLocation()
        .getPath()).getName();
  }

This uses the JAR name, which seems to be fine, as the application no longer crashes right away (with anything else it would). Instead, it runs for around 10 minutes, after which a failure is reported. There is an error in the logs:

LogType:gam-stderr.log
LogLastModifiedTime:Sat Sep 14 13:24:52 +0000 2019
LogLength:2122
LogContents:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/hadoop/yarn/nm-local-dir/usercache/kbhit/appcache/application_1568451681492_0016/filecache/11/solver-1.0-SNAPSHOT.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Exception in thread "pool-6-thread-2" java.lang.IllegalStateException: Could not configure the containerlaunch context for GiraphYarnTasks.
    at org.apache.giraph.yarn.GiraphApplicationMaster.getTaskResourceMap(GiraphApplicationMaster.java:391)
    at org.apache.giraph.yarn.GiraphApplicationMaster.access$500(GiraphApplicationMaster.java:78)
    at org.apache.giraph.yarn.GiraphApplicationMaster$LaunchContainerRunnable.buildContainerLaunchContext(GiraphApplicationMaster.java:522)
    at org.apache.giraph.yarn.GiraphApplicationMaster$LaunchContainerRunnable.run(GiraphApplicationMaster.java:479)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.FileNotFoundException: File does not exist: hdfs://iga-adi-m/user/yarn/giraph_yarn_jar_cache/application_1568451681492_0016/solver-1.0-SNAPSHOT.jar
    at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1533)
    at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1526)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1541)
    at org.apache.giraph.yarn.YarnUtils.addFileToResourceMap(YarnUtils.java:153)
    at org.apache.giraph.yarn.YarnUtils.addFsResourcesToMap(YarnUtils.java:77)
    at org.apache.giraph.yarn.GiraphApplicationMaster.getTaskResourceMap(GiraphApplicationMaster.java:387)
    ... 6 more
End of LogType:gam-stderr.log.This log file belongs to a running container (container_1568451681492_0016_01_000001) and so may not be complete.

This causes class-not-found errors (GiraphYarnTask) in the worker containers.

It seems that for some reason the JAR doesn't get transferred to HDFS along with the config (which does get transferred). What might be the reason for that?

Also, the client logs suggest that the JAR is being sent:

1492_0021/solver-1.0-SNAPSHOT.jar, packetSize=65016, chunksPerPacket=126, bytesCurBlock=73672704
2019-09-14 14:08:26,252 DEBUG [DFSOutputStream] - enqueue full packet seqno: 1142 offsetInBlock: 73672704 lastPacketInBlock: false lastByteOffsetInBlock: 73737216, src=/user/kbhit/giraph_yarn_jar_cache/application_1568451681492_0021/solver-1.0-SNAPSHOT.jar, bytesCurBlock=73737216, blockSize=134217728, appendChunk=false, blk_1073741905_1081@[DatanodeInfoWithStorage[10.164.0.6:9866,DS-2d8f815f-1e64-4a7f-bbf6-0c91ebc613d7,DISK], DatanodeInfoWithStorage[10.164.0.7:9866,DS-6a606f45-ffb7-449f-ab8b-57d5950d5172,DISK]]
2019-09-14 14:08:26,252 DEBUG [DataStreamer] - Queued packet 1142
2019-09-14 14:08:26,253 DEBUG [DataStreamer] - DataStreamer block BP-308761091-10.164.0.5-1568451675362:blk_1073741905_1081 sending packet packet seqno: 1142 offsetInBlock: 73672704 lastPacketInBlock: false lastByteOffsetInBlock: 73737216
2019-09-14 14:08:26,253 DEBUG [DFSClient] - computePacketChunkSize: src=/user/kbhit/giraph_yarn_jar_cache/application_1568451681492_0021/solver-1.0-SNAPSHOT.jar, chunkSize=516, chunksPerPacket=126, packetSize=65016
2019-09-14 14:08:26,253 DEBUG [DFSClient] - DFSClient writeChunk allocating new packet seqno=1143, src=/user/kbhit/giraph_yarn_jar_cache/application_1568451681492_0021/solver-1.0-SNAPSHOT.jar, packetSize=65016, chunksPerPacket=126, bytesCurBlock=73737216
2019-09-14 14:08:26,253 DEBUG [DataStreamer] - DFSClient seqno: 1141 reply: SUCCESS reply: SUCCESS downstreamAckTimeNanos: 323347 flag: 0 flag: 0
2019-09-14 14:08:26,253 DEBUG [DataStreamer] - DFSClient seqno: 1142 reply: SUCCESS reply: SUCCESS downstreamAckTimeNanos: 326916 flag: 0 flag: 0
2019-09-14 14:08:26,254 DEBUG [DataStreamer] - Queued packet 1143
2019-09-14 14:08:26,256 DEBUG [DataStreamer] - DataStreamer block BP-308761091-10.164.0.5-1568451675362:blk_1073741905_1081 sending packet packet seqno: 1143 offsetInBlock: 73737216 lastPacketInBlock: false lastByteOffsetInBlock: 73771432
2019-09-14 14:08:26,256 DEBUG [DataStreamer] - Queued packet 1144
2019-09-14 14:08:26,257 DEBUG [DataStreamer] - Waiting for ack for: 1144
2019-09-14 14:08:26,257 DEBUG [DataStreamer] - DFSClient seqno: 1143 reply: SUCCESS reply: SUCCESS downstreamAckTimeNanos: 497613 flag: 0 flag: 0
2019-09-14 14:08:26,257 DEBUG [DataStreamer] - DataStreamer block BP-308761091-10.164.0.5-1568451675362:blk_1073741905_1081 sending packet packet seqno: 1144 offsetInBlock: 73771432 lastPacketInBlock: true lastByteOffsetInBlock: 73771432
2019-09-14 14:08:26,263 DEBUG [DataStreamer] - DFSClient seqno: 1144 reply: SUCCESS reply: SUCCESS downstreamAckTimeNanos: 2406978 flag: 0 flag: 0
2019-09-14 14:08:26,263 DEBUG [DataStreamer] - Closing old block BP-308761091-10.164.0.5-1568451675362:blk_1073741905_1081
2019-09-14 14:08:26,264 DEBUG [Client] - IPC Client (743080989) connection to iga-adi-m/10.164.0.5:8020 from kbhit sending #12 org.apache.hadoop.hdfs.protocol.ClientProtocol.complete
2019-09-14 14:08:26,266 DEBUG [Client] - IPC Client (743080989) connection to iga-adi-m/10.164.0.5:8020 from kbhit got value #12
2019-09-14 14:08:26,267 DEBUG [ProtobufRpcEngine] - Call: complete took 4ms
2019-09-14 14:08:26,267 DEBUG [Client] - IPC Client (743080989) connection to iga-adi-m/10.164.0.5:8020 from kbhit sending #13 org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo
2019-09-14 14:08:26,268 DEBUG [Client] - IPC Client (743080989) connection to iga-adi-m/10.164.0.5:8020 from kbhit got value #13
2019-09-14 14:08:26,268 DEBUG [ProtobufRpcEngine] - Call: getFileInfo took 1ms
2019-09-14 14:08:26,269 INFO  [YarnUtils] - Registered file in LocalResources :: hdfs://iga-adi-m/user/kbhit/giraph_yarn_jar_cache/application_1568451681492_0021/solver-1.0-SNAPSHOT.jar

but when I list the directory, the JAR is missing and only the config is there:

2019-09-14 14:16:42,795 DEBUG [ProtobufRpcEngine] - Call: getListing took 6ms
Found 1 items
-rw-r--r--   2 yarn hadoop     187800 2019-09-14 14:08 hdfs://iga-adi-m/user/yarn/giraph_yarn_jar_cache/application_1568451681492_0021/giraph-conf.xml

Meanwhile, if I just manually copy the JAR to that directory (predicting its name), everything works as expected. What is wrong?

I think it might be connected to GIRAPH-859.

kboom

2 Answers

It seems that even though the Giraph maintainers claim it can run in YARN mode, this is not really true. There are a number of bugs that make it difficult unless you know the root cause, as in this case.

The cause here is that when Giraph sends the JARs to HDFS, from where they should be accessible to the workers, it uses one location to upload and another to download (the submitting user's HDFS home directory versus the yarn user's), so the workers cannot find the file. This happens if we launch the application as a user other than yarn, which is probably fairly common.
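
To make the mismatch concrete, here is a small sketch that rebuilds the two paths from the logs in the question. It only mirrors the directory layout visible in those logs; the class and the cachePath helper are hypothetical names for illustration and are not Giraph's actual code.

```java
/** Illustration only: mirrors the path layout seen in the logs, not Giraph's implementation. */
final class JarCachePathDemo {

  // Hypothetical helper: builds /user/<user>/giraph_yarn_jar_cache/<appId>/<fileName>.
  static String cachePath(String user, String appId, String fileName) {
    return "/user/" + user + "/giraph_yarn_jar_cache/" + appId + "/" + fileName;
  }

  public static void main(String[] args) {
    String appId = "application_1568451681492_0021";
    String jar = "solver-1.0-SNAPSHOT.jar";

    // The client (running as kbhit) writes the JAR under its own home directory...
    String uploaded = cachePath("kbhit", appId, jar);
    // ...while the application master (running as yarn) looks under yarn's home directory.
    String lookedUp = cachePath("yarn", appId, jar);

    System.out.println("uploaded to:  " + uploaded);
    System.out.println("looked up at: " + lookedUp);
    System.out.println("same location? " + uploaded.equals(lookedUp));
  }
}
```

When both user names are the same (for example, launching as yarn), the two paths coincide, which is consistent with the first workaround below.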

There are 3 workarounds; none is ideal (and some might not be applicable):

  • run the application as the yarn user
  • upload the JARs manually before each computation (make sure you upload to the directory of the new job, i.e. increment the job number, and remember that you have to create that directory first)
  • apply this patch and build against this version of Giraph

I tested all three; all work.
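
For the manual-upload workaround, the target directory has to be predicted before the job is submitted. A minimal sketch of that prediction, assuming application IDs have the form application_<clusterTimestamp>_<sequence> with a zero-padded sequence (as in the logs above) and that nobody else submits an application in between. The class and method names are made up for illustration.

```java
/** Sketch: predict the jar cache directory of the next application (hypothetical helper). */
final class NextJarCacheDir {

  // Assumes IDs like application_1568451681492_0016: increments the trailing sequence number.
  static String nextApplicationId(String lastAppId) {
    int cut = lastAppId.lastIndexOf('_');
    String prefix = lastAppId.substring(0, cut);
    int sequence = Integer.parseInt(lastAppId.substring(cut + 1));
    // Keep the sequence zero-padded to at least four digits, as seen in the logs.
    return String.format("%s_%04d", prefix, sequence + 1);
  }

  // Directory where the application master looked for the JAR in the question's logs.
  static String jarCacheDir(String amUser, String appId) {
    return "/user/" + amUser + "/giraph_yarn_jar_cache/" + appId;
  }

  public static void main(String[] args) {
    String next = nextApplicationId("application_1568451681492_0016");
    // Create this directory in HDFS and copy the JAR into it before launching the job.
    System.out.println(jarCacheDir("yarn", next));
  }
}
```

The prediction is fragile by nature: if any other application is submitted to the cluster first, the sequence number will be off and the upload lands in the wrong directory.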

kboom

I got a similar error:

    20/03/04 09:40:10 ERROR yarn.GiraphYarnTask: GiraphYarnTask threw a top-level exception, failing task
java.lang.RuntimeException: run() caught an unrecoverable IOException.
    at org.apache.giraph.yarn.GiraphYarnTask.run(GiraphYarnTask.java:97)
    at org.apache.giraph.yarn.GiraphYarnTask.main(GiraphYarnTask.java:183)
Caused by: java.io.FileNotFoundException: File hdfs://localhost:9000/user/schramml/_bsp/_defaultZkManagerDir/giraph_yarn_application_1583310839052_0001 does not exist.
    at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:993)
    at org.apache.hadoop.hdfs.DistributedFileSystem.access$1000(DistributedFileSystem.java:118)
    at org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1053)
    at org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1050)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:1050)
    at org.apache.giraph.zk.ZooKeeperManager.getServerListFile(ZooKeeperManager.java:346)
    at org.apache.giraph.zk.ZooKeeperManager.getZooKeeperServerList(ZooKeeperManager.java:376)
    at org.apache.giraph.zk.ZooKeeperManager.setup(ZooKeeperManager.java:190)
    at org.apache.giraph.graph.GraphTaskManager.startZooKeeperManager(GraphTaskManager.java:449)
    at org.apache.giraph.graph.GraphTaskManager.setup(GraphTaskManager.java:251)
    at org.apache.giraph.yarn.GiraphYarnTask.run(GiraphYarnTask.java:91)
    ... 1 more

But the reason in my case was that I used an aggregatorWriter and had to delete the Writer's file from the previous run. There was also a "file already exists" error in another container, but I found this question first, so maybe this information helps someone else.

Runms