I'm trying to run my Giraph-based application on a Hadoop cluster through YARN. The command I use is
yarn jar solver-1.0-SNAPSHOT.jar edu.agh.iga.adi.giraph.IgaSolverTool
First I need to copy that JAR to one of the directories that are reported when issuing yarn classpath
. Just to be sure, changing file privileges to 777.
I obviously need to ship that JAR to the workers so I do:
conf.setYarnLibJars(currentJar());
In the code where currentJar()
is:
private static String currentJar() {
return new File(IgaGiraphJobFactory.class.getProtectionDomain()
.getCodeSource()
.getLocation()
.getPath()).getName();
}
This users the JAR name which seems to be fine as the application no longer crashes fast (if anything else was used it would). Instead, it takes around 10 minutes after which a failure is reported. There is an error in the logs:
LogType:gam-stderr.log
LogLastModifiedTime:Sat Sep 14 13:24:52 +0000 2019
LogLength:2122
LogContents:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/hadoop/yarn/nm-local-dir/usercache/kbhit/appcache/application_1568451681492_0016/filecache/11/solver-1.0-SNAPSHOT.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Exception in thread "pool-6-thread-2" java.lang.IllegalStateException: Could not configure the containerlaunch context for GiraphYarnTasks.
at org.apache.giraph.yarn.GiraphApplicationMaster.getTaskResourceMap(GiraphApplicationMaster.java:391)
at org.apache.giraph.yarn.GiraphApplicationMaster.access$500(GiraphApplicationMaster.java:78)
at org.apache.giraph.yarn.GiraphApplicationMaster$LaunchContainerRunnable.buildContainerLaunchContext(GiraphApplicationMaster.java:522)
at org.apache.giraph.yarn.GiraphApplicationMaster$LaunchContainerRunnable.run(GiraphApplicationMaster.java:479)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.FileNotFoundException: File does not exist: hdfs://iga-adi-m/user/yarn/giraph_yarn_jar_cache/application_1568451681492_0016/solver-1.0-SNAPSHOT.jar
at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1533)
at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1526)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1541)
at org.apache.giraph.yarn.YarnUtils.addFileToResourceMap(YarnUtils.java:153)
at org.apache.giraph.yarn.YarnUtils.addFsResourcesToMap(YarnUtils.java:77)
at org.apache.giraph.yarn.GiraphApplicationMaster.getTaskResourceMap(GiraphApplicationMaster.java:387)
... 6 more
End of LogType:gam-stderr.log.This log file belongs to a running container (container_1568451681492_0016_01_000001) and so may not be complete.
Which causes class not found errors (GiraphYarnTask) in the worker containers.
Seems that for some reason the JAR doesn't get transferred to HDFS along with the config (which is). What might be the reason for that?
Also, it seems that the JAR is getting sent
1492_0021/solver-1.0-SNAPSHOT.jar, packetSize=65016, chunksPerPacket=126, bytesCurBlock=73672704
2019-09-14 14:08:26,252 DEBUG [DFSOutputStream] - enqueue full packet seqno: 1142 offsetInBlock: 73672704 lastPacketInBlock: false lastByteOffsetInBlock: 73737216, src=/user/kbhit/giraph_yarn_jar_cache/application_1568451681492_0021/solver-1.0-SNAPSHOT.jar, bytesCurBlock=73737216, blockSize=134217728, appendChunk=false, blk_1073741905_1081@[DatanodeInfoWithStorage[10.164.0.6:9866,DS-2d8f815f-1e64-4a7f-bbf6-0c91ebc613d7,DISK], DatanodeInfoWithStorage[10.164.0.7:9866,DS-6a606f45-ffb7-449f-ab8b-57d5950d5172,DISK]]
2019-09-14 14:08:26,252 DEBUG [DataStreamer] - Queued packet 1142
2019-09-14 14:08:26,253 DEBUG [DataStreamer] - DataStreamer block BP-308761091-10.164.0.5-1568451675362:blk_1073741905_1081 sending packet packet seqno: 1142 offsetInBlock: 73672704 lastPacketInBlock: false lastByteOffsetInBlock: 73737216
2019-09-14 14:08:26,253 DEBUG [DFSClient] - computePacketChunkSize: src=/user/kbhit/giraph_yarn_jar_cache/application_1568451681492_0021/solver-1.0-SNAPSHOT.jar, chunkSize=516, chunksPerPacket=126, packetSize=65016
2019-09-14 14:08:26,253 DEBUG [DFSClient] - DFSClient writeChunk allocating new packet seqno=1143, src=/user/kbhit/giraph_yarn_jar_cache/application_1568451681492_0021/solver-1.0-SNAPSHOT.jar, packetSize=65016, chunksPerPacket=126, bytesCurBlock=73737216
2019-09-14 14:08:26,253 DEBUG [DataStreamer] - DFSClient seqno: 1141 reply: SUCCESS reply: SUCCESS downstreamAckTimeNanos: 323347 flag: 0 flag: 0
2019-09-14 14:08:26,253 DEBUG [DataStreamer] - DFSClient seqno: 1142 reply: SUCCESS reply: SUCCESS downstreamAckTimeNanos: 326916 flag: 0 flag: 0
2019-09-14 14:08:26,254 DEBUG [DataStreamer] - Queued packet 1143
2019-09-14 14:08:26,256 DEBUG [DataStreamer] - DataStreamer block BP-308761091-10.164.0.5-1568451675362:blk_1073741905_1081 sending packet packet seqno: 1143 offsetInBlock: 73737216 lastPacketInBlock: false lastByteOffsetInBlock: 73771432
2019-09-14 14:08:26,256 DEBUG [DataStreamer] - Queued packet 1144
2019-09-14 14:08:26,257 DEBUG [DataStreamer] - Waiting for ack for: 1144
2019-09-14 14:08:26,257 DEBUG [DataStreamer] - DFSClient seqno: 1143 reply: SUCCESS reply: SUCCESS downstreamAckTimeNanos: 497613 flag: 0 flag: 0
2019-09-14 14:08:26,257 DEBUG [DataStreamer] - DataStreamer block BP-308761091-10.164.0.5-1568451675362:blk_1073741905_1081 sending packet packet seqno: 1144 offsetInBlock: 73771432 lastPacketInBlock: true lastByteOffsetInBlock: 73771432
2019-09-14 14:08:26,263 DEBUG [DataStreamer] - DFSClient seqno: 1144 reply: SUCCESS reply: SUCCESS downstreamAckTimeNanos: 2406978 flag: 0 flag: 0
2019-09-14 14:08:26,263 DEBUG [DataStreamer] - Closing old block BP-308761091-10.164.0.5-1568451675362:blk_1073741905_1081
2019-09-14 14:08:26,264 DEBUG [Client] - IPC Client (743080989) connection to iga-adi-m/10.164.0.5:8020 from kbhit sending #12 org.apache.hadoop.hdfs.protocol.ClientProtocol.complete
2019-09-14 14:08:26,266 DEBUG [Client] - IPC Client (743080989) connection to iga-adi-m/10.164.0.5:8020 from kbhit got value #12
2019-09-14 14:08:26,267 DEBUG [ProtobufRpcEngine] - Call: complete took 4ms
2019-09-14 14:08:26,267 DEBUG [Client] - IPC Client (743080989) connection to iga-adi-m/10.164.0.5:8020 from kbhit sending #13 org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo
2019-09-14 14:08:26,268 DEBUG [Client] - IPC Client (743080989) connection to iga-adi-m/10.164.0.5:8020 from kbhit got value #13
2019-09-14 14:08:26,268 DEBUG [ProtobufRpcEngine] - Call: getFileInfo took 1ms
2019-09-14 14:08:26,269 INFO [YarnUtils] - Registered file in LocalResources :: hdfs://iga-adi-m/user/kbhit/giraph_yarn_jar_cache/application_1568451681492_0021/solver-1.0-SNAPSHOT.jar
but once I inspect the contents it's empty
2019-09-14 14:16:42,795 DEBUG [ProtobufRpcEngine] - Call: getListing took 6ms
Found 1 items
-rw-r--r-- 2 yarn hadoop 187800 2019-09-14 14:08 hdfs://iga-adi-m/user/yarn/giraph_yarn_jar_cache/application_1568451681492_0021/giraph-conf.xml
Meanwhile, if I just copy manually the jar to that directory (predicting it's name) everything works as expected. What is wrong?
I think it might be connected to this GIRAPH-859