I intermittently get a FileNotFoundException when I run a Hive query on the Tez engine:
ERROR : Status: Failed
ERROR : Vertex failed, vertexName=Map 1, vertexId=vertex_1508808910527_45616_1_00, diagnostics=[Task failed, taskId=task_1508808910527_45616_1_00_000066, diagnostics=[TaskAttempt 0 failed, info=[Container container_e09_1508808910527_45616_01_000033 finished with diagnostics set to [Container failed, exitCode=-1000. File does not exist: hdfs://server02.corp.company.com:8020/tmp/hive/username/_tez_session_dir/b65ddde9-110e-47fc-ae1c-33a1f754f839/nzcodec.jar
java.io.FileNotFoundException: File does not exist: hdfs://server02.corp.company.com:8020/tmp/hive/username/_tez_session_dir/b65ddde9-110e-47fc-ae1c-33a1f754f839/nzcodec.jar
at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1309)
at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1317)
at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:253)
at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:63)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:361)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1692)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:359)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:62)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
The query selects data from a staging table, repartitions it and writes it to a reporting table.
INSERT OVERWRITE TABLE ${reporting_table} PARTITION (day, app_name)
SELECT <all the fields>
FROM ${staging_table}
WHERE day = '${day}'
The staged data is stored in Avro files and totals about 350 GB:
hadoop fs -du -h -s /staged-data/2017-11-02
350.7 G /staged-data/2017-11-02
I've run the same query on the same set of data multiple times and the failure is intermittent.
My YARN settings look like this:
yarn.nodemanager.resource.memory-mb 83968
yarn.scheduler.minimum-allocation-mb 2048
My Tez settings on the query look like this:
SET hive.execution.engine=tez;
SET tez.am.resource.memory.mb=2048;
SET hive.tez.container.size=2048;
SET hive.merge.tezfiles=true;
SET hive.merge.smallfiles.avgsize=128000000;
SET hive.merge.size.per.task=128000000;
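For context, here is a rough sanity check of how many Tez containers fit on one NodeManager with the memory values above. It is only a sketch: it looks at memory alone and ignores vcores, the AM container, and any other limits the scheduler applies. The variable names are mine, not Hadoop properties.

```shell
# Values copied from the YARN and Tez settings listed above.
nm_memory_mb=83968     # yarn.nodemanager.resource.memory-mb
container_mb=2048      # hive.tez.container.size

# Integer division: upper bound on 2048 MB containers per node (memory only).
containers_per_node=$(( nm_memory_mb / container_mb ))
echo "max containers per node (memory only): ${containers_per_node}"
```

So memory alone should not stop a node from hosting a few dozen containers at this size, which fits with container size changes not affecting the failure.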
I've worked through the suggestions on https://community.hortonworks.com/articles/14309/demystify-tez-tuning-step-by-step.html but I'm still seeing this problem. Adjusting the container size doesn't seem to help.
Is there another set of settings I can modify to prevent this?