
I have a Hadoop+Hive+Tez setup built from scratch (i.e. I deployed it component by component). Hive is configured to use Tez as its execution engine.

In its current state, Hive can access tables on HDFS, but it cannot access tables stored on MinIO (using the s3a filesystem implementation).

As the attached Tez UI screenshot shows, when executing SELECT COUNT(*) FROM s3_table:

  • the Tez execution is stuck forever,
  • Map 1 stays in the INITIALIZING state,
  • Map 1 always shows a total count of -1 and a pending count of -1 (why -1?).

Things already checked:

  • Hadoop can access MinIO/S3 without problems. For example, hdfs dfs -ls s3a://bucketname works well (the s3a settings behind this are sketched after this list).
  • Hive-on-Tez can compute against tables on HDFS, with mappers and reducers generated successfully and quickly.
  • Hive-on-MR can compute against tables on MinIO/S3 without problem.
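
For reference, the kind of s3a configuration this relies on sits in core-site.xml; the endpoint and credentials below are placeholders for illustration, not my actual values:

<!-- core-site.xml: illustrative s3a settings for a MinIO endpoint (placeholder values) -->
<property>
  <name>fs.s3a.endpoint</name>
  <value>http://minio-host:9000</value>
</property>
<property>
  <name>fs.s3a.access.key</name>
  <value>MINIO_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>MINIO_SECRET_KEY</value>
</property>
<property>
  <name>fs.s3a.path.style.access</name>
  <value>true</value>
</property>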

What could be the possible causes for this problem?

Attaching a Tez UI screenshot (Map 1 stuck in the INITIALIZING state).

Version information:

  • Hadoop 3.2.1
  • Hive 3.1.2
  • Tez 0.9.2
  • MinIO RELEASE.2020-01-25T02-50-51Z

2 Answers


It turned out that Tez's S3 support must be enabled explicitly at compile time. For Hadoop 2.8+, to enable S3 support, Tez must be compiled from source with the following command:

mvn clean package -DskipTests=true -Dmaven.javadoc.skip=true -Paws -Phadoop28 -P\!hadoop27

After that, upload the generated tez-x.y.z.tar.gz to HDFS and extract tez-x.y.z-minimal.tar.gz into $TEZ_LIB_DIR. With that in place it worked for me: Hive execution against MinIO/S3 runs smoothly.
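
In case it helps, the deployment step looked roughly like the sketch below; the HDFS path and the tez.lib.uris value are just my own layout, adjust them to yours:

# Upload the full tarball to HDFS so Tez can localize it on the cluster (example path)
hdfs dfs -mkdir -p /apps/tez
hdfs dfs -put tez-0.9.2.tar.gz /apps/tez/

# Extract the minimal tarball into the local Tez library directory used by Hive
tar -xzf tez-0.9.2-minimal.tar.gz -C $TEZ_LIB_DIR

# tez-site.xml then has to point tez.lib.uris at the uploaded tarball, e.g.:
#   <property>
#     <name>tez.lib.uris</name>
#     <value>${fs.defaultFS}/apps/tez/tez-0.9.2.tar.gz</value>
#   </property>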

However, the Tez installation guide doesn't mention anything about enabling S3 support, nor do the default Tez binary releases ship with S3 or Azure support.

The (hopefully) complete set of build options and pitfalls is actually documented in BUILDING.txt, which says:

However, to build against hadoop versions higher than 2.7.0, you will need to do the following:

For Hadoop version X where X >= 2.8.0

$ mvn package  -Dhadoop.version=${X} -Phadoop28 -P\!hadoop27

For recent versions of Hadoop (which do not bundle aws and azure by default), you can bundle AWS-S3 (2.7.0+) or Azure (2.7.0+) support:

$ mvn package -Dhadoop.version=${X} -Paws -Pazure
  • "you can bundle AWS-S3 (2.7.0+) or Azure (2.7.0+) support:". Don't mix any release of the hadoop-aws and hadoop-azure JARs with other hadoop-common JAR versions, or any AWS SDK other than what they shipped with. That way leads to stack traces. A full build is the way to go, and you have described it well. – stevel Apr 24 '20 at 17:48

My team faced a similar issue, but while reading from HDFS instead: the map phase was stuck forever at Initializing.

This may help somebody else facing a similar problem. In our case the Tez application master was running out of memory. Increasing the value below to 12 GB worked for us:

tez.am.resource.memory.mb
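
For example, 12 GB corresponds to 12288 MB, set either in tez-site.xml or per session from Hive (the right value depends on your workload and cluster):

<!-- tez-site.xml -->
<property>
  <name>tez.am.resource.memory.mb</name>
  <value>12288</value>
</property>

-- or from the Hive CLI / Beeline, for the current session only:
set tez.am.resource.memory.mb=12288;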
