I cloned the Hive Testbench to run the Hive benchmarks on a Hadoop cluster built from the Apache binary distributions of Hadoop 2.9.0, Hive 2.3.0, and Tez 0.9.0.

I managed to build both data generators, TPC-H and TPC-DS. However, the next step, data generation, fails for both. The failure is completely consistent: every run fails at exactly the same step and produces the same error messages.
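
For reference, the build step was just the two build scripts shipped with the testbench (script names as I remember them from the hive-testbench repository):

$ ./tpch-build.sh
$ ./tpcds-build.sh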

For TPC-H, here is the screen output from data generation:

$ ./tpch-setup.sh 10
ls: `/tmp/tpch-generate/10/lineitem': No such file or directory
Generating data at scale factor 10.
...
18/01/02 14:43:00 INFO mapreduce.Job: Running job: job_1514226810133_0050
18/01/02 14:43:01 INFO mapreduce.Job: Job job_1514226810133_0050 running in uber mode : false
18/01/02 14:43:01 INFO mapreduce.Job:  map 0% reduce 0%
18/01/02 14:44:38 INFO mapreduce.Job:  map 10% reduce 0%
18/01/02 14:44:39 INFO mapreduce.Job:  map 20% reduce 0%
18/01/02 14:44:46 INFO mapreduce.Job:  map 30% reduce 0%
18/01/02 14:44:48 INFO mapreduce.Job:  map 40% reduce 0%
18/01/02 14:44:58 INFO mapreduce.Job:  map 70% reduce 0%
18/01/02 14:45:14 INFO mapreduce.Job:  map 80% reduce 0%
18/01/02 14:45:15 INFO mapreduce.Job:  map 90% reduce 0%
18/01/02 14:45:23 INFO mapreduce.Job:  map 100% reduce 0%
18/01/02 14:45:23 INFO mapreduce.Job: Job job_1514226810133_0050 completed successfully
18/01/02 14:45:23 INFO mapreduce.Job: Counters: 0
SLF4J: Class path contains multiple SLF4J bindings.
...
ls: `/tmp/tpch-generate/10/lineitem': No such file or directory
Data generation failed, exiting.
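
The MapReduce job reports success, but with zero counters and no output written. If it helps with diagnosis, the container logs for the run above can be pulled with, for example (application id taken from the job id in the output):

$ yarn logs -applicationId application_1514226810133_0050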

For TPC-DS, here are the error messages:

$ ./tpcds-setup.sh 10
...
18/01/02 22:13:58 INFO Configuration.deprecation: mapred.task.timeout is deprecated. Instead, use mapreduce.task.timeout
18/01/02 22:13:58 INFO client.RMProxy: Connecting to ResourceManager at /192.168.10.15:8032
18/01/02 22:13:59 INFO input.FileInputFormat: Total input files to process : 1
18/01/02 22:13:59 INFO mapreduce.JobSubmitter: number of splits:10
18/01/02 22:13:59 INFO Configuration.deprecation: io.sort.mb is deprecated. Instead, use mapreduce.task.io.sort.mb
18/01/02 22:13:59 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
18/01/02 22:13:59 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1514226810133_0082
18/01/02 22:14:00 INFO client.YARNRunner: Number of stages: 1
18/01/02 22:14:00 INFO Configuration.deprecation: mapred.job.map.memory.mb is deprecated. Instead, use mapreduce.map.memory.mb
18/01/02 22:14:00 INFO client.TezClient: Tez Client Version: [ component=tez-api, version=0.9.0, revision=0873a0118a895ca84cbdd221d8ef56fedc4b43d0, SCM-URL=scm:git:https://git-wip-us.apache.org/repos/asf/tez.git, buildTime=2017-07-18T05:41:23Z ]
18/01/02 22:14:00 INFO client.RMProxy: Connecting to ResourceManager at /192.168.10.15:8032
18/01/02 22:14:00 INFO client.TezClient: Submitting DAG application with id: application_1514226810133_0082
18/01/02 22:14:00 INFO client.TezClientUtils: Using tez.lib.uris value from configuration: hdfs://192.168.10.15:8020/apps/tez,hdfs://192.168.10.15:8020/apps/tez/lib/
18/01/02 22:14:00 INFO client.TezClientUtils: Using tez.lib.uris.classpath value from configuration: null
18/01/02 22:14:00 INFO client.TezClient: Tez system stage directory hdfs://192.168.10.15:8020/tmp/hadoop-yarn/staging/rapids/.staging/job_1514226810133_0082/.tez/application_1514226810133_0082 doesn't exist and is created
18/01/02 22:14:01 INFO client.TezClient: Submitting DAG to YARN, applicationId=application_1514226810133_0082, dagName=GenTable+all_10
18/01/02 22:14:01 INFO impl.YarnClientImpl: Submitted application application_1514226810133_0082
18/01/02 22:14:01 INFO client.TezClient: The url to track the Tez AM: http://boray05:8088/proxy/application_1514226810133_0082/
18/01/02 22:14:05 INFO client.RMProxy: Connecting to ResourceManager at /192.168.10.15:8032
18/01/02 22:14:05 INFO mapreduce.Job: The url to track the job: http://boray05:8088/proxy/application_1514226810133_0082/
18/01/02 22:14:05 INFO mapreduce.Job: Running job: job_1514226810133_0082
18/01/02 22:14:06 INFO mapreduce.Job: Job job_1514226810133_0082 running in uber mode : false
18/01/02 22:14:06 INFO mapreduce.Job:  map 0% reduce 0%
18/01/02 22:15:51 INFO mapreduce.Job:  map 10% reduce 0%
18/01/02 22:15:54 INFO mapreduce.Job:  map 20% reduce 0%
18/01/02 22:15:55 INFO mapreduce.Job:  map 40% reduce 0%
18/01/02 22:15:56 INFO mapreduce.Job:  map 50% reduce 0%
18/01/02 22:16:07 INFO mapreduce.Job:  map 60% reduce 0%
18/01/02 22:16:09 INFO mapreduce.Job:  map 70% reduce 0%
18/01/02 22:16:11 INFO mapreduce.Job:  map 80% reduce 0%
18/01/02 22:16:19 INFO mapreduce.Job:  map 90% reduce 0%
18/01/02 22:19:54 INFO mapreduce.Job:  map 100% reduce 0%
18/01/02 22:19:54 INFO mapreduce.Job: Job job_1514226810133_0082 completed successfully
18/01/02 22:19:54 INFO mapreduce.Job: Counters: 0
...
TPC-DS text data generation complete.
Loading text data into external tables.
Optimizing table time_dim (2/24).
Optimizing table date_dim (1/24).
Optimizing table item (3/24).
Optimizing table customer (4/24).
Optimizing table household_demographics (6/24).
Optimizing table customer_demographics (5/24).
Optimizing table customer_address (7/24).
Optimizing table store (8/24).
Optimizing table promotion (9/24).
Optimizing table warehouse (10/24).
Optimizing table ship_mode (11/24).
Optimizing table reason (12/24).
Optimizing table income_band (13/24).
Optimizing table call_center (14/24).
Optimizing table web_page (15/24).
Optimizing table catalog_page (16/24).
Optimizing table web_site (17/24).
make: *** [store_sales] Error 2
make: *** Waiting for unfinished jobs....
make: *** [store_returns] Error 2
Data loaded into database tpcds_bin_partitioned_orc_10.

I noticed that the target temporary HDFS directory, both while the job is running and after the failure, is always empty except for the generated sub-directories.
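
For example, this is roughly how I check the TPC-H target directory (path as used by the run above):

$ hdfs dfs -ls -R /tmp/tpch-generate/10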

At this point I cannot even tell whether the failure is due to a Hadoop configuration issue, mismatched software versions, or something else. Any help?

robert

1 Answer

I had a similar issue when running this job. When I passed the script an HDFS location that I had permission to write to, it completed successfully.

./tpcds-setup.sh 10 <hdfs_directory_path>
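
For example (the directory below is just a placeholder; use any HDFS path your user has write access to):

$ hdfs dfs -mkdir -p /user/$USER/tpcds-generate
$ ./tpcds-setup.sh 10 /user/$USER/tpcds-generate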

I still get this error when the script kicks off:

Data loaded into database tpcds_bin_partitioned_orc_10.
ls: `<hdfs_directory_path>/10': No such file or directory

However, the script runs to completion, and the data is generated and loaded into the Hive tables at the end.
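
One quick way to verify is to query the database reported by the script (store_sales here is just one of the TPC-DS tables):

$ hive -e "use tpcds_bin_partitioned_orc_10; show tables; select count(*) from store_sales;"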

Hope that helps.

  • Your error will appear if the data staging directory does not exist when the script is executed; the script creates the directory for you after reporting this error. What type of HDFS do you use? Is it HDP or Apache? Do you see the TPC-DS ORC tables created inside Hive with data populated? – robert Feb 23 '18 at 02:21