
I am trying to create an ORC table in Hive by importing from a text file in HDFS. I have tried multiple different approaches and searched online for help, but regardless of what I do, the insert job never starts.

I can get the text file into HDFS and I can read it into a Hive text table, but I cannot convert that table to ORC.

I tried many different variations, including the one in this Hortonworks guide, which can serve as a reference for this question:

http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.0/bk_dataintegration/content/moving_data_from_hdfs_to_hive_external_table_method.html
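
For context, the pattern I'm following looks roughly like this (a minimal sketch; the column definitions are illustrative, but `cars` is my text staging table and `mycars` the ORC target):

```sql
-- Text-backed staging table over the file already loaded into HDFS
CREATE TABLE cars (make STRING, model STRING, year INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- ORC-backed target table
CREATE TABLE mycars (make STRING, model STRING, year INT)
STORED AS ORC;

-- This is the step whose job never starts
INSERT OVERWRITE TABLE mycars SELECT * FROM cars;
```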

I have a single-node HDP cluster (being used for development) - version:

HDP-2.3.2.0 (2.3.2.0-2950)

And here are the relevant service versions:

| Service | Version | Status | Description |
| --- | --- | --- | --- |
| HDFS | 2.7.1.2.3 | Installed | Apache Hadoop Distributed File System |
| MapReduce2 | 2.7.1.2.3 | Installed | Apache Hadoop NextGen MapReduce (YARN) |
| YARN | 2.7.1.2.3 | Installed | Apache Hadoop NextGen MapReduce (YARN) |
| Tez | 0.7.0.2.3 | Installed | Tez is the next generation Hadoop query processing framework, written on top of YARN |
| Hive | 1.2.1.2.3 | Installed | Data warehouse system for ad-hoc queries and analysis of large datasets, and table and storage management service |

Here is what happens when I run SQL like this (again, I've tried many variations, including examples taken directly from online tutorials):

```sql
INSERT OVERWRITE TABLE mycars SELECT * FROM cars;
```

My job stays like this:

```
Total number of applications (application-types: [] and states: [SUBMITTED, ACCEPTED, RUNNING]):1
Application-Id                  Application-Name                           Application-Type  User  Queue      State     Final-State  Progress  Tracking-URL
application_1455989658079_0002  HIVE-3f41161c-b806-4e7d-974e-c18e028d683f  TEZ               hive  root.hive  ACCEPTED  UNDEFINED    0%        N/A
```

And it just hangs there. (Literally: I've tried a 20-row sample table and let it run for hours before killing it.)

I am by no means a Hadoop expert (yet), and I'm sure it's probably a configuration issue, but I have been unable to figure it out.

All other Hive operations I've tried, such as creating and dropping tables, loading a file into a text table, and running selects, work fine. It's only when I insert into an ORC table that it hangs. And I need an ORC table for my requirement.

Any advice would be helpful.

Tom C
  • Any chance the target table is somehow *locked* by one of your previous attempts, hence blocking all further attempts? That could happen if a job is killed brutally and the locks are not cleared in the Metastore. Cf. the `show locks;` command at the Hive prompt. – Samson Scharfrichter Feb 21 '16 at 14:19
  • Ah, forget my previous comment. Your job is in ACCEPTED state in YARN, which means **not enough resources to start that job right now**. How much RAM is available for YARN jobs in your sandbox? And how much RAM is required per TEZ job (1 AppMaster + 1..N executors) cf. `hive-site.xml` and `tez-site.xml` with fallback to `yarn-site.xml` for defaults? Look into https://community.hortonworks.com/articles/14309/demystify-tez-tuning-step-by-step.html – Samson Scharfrichter Feb 21 '16 at 14:32
  • To investigate further, just disable TEZ and revert to MapReduce to see if the job will have enough resources to start (and how many Mappers can execute in parallel) with `set hive.execution.engine=mr;` – Samson Scharfrichter Feb 21 '16 at 14:36
  • It didn't occur to me to change the engine to MR (even as a test). That worked. Thanks! I know that's not the preferred method, but it will buy me some time to go through the tuning guide you mentioned and try to resolve the original problem. – Tom C Feb 21 '16 at 20:22
  • By the way, I know it's not table locks, because I have been dropping the destination table each time. I would be surprised if it is a true resource problem. Even though it's a dev box and not super powerful, I was testing so far with a max of 20 rows, so it's hard to believe that could affect anything. It must be configuration. The guide you sent should help, and that guide referenced a Hive tuning guide, so I will work through both of those. Thanks again! – Tom C Feb 21 '16 at 20:37
  • *"a true resource problem"* > YARN does not know how much data will be processed, and does not care. If `tez.am.resource.memory.mb` was mistakenly set to 16GB or some other goofy value, then you may not have enough RAM to allow the job to even initialize... – Samson Scharfrichter Feb 21 '16 at 21:24

1 Answer


Most of the time this comes down to increasing your YARN scheduling capacity, but if your resources are already capped, you can also reduce the amount of memory requested by individual Tez tasks by adjusting the following property in the Tez configuration:

`tez.task.resource.memory.mb`
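
For a small single-node dev box, that might look like this in `tez-site.xml` (a sketch; 512 MB is an illustrative value, not a recommendation, and the same change can be made through Ambari under the Tez service):

```xml
<!-- tez-site.xml: shrink per-task and AppMaster containers so they fit on a small node -->
<property>
  <name>tez.task.resource.memory.mb</name>
  <value>512</value>
</property>
<property>
  <name>tez.am.resource.memory.mb</name>
  <value>512</value>
</property>
```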

To increase the cluster's capacity, you can do it in the YARN configuration settings, or directly through Ambari or Cloudera Manager.
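
If you edit the files by hand instead, the relevant knobs live in `yarn-site.xml` (a sketch with illustrative values; YARN must be restarted for them to take effect):

```xml
<!-- yarn-site.xml: total memory YARN may allocate on this node -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>4096</value>
</property>
<!-- Largest single container the scheduler will grant -->
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>2048</value>
</property>
```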


To monitor what is happening under the hood, open the YARN ResourceManager UI and check the Diagnostics section of the specific application; it contains useful, explicit messages about resource allocation, especially when the job is accepted and stays pending.
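
The same diagnostics are also available from the command line, e.g. for the application ID shown in the question:

```
yarn application -status application_1455989658079_0002
```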


Mehdi LAMRANI