I am trying to create a simple Oozie workflow with a Hive action. I'm using the Cloudera Quickstart VM (CDH 5.12). The following are the components of my workflow:

1) top_n_products.hql

-- Rank products by order count and keep the top N rows
create table instacart.top_n as
(
  select * from
  (
    -- Rank products by how often they were ordered, highest first
    select row_number() over (order by no_of_times_ordered desc) as num_rank,
           product_id, product_name, no_of_times_ordered
    from
    (
      -- Count orders per product, pulling in the product name
      select A.product_id, B.product_name, count(*) as no_of_times_ordered
      from instacart.order_products__train as A
      left outer join instacart.products as B
        on A.product_id = B.product_id
      group by A.product_id, B.product_name
    ) C
  ) D
  where num_rank <= ${N}
);

2) hive-config.xml

I copied the default hive-site.xml from /etc/hive/conf into my workflow workspace folder and renamed it to hive-config.xml. The key properties are:

<configuration>

  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://127.0.0.1/metastore?createDatabaseIfNotExist=true</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>cloudera</value>
  </property>

  <property>
    <name>hive.hwi.war.file</name>
    <value>/usr/lib/hive/lib/hive-hwi-0.8.1-cdh4.0.0.jar</value>
    <description>This is the WAR file with the jsp content for Hive Web Interface</description>
  </property>

  <property>
    <name>datanucleus.fixedDatastore</name>
    <value>true</value>
  </property>

  <property>
    <name>datanucleus.autoCreateSchema</name>
    <value>false</value>
  </property>

  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://127.0.0.1:9083</value>
    <description>IP address (or fully-qualified domain name) and port of the metastore host</description>
  </property>
</configuration>
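
For reference, when hive.metastore.uris points at a running metastore service, as on the Quickstart VM, Hive clients connect over Thrift and never read the javax.jdo JDBC settings; those are used by the metastore service itself. So a pared-down job XML along the following lines should also be enough for the action. A minimal sketch:

<configuration>
  <!-- The remote metastore URI is the only setting the Hive action's
       client side strictly needs; the JDBC properties above are read
       by the metastore service, not by clients. -->
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://127.0.0.1:9083</value>
  </property>
</configuration>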

3) Workflow properties

In the Hive action, I set the following (these settings map onto the generated XML as sketched below):

- Set the HIVE XML and Job XML paths to my hive-config.xml
- Added hive-config.xml to Files
- In the workflow properties, set the path to my workspace
- Defined the parameter N used in my query
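
For context, those Hue settings correspond roughly to this action fragment (a sketch only; the full generated XML is shown further below, and the job-tracker/name-node elements are omitted here):

<hive xmlns="uri:oozie:hive-action:0.2">
    <job-xml>hive-config.xml</job-xml>            <!-- the "Job XML" / "HIVE XML" path -->
    <script>top_n_products.hql</script>
    <param>N=${N}</param>                         <!-- the parameter N -->
    <file>hive-config.xml#hive-config.xml</file>  <!-- the "Files" entry -->
</hive>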

[Screenshot of my Hive action properties]

When I try to run the workflow, it fails, and stderr shows the following:

Log Type: stderr
Log Upload Time: Mon Nov 20 19:49:04 -0800 2017
Log Length: 2759

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/zookeeper/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/var/lib/hadoop-yarn/cache/yarn/nm-local-dir/filecache/130/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Nov 20, 2017 7:47:34 PM com.google.inject.servlet.InternalServletModule$BackwardsCompatibleServletContextProvider get
WARNING: You are attempting to use a deprecated API (specifically, attempting to @Inject ServletContext inside an eagerly created singleton. While we allow this for backwards compatibility, be warned that this MAY have unexpected behavior if you have more than one injector (with ServletModule) running in the same JVM. Please consult the Guice documentation at http://code.google.com/p/google-guice/wiki/Servlets for more information.
Nov 20, 2017 7:47:35 PM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory register
...
INFO: Binding org.apache.hadoop.mapreduce.v2.app.webapp.AMWebServices to GuiceManagedComponentProvider with the scope "PerRequest"
log4j:WARN No appenders could be found for logger (org.apache.hadoop.mapreduce.v2.app.MRAppMaster).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.

Below are the workflow.xml and job.properties that are generated:

1) Workflow XML:

<workflow-app name="Top_N_Products" xmlns="uri:oozie:workflow:0.5">
    <global>
        <job-xml>hive-config.xml</job-xml>
    </global>
    <start to="hive-87ac"/>
    <kill name="Kill">
        <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <action name="hive-87ac" cred="hcat">
        <hive xmlns="uri:oozie:hive-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <job-xml>hive-config.xml</job-xml>
            <script>top_n_products.hql</script>
            <param>N=${N}</param>
            <file>hive-config.xml#hive-config.xml</file>
        </hive>
        <ok to="End"/>
        <error to="Kill"/>
    </action>
    <end name="End"/>
</workflow-app>
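
Note that cred="hcat" refers to a credentials definition that this workflow does not actually contain. On a non-secure cluster (security_enabled=False below) Oozie ignores the attribute, but on a Kerberized cluster the workflow would need a <credentials> section along these lines (a sketch; the principal value is an assumption for illustration):

<credentials>
    <credential name="hcat" type="hcat">
        <property>
            <name>hcat.metastore.uri</name>
            <value>thrift://127.0.0.1:9083</value>
        </property>
        <property>
            <!-- assumption: replace with the cluster's real metastore principal -->
            <name>hcat.metastore.principal</name>
            <value>hive/_HOST@EXAMPLE.COM</value>
        </property>
    </credential>
</credentials>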

2) job.properties

security_enabled=False
send_email=False
dryrun=False
nameNode=hdfs://quickstart.cloudera:8020
jobTracker=localhost:8032
N=10
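
For reference, N travels through two substitutions: Oozie first resolves the EL expression ${N} in the workflow against this file, then passes the result to Hive as a variable definition, and Hive substitutes it into the script:

<!-- In workflow.xml, Oozie resolves ${N} from job.properties (N=10),
     so the Hive action receives the definition N=10 ... -->
<param>N=${N}</param>
<!-- ... and "where num_rank <= ${N}" in top_n_products.hql runs as
     "where num_rank <= 10". -->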

Please note that the Hive query runs perfectly fine through the Hive query editor. Am I missing something while configuring the workflow? Any help is appreciated!

Thanks, Deb

  • That's not an error... – OneCricketeer Nov 28 '17 at 03:58
  • And you can save the Hive query as a Hue document, then use that in Oozie. You shouldn't need an .hql file or a Hive XML (Oozie can read that on its own). – OneCricketeer Nov 28 '17 at 04:01
  • Instead of adding the file via the <file> element, try passing it in via the <job-xml> tag. Make sure the filename is hive-site.xml, as Spark looks for that file on the classpath. – gezapeti Dec 01 '17 at 16:28
  • @gezapeti: I'm not using Spark, simply using a Hive action inside Oozie. – Debasmita Dec 04 '17 at 03:30
  • If it doesn't run from Oozie, it should not run as a standalone query... You've not specified if it ran originally – OneCricketeer Dec 04 '17 at 05:09
  • @cricket_007: Okay, I saved my Hive query and used that directly in Oozie. In the Hive action properties, I also set the HiveServer2 URL to jdbc:hive2://localhost:10000 and gave the password cloudera, which is the default password for the Quickstart VM. However, stderr still shows the same warning/error and my new table is not being created. I feel like I'm not able to see the actual error somehow because it's not being logged correctly. Is there any other setting that I need to do? – Debasmita Dec 04 '17 at 05:13
  • @cricket_007: it runs from HIVE client – Debasmita Dec 04 '17 at 05:15
  • See if this helps https://stackoverflow.com/a/40411366/2308683 – OneCricketeer Dec 04 '17 at 05:19
  • @cricket_007: Thanks a lot for the prompt response :) I got it to work. It seems the issue was whitespace in the saved Hive query. I removed the indentation from the query and it ran smoothly. – Debasmita Dec 04 '17 at 05:36
  • Awesome. Feel free to answer your own question below – OneCricketeer Dec 04 '17 at 05:40
  • @Debasmita, this sounds like a bug. If you have the time, please open an Oozie ticket with a more detailed description so it can be fixed. – gezapeti Dec 28 '17 at 22:25
