I am trying to run a Hadoop job on an EMR cluster. It is submitted as a custom JAR step using a jar-with-dependencies. The job pulls data from Teradata, and I assumed the Teradata JDBC jars would be packed inside the jar-with-dependencies. However, I am still getting an exception:

Exception in thread "main" java.lang.RuntimeException: java.lang.RuntimeException: java.lang.ClassNotFoundException: com.teradata.jdbc.TeraDriver
at org.apache.hadoop.mapreduce.lib.db.DBInputFormat.setConf(DBInputFormat.java:171)

My pom has the following relevant dependencies:

<dependency>
  <groupId>teradata</groupId>
  <artifactId>terajdbc4</artifactId>
  <version>14.10.00.17</version>
</dependency>

<dependency>
  <groupId>teradata</groupId>
  <artifactId>tdgssconfig</artifactId>
  <version>14.10.00.17</version>
</dependency>

I am packaging the uber jar as follows:

  <build>
    <plugins>
      <plugin>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.1</version>
        <configuration>
          <source>1.8</source>
          <target>1.8</target>
          <compilerArgument>-Xlint:-deprecation</compilerArgument>
        </configuration>
      </plugin>

      <plugin>
        <artifactId>maven-assembly-plugin</artifactId>
        <version>2.2.1</version>

        <configuration>
          <descriptors>
          </descriptors>
          <archive>
            <manifest>
            </manifest>
          </archive>
          <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
          </descriptorRefs>
        </configuration>

        <executions>
          <execution>
            <id>make-assembly</id>
            <phase>package</phase>
            <goals>
              <goal>single</goal>
            </goals>
          </execution>
        </executions>
      </plugin>

    </plugins>
  </build>

assembly.xml file:

<assembly>
    <id>aws-emr</id>
    <formats>
        <format>jar</format>
    </formats>
    <includeBaseDirectory>false</includeBaseDirectory>
    <dependencySets>
        <dependencySet>
            <unpack>false</unpack>
            <includes>
            </includes>
            <scope>runtime</scope>
            <outputDirectory>lib</outputDirectory>
        </dependencySet>
        <dependencySet>
            <unpack>true</unpack>
            <includes>
                <include>${groupId}:${artifactId}</include>
            </includes>
        </dependencySet>
    </dependencySets>
</assembly>

Running the EMR command as:

aws emr create-cluster --release-label emr-5.3.1 \
--instance-groups \
    InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge \
    InstanceGroupType=CORE,InstanceCount=5,BidPrice=0.1,InstanceType=m3.xlarge \
--service-role EMR_DefaultRole --log-uri s3://my-bucket/logs \
--applications Name=Hadoop --name TeradataPullerTest \
--ec2-attributes <ec2-attributes> \
--steps Type=CUSTOM_JAR,Name=EventsPuller,Jar=s3://path-to-jar-with-dependencies.jar,\
Args=[com.my.package.EventsPullerMR],ActionOnFailure=TERMINATE_CLUSTER \
--auto-terminate

Is there a way I can specify the Teradata jars such that they are added to the classpath while executing the map-reduce job?

EDIT: I verified that the missing class is packaged in the jar-with-dependencies.

aws-emr$ jar tf target/aws-emr-0.0.1-SNAPSHOT-jar-with-dependencies.jar | grep TeraDriver
com/ncr/teradata/TeraDriver.class
com/teradata/jdbc/TeraDriver.class

1 Answer

I haven't completely resolved this issue yet, but I found a way to make it work. Ideally, packing the Teradata jars inside the uber jar would have been enough; they do get packed, but somehow they are not added to the classpath when the job executes, and I am not sure why that is the case.

I worked around it by creating two separate jars: one with my code and one with all the dependencies it needs. I uploaded both jars to S3 and then wrote a script which does the following (pseudo-code):

# download the main jar
aws s3 cp <s3-path-to-myjar.jar> .

# download the dependency jar into a temp directory
aws s3 cp <s3-path-to-dependency-jar> temp

# unzip the dependency jar into another directory (say `jars`)
unzip -j temp/dependencies.jar <path-within-jar-to-unzip>/* -d jars

# comma-separated list of dependency jars for -libjars
# (paste avoids the trailing comma that `tr` would leave behind)
LIBJARS=$(find jars -name '*.jar' | paste -sd, -)

# the classpath uses colons instead of commas
HADOOP_CLASSPATH=$(echo "${LIBJARS}" | sed 's/,/:/g')

CLASSPATH=${HADOOP_CLASSPATH}

export CLASSPATH HADOOP_CLASSPATH

# run via the hadoop command; -libjars ships the dependency jars
# to the distributed cache so the tasks can see them
hadoop jar myjar.jar com.my.package.EventsPullerMR -libjars ${LIBJARS} <arguments to the job>

This kicks off the job.
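As a quick sanity check of the comma-to-colon conversion in the script above, you can run the sed step on a made-up jar list (the file names here are illustrative only; the jars do not need to exist):

```shell
# hypothetical jar list, in the comma-separated shape -libjars expects
LIBJARS="jars/terajdbc4.jar,jars/tdgssconfig.jar"

# convert commas to colons for the classpath variables
HADOOP_CLASSPATH=$(echo "${LIBJARS}" | sed 's/,/:/g')

echo "${HADOOP_CLASSPATH}"
# → jars/terajdbc4.jar:jars/tdgssconfig.jar
```

This only exercises the string conversion, not the actual jar download or job submission.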
