I am trying to run a Hadoop job on an EMR cluster. It is launched as a Java command using a jar-with-dependencies. The job pulls data from Teradata, and I assumed the Teradata-related jars were also packed into the jar-with-dependencies. However, I still get this exception:
Exception in thread "main" java.lang.RuntimeException: java.lang.RuntimeException: java.lang.ClassNotFoundException: com.teradata.jdbc.TeraDriver
at org.apache.hadoop.mapreduce.lib.db.DBInputFormat.setConf(DBInputFormat.java:171)
My pom has the following relevant dependencies:
<dependency>
    <groupId>teradata</groupId>
    <artifactId>terajdbc4</artifactId>
    <version>14.10.00.17</version>
</dependency>
<dependency>
    <groupId>teradata</groupId>
    <artifactId>tdgssconfig</artifactId>
    <version>14.10.00.17</version>
</dependency>
I am packaging the full jar as follows:
<build>
    <plugins>
        <plugin>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.1</version>
            <configuration>
                <source>1.8</source>
                <target>1.8</target>
                <compilerArgument>-Xlint:-deprecation</compilerArgument>
            </configuration>
        </plugin>
        <plugin>
            <artifactId>maven-assembly-plugin</artifactId>
            <version>2.2.1</version>
            <configuration>
                <descriptors>
                </descriptors>
                <archive>
                    <manifest>
                    </manifest>
                </archive>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
            </configuration>
            <executions>
                <execution>
                    <id>make-assembly</id>
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
The assembly.xml file:
<assembly>
    <id>aws-emr</id>
    <formats>
        <format>jar</format>
    </formats>
    <includeBaseDirectory>false</includeBaseDirectory>
    <dependencySets>
        <dependencySet>
            <unpack>false</unpack>
            <includes>
            </includes>
            <scope>runtime</scope>
            <outputDirectory>lib</outputDirectory>
        </dependencySet>
        <dependencySet>
            <unpack>true</unpack>
            <includes>
                <include>${groupId}:${artifactId}</include>
            </includes>
        </dependencySet>
    </dependencySets>
</assembly>
I create the cluster and submit the step with this command:
aws emr create-cluster --release-label emr-5.3.1 \
--instance-groups \
InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge \
InstanceGroupType=CORE,InstanceCount=5,BidPrice=0.1,InstanceType=m3.xlarge \
--service-role EMR_DefaultRole --log-uri s3://my-bucket/logs \
--applications Name=Hadoop --name TeradataPullerTest \
--ec2-attributes <ec2-attributes> \
--steps Type=CUSTOM_JAR,Name=EventsPuller,Jar=s3://path-to-jar-with-dependencies.jar,\
Args=[com.my.package.EventsPullerMR],ActionOnFailure=TERMINATE_CLUSTER \
--auto-terminate
Is there a way to specify the Teradata jars so that they are added to the classpath when the MapReduce job executes?
EDIT: I verified that the missing class is packaged in the jar-with-dependencies:
aws-emr$ jar tf target/aws-emr-0.0.1-SNAPSHOT-jar-with-dependencies.jar | grep TeraDriver
com/ncr/teradata/TeraDriver.class
com/teradata/jdbc/TeraDriver.class
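The listing above only shows that the class file is inside the jar, not that the JVM running the job can load it. As a rough way to check the latter, a small probe along the following lines could be run with the same classpath as the job. This is only a sketch: `ClasspathProbe` and `locate` are hypothetical names, and it looks the class up through the thread context class loader, which is an assumption and may not be the same loader Hadoop uses when it resolves the driver.

```java
import java.security.CodeSource;

// Hypothetical diagnostic helper, not part of Hadoop or the Teradata driver.
public class ClasspathProbe {

    /**
     * Returns where the named class was loaded from, a placeholder string if
     * the class has no code source (e.g. JDK classes), or null if the class
     * cannot be found or linked.
     */
    public static String locate(String className) {
        try {
            // Assumption: the context class loader is the relevant one.
            ClassLoader loader = Thread.currentThread().getContextClassLoader();
            // 'false' avoids running static initializers during the probe.
            Class<?> c = Class.forName(className, false, loader);
            CodeSource src = c.getProtectionDomain().getCodeSource();
            if (src == null || src.getLocation() == null) {
                return "(bootstrap or unknown location)";
            }
            return src.getLocation().toString();
        } catch (ClassNotFoundException | LinkageError e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Defaults to the driver class named in the exception above.
        String name = args.length > 0 ? args[0] : "com.teradata.jdbc.TeraDriver";
        String where = locate(name);
        if (where == null) {
            System.out.println(name + " is NOT visible to this class loader");
        } else {
            System.out.println(name + " loaded from " + where);
        }
    }
}
```

Calling `ClasspathProbe.locate("com.teradata.jdbc.TeraDriver")` at the start of the job's `main` method would show whether the driver is visible in the client JVM at the point where the exception is thrown.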