
When we run a Sqoop import on our GCP Dataproc clusters with either --as-avrodatafile or --as-parquetfile, it fails with the errors below. Importing as a text file works, so it feels like some additional JARs are required.

The required Sqoop jars are loaded from GCS.

COMMAND used:

gcloud dataproc jobs submit hadoop \
            --cluster={cluster_name} \
            --region=us-central1 \
            --class=org.apache.sqoop.Sqoop \
            --jars={sqoop_jars_gcs}/sqoop-1.4.7.jar,{sqoop_jars_gcs}/avro-1.8.2.jar,{sqoop_jars_gcs}/terajdbc4.jar,{sqoop_jars_gcs}/log4j-1.2.17.jar,{sqoop_jars_gcs}/sqoop-connector-teradata-1.2c5.jar,{sqoop_jars_gcs}/tdgssconfig.jar \
            -- import \
            -Dmapreduce.job.user.classpath.first=true \
            -Dorg.apache.sqoop.splitter.allow_text_splitter=true \
            --connect={db_connection}DATABASE={source_db} \
            --username={userid} \
            --password-file {passfile}  \
            --driver com.teradata.jdbc.TeraDriver \
            -e "sql query AND \$CONDITIONS" \
            --target-dir=<dir> \
            --delete-target-dir \
            --as-<avrodatafile/parquetfile> \
            --split-by <column>

Error when running with --as-avrodatafile (avro-1.8.2.jar is on the classpath, but still no luck):

INFO - Error: java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
INFO -      at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:135)
INFO -      at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:753)
INFO -      at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
INFO -      at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:177)
INFO -      at java.security.AccessController.doPrivileged(Native Method)
INFO -      at javax.security.auth.Subject.doAs(Subject.java:422)
INFO -      at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
INFO -      at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:171)
INFO - Caused by: java.lang.reflect.InvocationTargetException
INFO -      at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
INFO -      at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
INFO -      at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
INFO -      at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
INFO -      at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
INFO -      ... 7 more
INFO - Caused by: java.lang.NoClassDefFoundError: org/apache/avro/mapred/AvroWrapper
INFO -      at org.apache.sqoop.mapreduce.AvroImportMapper.<init>(AvroImportMapper.java:43)
INFO -      ... 12 more
INFO - Caused by: java.lang.ClassNotFoundException: org.apache.avro.mapred.AvroWrapper
INFO -      at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
INFO -      at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
INFO -      at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
INFO -      at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
INFO -      ... 13 more
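Note: org.apache.avro.mapred.AvroWrapper is not in the core avro jar; it ships in the separate avro-mapred artifact (with the hadoop2 classifier on YARN clusters), so adding avro-1.8.2.jar alone cannot resolve it. A minimal sketch of staging it, assuming the same GCS prefix as the other jars:

# stage the Avro MapReduce module next to the other Sqoop jars
gsutil cp avro-mapred-1.8.2-hadoop2.jar {sqoop_jars_gcs}/
# then append it to the --jars list of the gcloud command above:
#   --jars=...,{sqoop_jars_gcs}/avro-mapred-1.8.2-hadoop2.jar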

Error when running with --as-parquetfile:

INFO - at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
INFO - at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
INFO - at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
INFO - at java.lang.reflect.Method.invoke(Method.java:498)
INFO - at com.google.cloud.hadoop.services.agent.job.shim.HadoopRunClassShim.main(HadoopRunClassShim.java:19)
INFO - Caused by: java.lang.NoClassDefFoundError: org/kitesdk/data/mapreduce/DatasetKeyOutputFormat
INFO - at org.apache.sqoop.mapreduce.DataDrivenImportJob.getOutputFormatClass(DataDrivenImportJob.java:213)
INFO - at org.apache.sqoop.mapreduce.ImportJobBase.configureOutputFormat(ImportJobBase.java:98)
INFO - at org.apache.sqoop.mapreduce.ImportJobBase.runImport(ImportJobBase.java:263)
INFO - at org.apache.sqoop.manager.SqlManager.importQuery(SqlManager.java:748)
INFO - at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:522)
INFO - at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:628)
INFO - at org.apache.sqoop.Sqoop.run(Sqoop.java:147)
INFO - at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
INFO - at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:183)
INFO - at org.apache.sqoop.Sqoop.runTool(Sqoop.java:234)
INFO - at org.apache.sqoop.Sqoop.runTool(Sqoop.java:243)
INFO - at org.apache.sqoop.Sqoop.main(Sqoop.java:252)
INFO - ... 5 more
INFO - Caused by: java.lang.ClassNotFoundException: org.kitesdk.data.mapreduce.DatasetKeyOutputFormat
INFO - at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
INFO - at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
INFO - at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
INFO - at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
INFO - ... 17 more
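Note: org.kitesdk.data.mapreduce.DatasetKeyOutputFormat comes from the Kite SDK's kite-data-mapreduce module, which a stock Dataproc cluster does not ship. A sketch of the minimum Kite jars that would need to be staged in GCS and appended to --jars (versions are illustrative, paths hypothetical):

# Kite modules needed by Sqoop 1.4.7's --as-parquetfile code path
#   {sqoop_jars_gcs}/kite-data-core-1.1.0.jar
#   {sqoop_jars_gcs}/kite-data-mapreduce-1.1.0.jar
#   {sqoop_jars_gcs}/kite-hadoop-compatibility-1.1.0.jar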
  • I don't know if it's related, but Hadoop seems to override the classpath somehow. I had to compile an uberjar to get anything to work: https://stackoverflow.com/questions/59738341/sqoop-on-hadoop-nosuchmethoderror-com-google-common-base-stopwatch-createstart – mikebridge Feb 14 '20 at 18:56
  • This may also be relevant, regarding the classpath: https://community.cloudera.com/t5/Support-Questions/Sqoop-import-to-avro-failing-which-jars-to-be-used/m-p/124079 – mikebridge Feb 14 '20 at 18:59

2 Answers


I ran into a similar issue and was able to use the following resources to move the kite-sdk dependency to a newer version. I had to download the latest kite-sdk JARs.

https://community.cloudera.com/t5/Support-Questions/Issue-when-using-parquet-org-kitesdk-data/td-p/128233

https://discuss.cloudxlab.com/t/sqoop-import-to-hive-as-parquet-file-is-failing/1089/6

  • I added kite-data-core-1.1.0.jar, kite-data-hive-1.1.0.jar, kite-data-mapreduce-1.1.0.jar and kite-hadoop-compatibility-1.1.0.jar to the classpath, but I'm getting org.kitesdk.data.DatasetNotFoundException: Unknown dataset URI pattern: dataset:gs:///folder – user3752941 Oct 02 '19 at 17:16
  • Can anybody help? – user3752941 Oct 24 '19 at 14:55
  • I added all sorts of Parquet jars, but now it cannot recognize the gs:// target-dir path: org.kitesdk.data.DatasetNotFoundException: Unknown dataset URI pattern: dataset:gs://aeo_sqoop_dev/test_output/avro/teradata/CRM_CAMPAIGN_RUN Check that JARs for gs datasets are on the classpath at org.kitesdk.data.spi.Registration.lookupDatasetUri(Registration.java:128) at org.kitesdk.data.Datasets.exists(Datasets.java:624) at org.kitesdk.data.Datasets.exists(Datasets.java:646) – user3752941 Oct 24 '19 at 14:56
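The DatasetNotFoundException in these comments is a separate problem: Kite resolves dataset:<scheme>: URIs against a registry of known filesystem schemes, and gs is not registered, so a GCS target directory fails even with the Kite jars in place. One possible workaround (a sketch; paths are illustrative) is to import into HDFS first and copy the result to GCS afterwards with DistCp:

# 1. run the import with --target-dir hdfs:///tmp/sqoop_parquet_out
# 2. copy the output from HDFS to GCS
gcloud dataproc jobs submit hadoop \
    --cluster={cluster_name} \
    --region=us-central1 \
    --class=org.apache.hadoop.tools.DistCp \
    -- hdfs:///tmp/sqoop_parquet_out gs://{bucket}/sqoop/out/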

This worked for me only when using a very specific set of JAR versions, mostly from Cloudera, as follows:

# Jars used:
#   https://repo1.maven.org/maven2/org/apache/parquet/parquet-format/2.9.0/parquet-format-2.9.0.jar
#   https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/sqoop/sqoop/1.4.7.7.2.10.0-148/sqoop-1.4.7.7.2.10.0-148.jar
#   https://repository.cloudera.com/artifactory/cloudera-repos/org/kitesdk/kite-data-core/1.0.0-cdh6.3.4/kite-data-core-1.0.0-cdh6.3.4.jar
#   https://repository.cloudera.com/artifactory/cloudera-repos/org/kitesdk/kite-data-mapreduce/1.0.0-cdh6.3.4/kite-data-mapreduce-1.0.0-cdh6.3.4.jar
#   https://repository.cloudera.com/artifactory/cloudera-repos/org/kitesdk/kite-hadoop-compatibility/1.0.0-cdh6.3.4/kite-hadoop-compatibility-1.0.0-cdh6.3.4.jar
#   https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/avro/avro/1.8.2.7.2.10.0-148/avro-1.8.2.7.2.10.0-148.jar
#   https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/avro/avro-mapred/1.8.2.7.2.10.0-148/avro-mapred-1.8.2.7.2.10.0-148.jar
#   https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/parquet/parquet-avro/1.10.99.7.2.10.0-148/parquet-avro-1.10.99.7.2.10.0-148.jar
#   https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/parquet/parquet-common/1.10.99.7.2.10.0-148/parquet-common-1.10.99.7.2.10.0-148.jar
#   https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/parquet/parquet-column/1.10.99.7.2.10.0-148/parquet-column-1.10.99.7.2.10.0-148.jar
#   https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/parquet/parquet-hadoop/1.10.99.7.2.10.0-148/parquet-hadoop-1.10.99.7.2.10.0-148.jar
#   https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/parquet/parquet-jackson/1.10.99.7.2.10.0-148/parquet-jackson-1.10.99.7.2.10.0-148.jar
#   https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/parquet/parquet-encoding/1.10.99.7.2.10.0-148/parquet-encoding-1.10.99.7.2.10.0-148.jar
 
export CLUSTER_NAME=
export CLUSTER_REGION=us-central1 # update accordingly
export GCS_BUCKET=""  # name only
 
export DRIVER_CLASS=com.mysql.jdbc.Driver
export CONNECT_STRING="jdbc:..."
export USERNAME=
export PASSWORD=""  # testing only - use --password-file instead
export TABLE=
 
export JDBC_JAR=gs://${GCS_BUCKET}/sqoop/jars/mysql-connector-java-5.0.8-bin.jar
 
export SQOOP_JAR=gs://${GCS_BUCKET}/sqoop/jars/cloudera/sqoop-1.4.7.7.2.10.0-148.jar
export AVRO_JAR1=gs://${GCS_BUCKET}/sqoop/jars/cloudera/avro-1.8.2.7.2.10.0-148.jar
export AVRO_JAR2=gs://${GCS_BUCKET}/sqoop/jars/cloudera/avro-mapred-1.8.2.7.2.10.0-148.jar
export PARQUET_JAR1=gs://${GCS_BUCKET}/sqoop/jars/cloudera/kite-data-core-1.0.0-cdh6.3.4.jar
export PARQUET_JAR2=gs://${GCS_BUCKET}/sqoop/jars/cloudera/kite-data-mapreduce-1.0.0-cdh6.3.4.jar
export PARQUET_JAR3=gs://${GCS_BUCKET}/sqoop/jars/cloudera/kite-hadoop-compatibility-1.0.0-cdh6.3.4.jar
export PARQUET_JAR4=gs://${GCS_BUCKET}/sqoop/jars/cloudera/parquet-common-1.10.99.7.2.10.0-148.jar
export PARQUET_JAR5=gs://${GCS_BUCKET}/sqoop/jars/cloudera/parquet-avro-1.10.99.7.2.10.0-148.jar
export PARQUET_JAR6=gs://${GCS_BUCKET}/sqoop/jars/cloudera/parquet-hadoop-1.10.99.7.2.10.0-148.jar
export PARQUET_JAR7=gs://${GCS_BUCKET}/sqoop/jars/cloudera/parquet-column-1.10.99.7.2.10.0-148.jar
export PARQUET_JAR8=gs://${GCS_BUCKET}/sqoop/jars/cloudera/parquet-encoding-1.10.99.7.2.10.0-148.jar
export PARQUET_JAR9=gs://${GCS_BUCKET}/sqoop/jars/cloudera/parquet-jackson-1.10.99.7.2.10.0-148.jar
export PARQUET_JAR10=gs://${GCS_BUCKET}/sqoop/jars/parquet-format-2.9.0.jar
 
gcloud dataproc jobs submit hadoop \
--cluster=${CLUSTER_NAME} \
--class=org.apache.sqoop.Sqoop \
--region=${CLUSTER_REGION} \
--jars=${JDBC_JAR},${SQOOP_JAR},${AVRO_JAR1},${AVRO_JAR2},${PARQUET_JAR1},${PARQUET_JAR2},${PARQUET_JAR3},${PARQUET_JAR4},${PARQUET_JAR5},${PARQUET_JAR6},${PARQUET_JAR7},${PARQUET_JAR8},${PARQUET_JAR9},${PARQUET_JAR10} \
-- import \
-Dmapreduce.job.user.classpath.first=true \
-Dparquetjob.configurator.implementation=hadoop \
--driver ${DRIVER_CLASS} \
--connect=${CONNECT_STRING} \
--username=${USERNAME} \
--password=${PASSWORD} \
--target-dir="gs://${GCS_BUCKET}/sqoop/out/parquet_output4/" \
--table=${TABLE} \
--delete-target-dir \
--as-parquetfile \
-m 1 \
--verbose
 
# --parquet-configurator-implementation kite \
#--compression-codec snappy \
#--query=""
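For what it's worth, -Dparquetjob.configurator.implementation=hadoop tells Cloudera's Sqoop build to use the plain parquet-hadoop writer instead of Kite, which sidesteps Kite's dataset URI handling and is likely why the gs:// target directory works here. A quick sanity check on the output, assuming gsutil and the parquet-tools CLI are available locally:

# list the imported files and inspect the schema of one of them
gsutil ls -r gs://${GCS_BUCKET}/sqoop/out/parquet_output4/
gsutil cp "gs://${GCS_BUCKET}/sqoop/out/parquet_output4/*.parquet" /tmp/
parquet-tools schema /tmp/*.parquet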