
I am working on syncing a very large database to Hive.

There are two issues: (1) text imports are slower and involve a slow, large MapReduce step; (2) SequenceFile imports are much faster but are not readable by normal means.

Details follow:

(1) If we import the data as text, it is slower. The files accumulate in a temp folder under the home directory, and eventually a MapReduce job is launched, which is rather slow.

17/04/25 04:18:34 INFO mapreduce.Job: Job job_1490822567992_0996 running in uber mode : false
17/04/25 04:18:34 INFO mapreduce.Job:  map 0% reduce 0%
17/04/25 11:05:59 INFO mapreduce.Job:  map 29% reduce 0%
17/04/25 11:20:18 INFO mapreduce.Job:  map 86% reduce 0% <-- tends to hang a very long time here

(A lot of lines deleted for brevity.)

(2) If we import the tables as SequenceFiles, it is much faster, but the imported data is unreadable by Hive, since Hive would need to know about the autogenerated Java classes that get created. This path also has a MapReduce step, but it seemed to go faster (or perhaps that was a time-of-day thing...).

We have a series of these classes, one per table, produced by Sqoop: `public class MyTableName extends SqoopRecord implements DBWritable, Writable`

What are the steps for using those classes? How do we install them in Hive? Surprisingly, the Cloudera support engineer doesn't know; this must be rarely charted territory?

sqoop import-all-tables --connect '...' --relaxed-isolation --num-mappers 7 --compress --autoreset-to-one-mapper --compression-codec=snappy --outdir javadir --as-sequencefile --hive-delims-replacement ' '

Any advice?

Chris
    this is related to http://stackoverflow.com/questions/42050548/loading-sequence-file-data-into-hive-table-created-using-stored-as-sequence-file – Chris Apr 25 '17 at 22:16
  • _"sequencefiles ... the retrieved data is unreadable by Hive since it needs to know about the autogenerated Java files that get created"_ >> what is that nonsense?? Hive simply requires an appropriate `CREATE TABLE` command to know about the SequenceFile structure. That's the purpose of `--hive-import` cf. http://stackoverflow.com/questions/31515498/when-to-use-sqoop-create-hive-table – Samson Scharfrichter Apr 26 '17 at 01:53
  • Additionally, since Sqoop generates a MapReduce job, using Snappy *(or LZ4)* compression for intermediate files and Snappy *(or LZ4, or GZip)* compression for final files could make a significant difference in performance; cf. `mapreduce.map.output.compress*` and `mapreduce.output.fileoutputformat.compress*` in `mapred-default.xml` *(a sketch of passing these via `-D` follows after these comments)* – Samson Scharfrichter Apr 26 '17 at 01:53
  • Hive import commands are not available for sequence files. – Chris Apr 26 '17 at 05:02
  • Did you consider using a Spark script as a replacement for Sqoop *(which, after all, was never more than a stop-gap solution sponsored by Cloudera in the pre-Spark era)*? As a bonus you could get compressed Parquet files as output. – Samson Scharfrichter Apr 27 '17 at 20:12
  • The Cloudera people still push for Sqoop but I am open to Spark. Do you have some sample code? – Chris Apr 28 '17 at 21:07
  • Got some PySpark 1.6 snippets in a notebook somewhere in the office; will search for them when the weekend is over (i.e. on Tuesday, cf. Labor Day on May 1st). Big tables might need special care about JDBC partitioning rules and Spark memory settings, though. BTW, did you run some benchmarks on a Sqoop TextFile import w/ different compression settings? – Samson Scharfrichter Apr 28 '17 at 22:39
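
A minimal sketch of how those MapReduce compression properties could be passed to Sqoop as generic Hadoop `-D` options (the property names are standard Hadoop 2 settings; whether the map-output ones matter depends on whether the import job actually has a reduce phase, and the final-output codec duplicates the `--compression-codec` flag already used above):

sqoop import-all-tables \
    -Dmapreduce.map.output.compress=true \
    -Dmapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec \
    -Dmapreduce.output.fileoutputformat.compress=true \
    -Dmapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec \
    --connect '...' --num-mappers 7 ...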

1 Answer


> I am open to Spark. Do you have some sample code?

Disclaimer: I just assembled some snippets from multiple notebooks, and was too lazy (and hungry) to launch a test-run before leaving the office. Any bugs and typos are yours to find.


Using Spark 2.0 as available from the Cloudera parcel (with Hive support), as an interactive-style Scala script, in local mode, without any data partitioning, with a Microsoft SQL Server connection, and inserting directly into an existing Hive managed table (with some extra business logic)...
spark2-shell --master local --driver-class-path /some/path/to/sqljdbc42.jar

// Side note: automatic registration of type 4 JDBC drivers is broken in multiple Spark builds, and the bug keeps on reappearing, so it's safer to specify the driver class just in case...

val weather = spark.read.format("jdbc").
  option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver").
  option("url", "jdbc:sqlserver://myhost\\SQLExpress:9433;database=mydb").
  option("user", "mylogin").
  option("password", "*****").
  option("dbtable", "weather_obs").
  load()
{ println(s"%%% Partitions: ${weather.rdd.getNumPartitions} / Records: ${weather.count}")
  println("%%% Detailed DF schema:")
  weather.printSchema
}

// Alternative for "dbtable" using a sub-query:
// "(SELECT station, dt_obs_utc, temp_k FROM weather_obs WHERE station LIKE '78%') x")

weather.createOrReplaceTempView("wth")
spark.sql(
    """
    INSERT INTO TABLE somedb.sometable
    SELECT station, dt_obs_utc, CAST(temp_k -273.15 AS DECIMAL(3,1)) as temp_c
    FROM wth
    WHERE temp_k IS NOT NULL
    """)
spark.catalog.dropTempView("wth")

weather.unpersist()


Now, if you want to dynamically create a Hive external table on Parquet files with GZip compression, replace the "temp table" trick above with...
weather.write.option("compression","gzip").mode("overwrite").parquet("hdfs:///some/directory/")

// supported compression codecs for Parquet: none, snappy (default), gzip
// supported compression codecs for CSV: none (default), snappy, lz4, gzip, bzip2

def toImpalaType(sparkType: String): String = sparkType match {
  case "StringType" | "BinaryType"        => "string"
  case "BooleanType"                      => "boolean"
  case "ByteType"                         => "tinyint"
  case "ShortType"                        => "smallint"
  case "IntegerType"                      => "int"
  case "LongType"                         => "bigint"
  case "FloatType"                        => "float"
  case "DoubleType"                       => "double"
  case t if t.startsWith("DecimalType")   => t.replace("DecimalType", "decimal")
  case "TimestampType" | "DateType"       => "timestamp"
  case other =>
    println("########## ERROR - \"" + other + "\" not supported (bug)")
    "string"
}

spark.sql("DROP TABLE IF EXISTS somedb.sometable")
{ val query = new StringBuilder
  query.append("CREATE EXTERNAL TABLE somedb.sometable")
  val weatherSchema = weather.dtypes
  val (colName0, colType0) = weatherSchema(0)
  query.append("\n ( " + colName0 + " " + toImpalaType(colType0))
  for (i <- 1 until weatherSchema.length) { val (colName_, colType_) = weatherSchema(i) ; query.append("\n , " + colName_ + " " + toImpalaType(colType_)) }
  query.append("\n )\nCOMMENT 'Imported from SQL Server by Spark'")
  query.append("\nSTORED AS Parquet")
  query.append("\nLOCATION 'hdfs:///some/directory'")
  spark.sql(query.toString())
  query.clear()
}


If you want to partition your input table (based on a numeric column; date/time is not supported, AFAIK), then look into the JDBC import options partitionColumn, lowerBound, upperBound and numPartitions.
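
For illustration, a minimal sketch of such a partitioned read; the partition column obs_id and its bounds are assumptions, so substitute a numeric column of your own table and its actual min/max:

val weatherPart = spark.read.format("jdbc").
  option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver").
  option("url", "jdbc:sqlserver://myhost\\SQLExpress:9433;database=mydb").
  option("user", "mylogin").
  option("password", "*****").
  option("dbtable", "weather_obs").
  option("partitionColumn", "obs_id").   // hypothetical numeric column
  option("lowerBound", "1").             // assumed minimum of obs_id
  option("upperBound", "10000000").      // assumed maximum of obs_id
  option("numPartitions", "7").          // number of parallel JDBC reads
  load()

Note that lowerBound and upperBound only define the stride of the generated WHERE clauses; rows outside that range still end up in the first and last partitions.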

If you want to load these partitions in parallel in YARN-client mode, then add a --jars argument to upload the JDBC driver to the executors.
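
For example (a sketch; the JAR path is the same assumption as in the local-mode command above):

spark2-shell --master yarn --deploy-mode client --driver-class-path /some/path/to/sqljdbc42.jar --jars /some/path/to/sqljdbc42.jar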

Samson Scharfrichter