I am open to Spark. Do you have some sample code?
Disclaimer: I just assembled some snippets from multiple notebooks, and was too lazy (and hungry) to launch a test-run before leaving the office. Any bugs and typos are yours to find.
Using Spark 2.0 available from the Cloudera parcel (with Hive support), an interactive-style Scala script, in local mode, without any data partitioning, a Microsoft SQL Server connection, and inserting directly into an existing Hive managed table (with some extra business logic)...
spark2-shell --master local --driver-class-path /some/path/to/sqljdbc42.jar
// Side note: automatic registration of type 4 JDBC drivers is broken in multiple Spark builds, and the bug keeps on reappearing, so it's safer to specify the driver class just in case...
val weather = spark.read.format("jdbc").
  option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver").
  option("url", "jdbc:sqlserver://myhost\\SQLExpress:9433;database=mydb").
  option("user", "mylogin").
  option("password", "*****").
  option("dbtable", "weather_obs").
  load()
{ println("%%% Partitions: " + weather.rdd.getNumPartitions + " / Records: " + weather.count)
  println("%%% Detailed DF schema:")
  weather.printSchema
}
// Alternative for "dbtable", using a sub-query:
// "(SELECT station, dt_obs_utc, temp_k FROM weather_obs WHERE station LIKE '78%') x"
weather.createOrReplaceTempView("wth")  // registerTempTable is deprecated in Spark 2.0
spark.sql(
"""
INSERT INTO TABLE somedb.sometable
SELECT station, dt_obs_utc, CAST(temp_k - 273.15 AS DECIMAL(3,1)) AS temp_c
FROM wth
WHERE temp_k IS NOT NULL
""")
spark.catalog.dropTempView("wth")
weather.unpersist()
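Optional sanity check after the insert:

spark.sql("SELECT COUNT(*) FROM somedb.sometable").show()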
Now, if you want to dynamically create a Hive external table on a Parquet file with GZip compression, replace the "temp table" trick with...
weather.write.option("compression","gzip").mode("overwrite").parquet("hdfs:///some/directory/")
// supported compression codecs for Parquet: none, snappy (default), gzip
// supported compression codecs for CSV: none (default), snappy, lz4, gzip, bzip2
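If you preferred CSV over Parquet, the write is symmetrical (a sketch; the target directory and the header option are my choices, adjust as needed):

weather.write.option("compression","gzip").option("header","true").mode("overwrite").csv("hdfs:///some/csv/directory/")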
def toImpalaType(sparkType: String): String = sparkType match {
  case "StringType" | "BinaryType"      => "string"
  case "BooleanType"                    => "boolean"
  case "ByteType"                       => "tinyint"
  case "ShortType"                      => "smallint"
  case "IntegerType"                    => "int"
  case "LongType"                       => "bigint"
  case "FloatType"                      => "float"
  case "DoubleType"                     => "double"
  case t if t.startsWith("DecimalType") => t.replace("DecimalType", "decimal")
  case "TimestampType" | "DateType"     => "timestamp"
  case t =>
    println("########## ERROR - \"" + t + "\" not supported (bug)")
    "string"
}
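Quick usage check against the DataFrame schema (dtypes yields (name, type-string) pairs, which is exactly what the mapping expects):

weather.dtypes.foreach { case (name, tpe) =>
  println(name + " : " + tpe + " -> " + toImpalaType(tpe))
}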
spark.sql("DROP TABLE IF EXISTS somedb.sometable")
{ val query = new StringBuilder
  query.append("CREATE EXTERNAL TABLE somedb.sometable")
  val weatherSchema = weather.dtypes
  val (colName0, colType0) = weatherSchema(0)
  query.append("\n ( " + colName0 + " " + toImpalaType(colType0))
  for ((colName, colType) <- weatherSchema.tail) {
    query.append("\n , " + colName + " " + toImpalaType(colType))
  }
  query.append("\n )\nCOMMENT 'Imported from SQL Server by Spark'")
  query.append("\nSTORED AS Parquet")
  query.append("\nLOCATION 'hdfs:///some/directory'")
  spark.sql(query.toString())
  query.clear()
}
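A quick check that the DDL took (and note that if you query the table from Impala, as the type-mapping function suggests, it won't appear there until you run INVALIDATE METADATA from impala-shell):

spark.sql("DESCRIBE somedb.sometable").show(100, false)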
If you want to partition your input table (based on a numeric column; date/time not supported AFAIK) then look into the JDBC import options partitionColumn, lowerBound, upperBound and numPartitions (all four go together), as in the sketch below.
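A hedged sketch, assuming a numeric column station_id with roughly known bounds (both the column name and the bounds are made up for illustration):

val weatherPar = spark.read.format("jdbc").
  option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver").
  option("url", "jdbc:sqlserver://myhost\\SQLExpress:9433;database=mydb").
  option("user", "mylogin").
  option("password", "*****").
  option("dbtable", "weather_obs").
  option("partitionColumn", "station_id").
  option("lowerBound", "0").
  option("upperBound", "100000").
  option("numPartitions", "8").
  load()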
If you want to load these partitions in parallel in YARN-client mode, then also add a --jars argument to upload the JDBC driver to the executors, for example:
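spark2-shell --master yarn --deploy-mode client \
  --driver-class-path /some/path/to/sqljdbc42.jar \
  --jars /some/path/to/sqljdbc42.jar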