Am trying to JAR a simple scala application which make use of SparlCSV and spark sql to create a Data frame of the CSV file stored in HDFS and then just make a simple query to return the Max and Min of specific column in CSV file.
I am getting error when i use the sbt command to create the JAR which later i will curl to jobserver /jars folder and execute from remote machine
Code:
import com.typesafe.config.{Config, ConfigFactory}
import org.apache.spark.SparkContext._
import org.apache.spark._
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
object sparkSqlCSV extends SparkJob {
def main(args: Array[String]) {
val conf = new SparkConf().setMaster("local[4]").setAppName("sparkSqlCSV")
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val config = ConfigFactory.parseString("")
val results = runJob(sc, config)
println("Result is " + results)
}
override def validate(sc: sqlContext, config: Config): SparkJobValidation = {
SparkJobValid
}
override def runJob(sc: sqlContext, config: Config): Any = {
val value = "com.databricks.spark.csv"
val ControlDF = sqlContext.load(value,Map("path"->"hdfs://mycluster/user/Test.csv","header"->"true"))
ControlDF.registerTempTable("Control")
val aggDF = sqlContext.sql("select max(DieX) from Control")
aggDF.collectAsList()
}
}
Error:
[hduser@ptfhadoop01v spark-jobserver]$ sbt ashesh-jobs/package
[info] Loading project definition from /usr/local/hadoop/spark-jobserver/project
Missing bintray credentials /home/hduser/.bintray/.credentials. Some bintray features depend on this.
Missing bintray credentials /home/hduser/.bintray/.credentials. Some bintray features depend on this.
Missing bintray credentials /home/hduser/.bintray/.credentials. Some bintray features depend on this.
Missing bintray credentials /home/hduser/.bintray/.credentials. Some bintray features depend on this.
[info] Set current project to root (in build file:/usr/local/hadoop/spark-jobserver/)
[info] scalastyle using config /usr/local/hadoop/spark-jobserver/scalastyle-config.xml
[info] Processed 2 file(s)
[info] Found 0 errors
[info] Found 0 warnings
[info] Found 0 infos
[info] Finished in 9 ms
[success] created output: /usr/local/hadoop/spark-jobserver/ashesh-jobs/target
[warn] Credentials file /home/hduser/.bintray/.credentials does not exist
[info] Updating {file:/usr/local/hadoop/spark-jobserver/}ashesh-jobs...
[info] Resolving org.fusesource.jansi#jansi;1.4 ...
[info] Done updating.
[info] scalastyle using config /usr/local/hadoop/spark-jobserver/scalastyle-config.xml
[info] Processed 5 file(s)
[info] Found 0 errors
[info] Found 0 warnings
[info] Found 0 infos
[info] Finished in 1 ms
[success] created output: /usr/local/hadoop/spark-jobserver/job-server-api/target
[info] Compiling 2 Scala sources and 1 Java source to /usr/local/hadoop/spark-jobserver/ashesh-jobs/target/scala-2.10/classes...
[error] /usr/local/hadoop/spark-jobserver/ashesh-jobs/src/spark.jobserver/sparkSqlCSV.scala:8: object sql is not a member of package org.apache.spark
[error] import org.apache.spark.sql.SQLContext
[error] ^
[error] /usr/local/hadoop/spark-jobserver/ashesh-jobs/src/spark.jobserver/sparkSqlCSV.scala:14: object sql is not a member of package org.apache.spark
[error] val sqlContext = new org.apache.spark.sql.SQLContext(sc)
[error] ^
[error] /usr/local/hadoop/spark-jobserver/ashesh-jobs/src/spark.jobserver/sparkSqlCSV.scala:25: not found: type sqlContext
[error] override def runJob(sc: sqlContext, config: Config): Any = {
[error] ^
[error] /usr/local/hadoop/spark-jobserver/ashesh-jobs/src/spark.jobserver/sparkSqlCSV.scala:21: not found: type sqlContext
[error] override def validate(sc: sqlContext, config: Config): SparkJobValidation = {
[error] ^
[error] /usr/local/hadoop/spark-jobserver/ashesh-jobs/src/spark.jobserver/sparkSqlCSV.scala:27: not found: value sqlContext
[error] val ControlDF = sqlContext.load(value,Map("path"->"hdfs://mycluster/user/Test.csv","header"->"true"))
[error] ^
[error] /usr/local/hadoop/spark-jobserver/ashesh-jobs/src/spark.jobserver/sparkSqlCSV.scala:29: not found: value sqlContext
[error] val aggDF = sqlContext.sql("select max(DieX) from Control")
[error] ^
[error] 6 errors found
[error] (ashesh-jobs/compile:compileIncremental) Compilation failed
[error] Total time: 10 s, completed May 26, 2016 4:42:52 PM
[hduser@ptfhadoop01v spark-jobserver]$
I guess the main issue being that its missing the dependencies for sparkCSV and sparkSQL , But i have no idea where to place the dependencies before compiling the code using sbt.
I am issuing the following command to package the application , The source codes are placed under "ashesh_jobs" directory
[hduser@ptfhadoop01v spark-jobserver]$ sbt ashesh-jobs/package
I hope someone can help me to resolve this issue.Can you specify me the file where i can specify the dependency and the format to input