
I have been practising building a sample model using the online resources provided on the Spark website. I managed to create the model and run it against sample data using spark-shell, but how do I actually run the model in a production environment? Is it via Spark Job Server?

import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.regression.LabeledPoint  
import org.apache.spark.mllib.linalg.Vectors

val data = sc.textFile("hdfs://mycluster/user/Cancer.csv")
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts.last.toDouble, Vectors.dense(parts.take(9).map(_.toDouble)))
}
val svm = new SVMWithSGD().setIntercept(true)
val model = svm.run(parsedData)
val predictedValue = model.predict(Vectors.dense(5,1,1,1,2,1,3,1,1))
println(predictedValue)

The above code works perfectly when I run it in spark-shell, but I have no idea how to actually run the model in a production environment. I tried to run it via Spark Job Server, but I get an error:

curl -d "input.string = 1, 2, 3, 4, 5, 6, 7, 8, 9" 'ptfhadoop01v:8090/jobs?appName=SQL&classPath=spark.jobserver.SparkPredict'

I am sure it's because I am passing a string value whereas the program expects vector elements. Can someone guide me on how to achieve this? Also, is this how data is passed to the model in a production environment, or is it done some other way?


1 Answer


Spark Job Server is used in production use cases where you want to design pipelines of Spark jobs and, optionally, share the same SparkContext across jobs, all over a REST API. Sparkplug is an alternative to Spark Job Server that provides similar constructs.

However, to answer your question about how to run a (single) Spark job in a production environment: you do not need a third-party library to do so. You only need to construct a SparkContext object and use it to trigger Spark jobs. For instance, for your code snippet, all that is needed is:

package runner

import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

import com.typesafe.config.{ConfigFactory, Config}
import org.apache.spark.{SparkConf, SparkContext}
/**
 * Stand-alone runner that trains the SVM model and prints a single prediction.
 */
object SparkRunner {

  def main (args: Array[String]){

    val config: Config = ConfigFactory.load("app-default-config") /*Use a library to read a config file*/
    val sc: SparkContext = constructSparkContext(config)

    val data = sc.textFile("hdfs://mycluster/user/Cancer.csv")
    val parsedData = data.map { line =>
      val parts = line.split(',')
      LabeledPoint(parts.last.toDouble, Vectors.dense(parts.take(9).map(_.toDouble)))
    }
    val svm = new SVMWithSGD().setIntercept(true)
    val model = svm.run(parsedData)
    val predictedValue = model.predict(Vectors.dense(5,1,1,1,2,1,3,1,1))
    println(predictedValue)
  }


  def constructSparkContext(config: Config): SparkContext = {
    val conf = new SparkConf()
    conf
      .setMaster(config.getString("spark.master"))
      .setAppName(config.getString("app.name"))
    /*Set more configuration values here*/

    new SparkContext(conf)
  }


}

Optionally, you can also use SparkSubmit, the wrapper for the spark-submit script that is provided in the Spark library itself.
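
In practice, a common way to launch a job like the one above is to package it as a jar and hand it to the spark-submit script, along these lines (the master URL and jar path are placeholders for your own cluster and build output):

spark-submit --class runner.SparkRunner --master <your-master-url> /path/to/your-application.jar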

– suj1th
  • Hi sujith, thanks a lot for the clarification and sample code. I believe my inference about using Spark Job Server for deploying the model in a production environment was correct. But I still have tons of questions, which I believe will become clear as I dig deeper into this. For now, let's say I am deploying this piece of code as a Spark jar and I want to run it via Spark Job Server on a remote machine. Any hint on how I can pass the input string as a vector, or convert the string into a vector, which I can use to predict the output and return the result? In short, how can I pass new data? – Ashesh Nair Jul 19 '16 at 13:34
  • @AsheshNair The REST API provided by spark-jobserver is intended for 'managing' Spark jobs, and as such, inputs to the jobs are not passed as parameters to the REST calls. Only a POST entity, which is a file in Typesafe Config format, is expected; it is merged with the job server's config file at startup. – suj1th Jul 19 '16 at 14:04
  • @AsheshNair The usual production scenario is that any input a Spark job requires is either read from a database/HDFS store or from a configuration file (a sketch along these lines is shown below). – suj1th Jul 19 '16 at 14:07
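
To make the comments above concrete, here is a minimal sketch of how the prediction could be exposed as a spark-jobserver job, assuming the classic spark.jobserver.SparkJob API (validate/runJob, as in job-server 0.6.x). The object name, config key and HDFS path mirror the question and are illustrative only. The curl call from the question would then POST input.string = "5,1,1,1,2,1,3,1,1", and runJob parses that string into the vector used for prediction:

import com.typesafe.config.Config
import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import spark.jobserver.{SparkJob, SparkJobInvalid, SparkJobValid, SparkJobValidation}

object SparkPredict extends SparkJob {

  // Reject the request early if the expected config key is missing.
  override def validate(sc: SparkContext, config: Config): SparkJobValidation =
    if (config.hasPath("input.string")) SparkJobValid
    else SparkJobInvalid("No input.string config param")

  override def runJob(sc: SparkContext, config: Config): Any = {
    // Parse the comma-separated feature string from the POSTed config into doubles.
    val features = config.getString("input.string").split(",").map(_.trim.toDouble)

    // Train the model as in the question. In a real pipeline you would more
    // likely train once, persist the model, and load it here instead of
    // retraining on every request.
    val data = sc.textFile("hdfs://mycluster/user/Cancer.csv")
    val parsedData = data.map { line =>
      val parts = line.split(',')
      LabeledPoint(parts.last.toDouble, Vectors.dense(parts.take(9).map(_.toDouble)))
    }
    val model = new SVMWithSGD().setIntercept(true).run(parsedData)

    // The return value is serialized back to the caller by the job server.
    model.predict(Vectors.dense(features))
  }
}

Retraining per request is kept here only for symmetry with the question's snippet; the essential point is that the REST input arrives through the Typesafe Config object passed to runJob, not as a method parameter.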