
I am trying to create a DataFrame using Spark's SQLContext. I am using Spark 1.6.3 and Scala 2.10.5. Below is my code for creating the DataFrame.

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
import com.knoldus.pipeline.KMeansPipeLine

object SimpleApp {

  def main(args: Array[String]) {

    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)

    import sqlContext.implicits._

    val kMeans = new KMeansPipeLine()
    val df = sqlContext.createDataFrame(Seq(
      ("a@email.com", 12000, "M"),
      ("b@email.com", 43000, "M"),
      ("c@email.com", 5000, "F"),
      ("d@email.com", 60000, "M")
    )).toDF("email", "income", "gender")

    val categoricalFeatures = List("gender", "email")
    val numberOfClusters = 2
    val iterations = 10
    val predictionResult = kMeans.predict(sqlContext, df, categoricalFeatures, numberOfClusters, iterations)
  }
}

It's giving me the following exception. What mistake am I making? Can anyone help me resolve this?

 Exception in thread "main" java.lang.NoSuchMethodError:
    org.apache.spark.sql.SQLContext.createDataFrame(Lscala/collection/Seq;Lscala/reflect/api/TypeTags$TypeTag;)Lorg/apache/spark/sql/Dataset;
    at SimpleApp$.main(SimpleApp.scala:24)
    at SimpleApp.main(SimpleApp.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

The dependencies I have used are:

scalaVersion := "2.10.5" 
libraryDependencies ++= Seq( 
 "org.apache.spark" % "spark-core_2.10" % "2.0.0" % "provided", 
 "org.apache.spark" % "spark-sql_2.10" % "2.0.0" % "provided", 
 "org.apache.spark" % "spark-mllib_2.10" % "2.0.0" % "provided", 
 "knoldus" % "k-means-pipeline" % "0.0.1" )
Balkrushn
  • Your code works well for me. I'm guessing your Spark binaries were compiled with Scala 2.11, so they can't run with your code compiled against Scala 2.10, the reverse of the problem described here: http://stackoverflow.com/questions/27728731/scala-code-throw-exception-in-spark – Tzach Zohar Oct 18 '16 at 11:08
  • @TzachZohar how can I resolve this? – Balkrushn Oct 18 '16 at 11:15
  • First, your dependencies show you're using Spark 2.0.0, not 1.6.3 as you state above. Spark 2.0.0 uses Scala 2.11 by default; as far as I know, if you want to use it with Scala 2.10 you'll have to build it yourself, see http://spark.apache.org/docs/latest/building-spark.html#building-for-scala-210. So either use Scala 2.11, or use a Spark version compiled according to those instructions. – Tzach Zohar Oct 18 '16 at 11:23
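
For reference, a minimal build.sbt sketch that keeps the Spark artifacts aligned with the Spark 1.6.3 / Scala 2.10 runtime stated in the question (the knoldus artifact is left exactly as in the question):

scalaVersion := "2.10.5"

libraryDependencies ++= Seq(
  // Spark artifact versions match the 1.6.3 runtime; the _2.10 suffix matches the Scala version
  "org.apache.spark" % "spark-core_2.10"  % "1.6.3" % "provided",
  "org.apache.spark" % "spark-sql_2.10"   % "1.6.3" % "provided",
  "org.apache.spark" % "spark-mllib_2.10" % "1.6.3" % "provided",
  "knoldus"          % "k-means-pipeline" % "0.0.1"
)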

1 Answer


As far as I can see, your createDataFrame call is missing its second argument. The method signatures are described here: https://spark.apache.org/docs/1.6.1/api/scala/index.html#org.apache.spark.sql.SQLContext@createDataFrame(org.apache.spark.api.java.JavaRDD,%20java.lang.Class)

In your case it would be:

def createDataFrame[A <: Product](data: Seq[A])(implicit arg0: scala.reflect.api.JavaUniverse.TypeTag[A]): DataFrame

:: Experimental :: Creates a DataFrame from a local Seq of Product.

Alternatively, convert the Seq into a List/RDD and use the two-argument overload, as sketched below.
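
For illustration, here is a minimal sketch of the two-argument overload createDataFrame(rowRDD, schema) applied to the sample data from the question, assuming the same sc and sqlContext as above:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Build an RDD[Row] from the same sample data used in the question
val rows: RDD[Row] = sc.parallelize(Seq(
  Row("a@email.com", 12000, "M"),
  Row("b@email.com", 43000, "M"),
  Row("c@email.com", 5000, "F"),
  Row("d@email.com", 60000, "M")
))

// Describe the columns explicitly instead of relying on the implicit TypeTag
val schema = StructType(Seq(
  StructField("email", StringType, nullable = true),
  StructField("income", IntegerType, nullable = true),
  StructField("gender", StringType, nullable = true)
))

val df = sqlContext.createDataFrame(rows, schema)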

FaigB
  • not true at all - the second argument is implicit, so one shouldn't have to supply it explicitly. The code in question actually _works_ given the correct dependencies. – Tzach Zohar Oct 18 '16 at 11:09
  • Here are my dependencies: – Balkrushn Oct 18 '16 at 11:13
  • scalaVersion := "2.10.5" libraryDependencies ++= Seq( "org.apache.spark" % "spark-core_2.10" % "2.0.0" % "provided", "org.apache.spark" % "spark-sql_2.10" % "2.0.0" % "provided", "org.apache.spark" % "spark-mllib_2.10" % "2.0.0" % "provided", "knoldus" % "k-means-pipeline" % "0.0.1" ) – Balkrushn Oct 18 '16 at 11:14
  • Please add these to the question, not as a comment. – Tzach Zohar Oct 18 '16 at 11:21