
I have a VMware Cloudera image (cdh-5.7) running CentOS 6.8. I am using OS X as my development machine and the CDH image to run the code.

UPDATE

This is the build.sbt I am currently using; I have just updated the Spark version from the official 1.6.1 to 1.6.0-cdh5.7.0:

[cloudera@quickstart awesome-recommendation-engine]$ cat build.sbt 
name := "my-recommendation-spark-engine"

version := "1.0-SNAPSHOT"

scalaVersion := "2.10.4"

val sparkVersion = "1.6.0-cdh5.7.0"

val akkaVersion = "2.3.11" // override Akka to be this version to match the one in Spark

libraryDependencies ++= Seq(
  "org.apache.kafka" % "kafka_2.10" % "0.8.1"
      exclude("javax.jms", "jms")
      exclude("com.sun.jdmk", "jmxtools")
      exclude("com.sun.jmx", "jmxri"),
   // HTTP client to request data to Amazon
   "net.databinder.dispatch" %% "dispatch-core" % "0.11.1",
   // HTML parser
   "org.jodd" % "jodd-lagarto" % "3.5.2",
   "com.typesafe" % "config" % "1.2.1",
   "com.typesafe.play" % "play-json_2.10" % "2.4.0-M2",
   "org.scalatest" % "scalatest_2.10" % "2.2.1" % "test",
   "org.twitter4j" % "twitter4j-core" % "4.0.2",
   "org.twitter4j" % "twitter4j-stream" % "4.0.2",
   "org.codehaus.jackson" % "jackson-core-asl" % "1.6.1",
   "org.scala-tools.testing" % "specs_2.8.0" % "1.6.5" % "test",
   "org.apache.spark" % "spark-streaming-kafka_2.10" % "1.6.0-cdh5.7.0",
   "org.apache.spark" % "spark-core_2.10" % "1.6.0-cdh5.7.0",
   "org.apache.spark" % "spark-streaming_2.10" % "1.6.0-cdh5.7.0",
   "org.apache.spark" % "spark-sql_2.10" % "1.6.0-cdh5.7.0",
   "org.apache.spark" % "spark-mllib_2.10" % "1.6.0-cdh5.7.0",
   "com.google.code.gson" % "gson" % "2.6.2",
   "commons-cli" % "commons-cli" % "1.3.1",
   "com.stratio.datasource" % "spark-mongodb_2.10" % "0.11.1",
   // Akka
   "com.typesafe.akka" %% "akka-actor" % akkaVersion,
   "com.typesafe.akka" %% "akka-slf4j" % akkaVersion,
   // MongoDB
   "org.reactivemongo" %% "reactivemongo" % "0.10.0"
)

packAutoSettings

resolvers ++= Seq(
  "JBoss Repository" at "http://repository.jboss.org/nexus/content/repositories/releases/",
  "Spray Repository" at "http://repo.spray.cc/",
  "Cloudera Repository" at "https://repository.cloudera.com/artifactory/cloudera-repos/",
  "Akka Repository" at "http://repo.akka.io/releases/",
  "Twitter4J Repository" at "http://twitter4j.org/maven2/",
  "Apache HBase" at "https://repository.apache.org/content/repositories/releases",
  "Twitter Maven Repo" at "http://maven.twttr.com/",
  "scala-tools" at "https://oss.sonatype.org/content/groups/scala-tools",
  "Typesafe repository" at "http://repo.typesafe.com/typesafe/releases/",
  "Second Typesafe repo" at "http://repo.typesafe.com/typesafe/maven-releases/",
  "Mesosphere Public Repository" at "http://downloads.mesosphere.io/maven",
  Resolver.sonatypeRepo("public")
)
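
A side note on the dependencies, which I have not verified on this image: when a job is submitted to the cluster through spark-submit, the Spark artifacts are usually marked "provided" so the packed classpath does not shadow the cluster's own Spark classes, e.g.:

"org.apache.spark" % "spark-core_2.10" % sparkVersion % "provided"

Since I launch the sbt-pack scripts directly rather than through spark-submit, I have left them in compile scope for now.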

This is the relevant line of the /etc/hosts file in the CDH image:

127.0.0.1       quickstart.cloudera     quickstart      localhost       localhost.domain

The Cloudera version I am running is:

[cloudera@quickstart bin]$ cat /usr/lib/hadoop/cloudera/cdh_version.properties

# Autogenerated build properties
version=2.6.0-cdh5.7.0
git.hash=c00978c67b0d3fe9f3b896b5030741bd40bf541a
cloudera.hash=c00978c67b0d3fe9f3b896b5030741bd40bf541a
cloudera.cdh.hash=e7465a27c5da4ceee397421b89e924e67bc3cbe1
cloudera.cdh-packaging.hash=8f9a1632ebfb9da946f7d8a3a8cf86efcdccec76
cloudera.base-branch=cdh5-base-2.6.0
cloudera.build-branch=cdh5-2.6.0_5.7.0
cloudera.pkg.version=2.6.0+cdh5.7.0+1280
cloudera.pkg.release=1.cdh5.7.0.p0.92
cloudera.cdh.release=cdh5.7.0
cloudera.build.time=2016.03.23-18:30:29GMT

I can run an ls command in the VMware machine:

[cloudera@quickstart ~]$ hdfs dfs -ls /user/cloudera/ratings.csv
-rw-r--r-- 1 cloudera cloudera 16906296 2016-05-30 11:29 /user/cloudera/ratings.csv

I can read its content:

[cloudera@quickstart ~]$ hdfs dfs -cat /user/cloudera/ratings.csv | wc -l
568454

The code is quite simple; it just maps the file's content:

val ratingFile="hdfs://192.168.30.139:8020/user/cloudera/ratings.csv"
//where 192.168.30.139 is the eth0 assigned ip of cloudera image
case class AmazonRating(userId: String, productId: String, rating: Double)

val NumRecommendations = 10
val MinRecommendationsPerUser = 10
val MaxRecommendationsPerUser = 20
val MyUsername = "myself"
val NumPartitions = 20

println("Using this ratingFile: " + ratingFile)
// first create an RDD out of the rating file
val rawTrainingRatings = sc.textFile(ratingFile).map { line =>
  val Array(userId, productId, scoreStr) = line.split(",")
  AmazonRating(userId, productId, scoreStr.toDouble)
}

// only keep users that have rated between MinRecommendationsPerUser and MaxRecommendationsPerUser products
val trainingRatings = rawTrainingRatings
  .groupBy(_.userId)
  .filter(r => MinRecommendationsPerUser <= r._2.size && r._2.size < MaxRecommendationsPerUser)
  .flatMap(_._2)
  .repartition(NumPartitions)
  .cache()

println(s"Parsed $ratingFile. Kept ${trainingRatings.count()} ratings out of ${rawTrainingRatings.count()}")

When I run the original code above, I am getting this message:

**06/01/2016 17:20:04 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources**

UPDATE

06/02/2016: I have increased the memory (8 GB) and the available cores (4) of the VMware image, but the same warning as above still happens. The file I am trying to load from HDFS is only 16 MB, so it cannot be a matter of available resources!

If I update the /etc/hosts file with this line:

192.168.30.139 quickstart.cloudera  quickstart  localhost   localhost.domain

instead of 

[cloudera@quickstart bin]$ cat /etc/hosts
127.0.0.1   quickstart.cloudera quickstart  localhost   localhost.domain 

where 192.168.30.139 is the actually assigned IP, I get this exception:

Caused by: org.apache.spark.SparkException: Failed to get broadcast_1_piece0 of broadcast_1

Meanwhile, if I run the exact same code within spark-shell, it works and prints this message:

Parsed hdfs://192.168.30.139:8020/user/cloudera/ratings.csv. Kept 73279 ratings out of 568454

Why does it work fine within spark-shell but not when run programmatically in the VMware image?
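
One thing I have not ruled out (this is a guess, not verified): spark-shell on this image might be using a different master (YARN or local) than the spark://...:7077 standalone master my program sets, which would explain the difference. Pointing spark-shell at the same master should show whether the standalone master/worker setup itself is the problem:

[cloudera@quickstart ~]$ spark-shell --master spark://192.168.30.139:7077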

UPDATE

I am running the code using the sbt-pack plugin to generate Unix launch scripts and running them within the VMware image, which hosts the Spark pseudo-cluster.

This is the code I use to instantiate the SparkConf:

val sparkConf = new SparkConf()
  .setAppName("AmazonKafkaConnector")
  .setMaster("spark://192.168.30.139:7077")
  .set("spark.driver.allowMultipleContexts", "true")
val sc = new SparkContext(sparkConf)
val sqlContext = new SQLContext(sc)
val ssc = new StreamingContext(sparkConf, Seconds(2))
// this checkpoint dir should be in a conf file; for now it is hardcoded!
val streamingCheckpointDir = "/home/cloudera/my-recommendation-spark-engine/checkpoint"
ssc.checkpoint(streamingCheckpointDir)
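
While preparing this question I noticed that `new StreamingContext(sparkConf, Seconds(2))` spins up a second SparkContext next to `sc`, which is the only reason `spark.driver.allowMultipleContexts` is needed at all. A minimal sketch of how I understand the contexts could be wired instead, reusing the single SparkContext and requesting executor resources explicitly (the memory and core numbers are assumptions, chosen to stay under the worker's 6.7 GB and 4 cores):

val sparkConf = new SparkConf()
  .setAppName("AmazonKafkaConnector")
  .setMaster("spark://192.168.30.139:7077")
  .set("spark.executor.memory", "1g") // assumed value, well under the worker's 6.7 GB
  .set("spark.cores.max", "2")        // assumed value, under the 4 available cores
val sc = new SparkContext(sparkConf)
val sqlContext = new SQLContext(sc)
// reuse sc instead of building a second context from the same conf
val ssc = new StreamingContext(sc, Seconds(2))
ssc.checkpoint("/home/cloudera/my-recommendation-spark-engine/checkpoint")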

I think this must be a misconfiguration in some Cloudera configuration file, but which one?

UPDATE2 06/01/2016

OK, using the IP (192.168.30.139) instead of the fully qualified name (quickstart.cloudera) eliminates the previous exception, but now this warning arises:

**16/06/01 17:20:04 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources**

If I run the following commands:

[cloudera@quickstart awesome-recommendation-engine]$ sudo service spark-master status
Spark master is running                                    [  OK  ]
[cloudera@quickstart awesome-recommendation-engine]$ sudo service spark-worker status
Spark worker is running                                    [  OK  ]

I can see that spark-master and spark-worker are running, but when I check 192.168.30.139:18081 (the web page that shows the Spark worker status), I see:

URL: spark://192.168.30.139:7077
REST URL: spark://192.168.30.139:6066 (cluster mode)
Alive Workers: 1
Cores in use: 4 Total, 0 Used
Memory in use: 6.7 GB Total, 0.0 B Used
Applications: 0 Running, 4 Completed
Drivers: 0 Running, 0 Completed
Status: ALIVE
Workers

Worker Id   Address State   Cores   Memory
worker-20160602181029-192.168.30.139-7078

I don't know what else to do; I have given the VMware image as many resources as I can, and the same warning keeps happening...

16/06/02 18:32:23 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
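
For completeness, if the job were launched through spark-submit instead of the sbt-pack scripts, the resources could also be capped at submit time. A hedged example (the class and jar names here are placeholders, not the real ones from my project):

[cloudera@quickstart ~]$ spark-submit --master spark://192.168.30.139:7077 \
    --executor-memory 1g --total-executor-cores 2 \
    --class example.AmazonKafkaConnector target/my-recommendation-spark-engine.jar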

Thank you very much for reading this far.

  • 1) How is this question related to Apache Spark? 2) You can't read a file from HDFS using Scala's Source object. You have to use [Hadoop's API](https://wiki.apache.org/hadoop/HadoopDfsReadWriteExample) for this. Also, you can read it into a Spark RDD like this: `sparkContext.textFile("hdfs:////quickstart.cloudera:8020/ratings.csv")` – Vitalii Kotliarenko May 30 '16 at 12:04
  • Hi @Vitaliy, thanks for the answer; you have to forgive me. This question is related to Spark because I want to do exactly that task: load a file from an HDFS living in a pseudo-distributed Cloudera VMware image. But I am still receiving this exception: Exception in thread "main" java.io.IOException: Incomplete HDFS URI, no host: hdfs:/quickstart.cloudera:8020/ratings.csv. It looks like the Spark streaming process located in the host machine is not able to talk to the VMware image. – aironman May 30 '16 at 13:49
  • It looks like some mess with the slashes in the URL to me (not a connectivity issue). I just tried it like this: `sc.textFile("hdfs://host/tmp/test.txt")`. Try to leave only 2 slashes after `hdfs:` – Vitalii Kotliarenko May 30 '16 at 15:44
  • Thank you @VitaliyKotlyarenko, it looks like I am loading the file with no exception, but I cannot do any map operations on it. Please look at the updated thread. – aironman May 30 '16 at 17:24
  • How do you create the Spark context programmatically? – Raphael Roth May 31 '16 at 06:56
  • Hi @RaphaelRoth, I have updated the thread with the requested info. Thank you for the answer. – aironman May 31 '16 at 09:43
  • Hi @RaphaelRoth, I have added the build.sbt that I am currently using. – aironman Jun 01 '16 at 09:43
  • Hi @Vitaliy Kotlyarenko, I have added the build.sbt that I am currently using. – aironman Jun 01 '16 at 09:43
