I have a VMware Cloudera image (CDH 5.7) running CentOS 6.8. I am using OS X as my development machine and the CDH image to run the code.
UPDATE
This is the build.sbt that I am currently using; I have just switched the Spark version from the official release (1.6.1) to the Cloudera build (1.6.0-cdh5.7.0):
[cloudera@quickstart awesome-recommendation-engine]$ cat build.sbt
name := "my-recommendation-spark-engine"
version := "1.0-SNAPSHOT"
scalaVersion := "2.10.4"
val sparkVersion = "1.6.0-cdh5.7.0"
val akkaVersion = "2.3.11" // override Akka to be this version to match the one in Spark
libraryDependencies ++= Seq(
  "org.apache.kafka" % "kafka_2.10" % "0.8.1"
    exclude("javax.jms", "jms")
    exclude("com.sun.jdmk", "jmxtools")
    exclude("com.sun.jmx", "jmxri"),
  // HTTP client to request data from Amazon
  "net.databinder.dispatch" %% "dispatch-core" % "0.11.1",
  // HTML parser
  "org.jodd" % "jodd-lagarto" % "3.5.2",
  "com.typesafe" % "config" % "1.2.1",
  "com.typesafe.play" % "play-json_2.10" % "2.4.0-M2",
  "org.scalatest" % "scalatest_2.10" % "2.2.1" % "test",
  "org.twitter4j" % "twitter4j-core" % "4.0.2",
  "org.twitter4j" % "twitter4j-stream" % "4.0.2",
  "org.codehaus.jackson" % "jackson-core-asl" % "1.6.1",
  "org.scala-tools.testing" % "specs_2.8.0" % "1.6.5" % "test",
  // Spark (Cloudera build, resolved from the Cloudera repository below)
  "org.apache.spark" % "spark-streaming-kafka_2.10" % sparkVersion,
  "org.apache.spark" % "spark-core_2.10" % sparkVersion,
  "org.apache.spark" % "spark-streaming_2.10" % sparkVersion,
  "org.apache.spark" % "spark-sql_2.10" % sparkVersion,
  "org.apache.spark" % "spark-mllib_2.10" % sparkVersion,
  "com.google.code.gson" % "gson" % "2.6.2",
  "commons-cli" % "commons-cli" % "1.3.1",
  "com.stratio.datasource" % "spark-mongodb_2.10" % "0.11.1",
  // Akka
  "com.typesafe.akka" %% "akka-actor" % akkaVersion,
  "com.typesafe.akka" %% "akka-slf4j" % akkaVersion,
  // MongoDB
  "org.reactivemongo" %% "reactivemongo" % "0.10.0"
)
packAutoSettings
resolvers ++= Seq(
"JBoss Repository" at "http://repository.jboss.org/nexus/content/repositories/releases/",
"Spray Repository" at "http://repo.spray.cc/",
"Cloudera Repository" at "https://repository.cloudera.com/artifactory/cloudera-repos/",
"Akka Repository" at "http://repo.akka.io/releases/",
"Twitter4J Repository" at "http://twitter4j.org/maven2/",
"Apache HBase" at "https://repository.apache.org/content/repositories/releases",
"Twitter Maven Repo" at "http://maven.twttr.com/",
"scala-tools" at "https://oss.sonatype.org/content/groups/scala-tools",
"Typesafe repository" at "http://repo.typesafe.com/typesafe/releases/",
"Second Typesafe repo" at "http://repo.typesafe.com/typesafe/maven-releases/",
"Mesosphere Public Repository" at "http://downloads.mesosphere.io/maven",
Resolver.sonatypeRepo("public")
)
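A side note on the build: when the packed application runs on the CDH image itself, the Spark artifacts are often marked "provided" so that sbt-pack does not bundle a second copy of Spark next to the one CDH already ships. A minimal sketch, reusing the sparkVersion val from above (I have not confirmed whether this changes anything here):

// Sketch: rely on the cluster's own Spark jars instead of bundling them.
libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-core_2.10"      % sparkVersion % "provided",
  "org.apache.spark" % "spark-streaming_2.10" % sparkVersion % "provided",
  "org.apache.spark" % "spark-sql_2.10"       % sparkVersion % "provided",
  "org.apache.spark" % "spark-mllib_2.10"     % sparkVersion % "provided"
)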
This is my /etc/hosts file on the CDH image, with a line like this:
127.0.0.1 quickstart.cloudera quickstart localhost localhost.domain
The Cloudera version that I am running is:
[cloudera@quickstart bin]$ cat /usr/lib/hadoop/cloudera/cdh_version.properties
# Autogenerated build properties
version=2.6.0-cdh5.7.0
git.hash=c00978c67b0d3fe9f3b896b5030741bd40bf541a
cloudera.hash=c00978c67b0d3fe9f3b896b5030741bd40bf541a
cloudera.cdh.hash=e7465a27c5da4ceee397421b89e924e67bc3cbe1
cloudera.cdh-packaging.hash=8f9a1632ebfb9da946f7d8a3a8cf86efcdccec76
cloudera.base-branch=cdh5-base-2.6.0
cloudera.build-branch=cdh5-2.6.0_5.7.0
cloudera.pkg.version=2.6.0+cdh5.7.0+1280
cloudera.pkg.release=1.cdh5.7.0.p0.92
cloudera.cdh.release=cdh5.7.0
cloudera.build.time=2016.03.23-18:30:29GMT
I can run an ls command in the VMware machine:
[cloudera@quickstart ~]$ hdfs dfs -ls /user/cloudera/ratings.csv
-rw-r--r-- 1 cloudera cloudera 16906296 2016-05-30 11:29 /user/cloudera/ratings.csv
I can read its content:
[cloudera@quickstart ~]$ hdfs dfs -cat /user/cloudera/ratings.csv | wc -l
568454
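For what it is worth, the same file can also be checked from plain Scala through the Hadoop client API, which takes Spark out of the picture entirely (a sketch, assuming the same NameNode address used in the code below):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch: verify that the NameNode at 192.168.30.139:8020 is reachable and
// that the file is visible, independently of Spark.
val hadoopConf = new Configuration()
hadoopConf.set("fs.defaultFS", "hdfs://192.168.30.139:8020")
val fs = FileSystem.get(hadoopConf)
println(fs.getFileStatus(new Path("/user/cloudera/ratings.csv")).getLen) // expect 16906296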
The code is quite simple; it just maps the file's contents:
val ratingFile = "hdfs://192.168.30.139:8020/user/cloudera/ratings.csv"
// where 192.168.30.139 is the IP assigned to eth0 of the Cloudera image
case class AmazonRating(userId: String, productId: String, rating: Double)
val NumRecommendations = 10
val MinRecommendationsPerUser = 10
val MaxRecommendationsPerUser = 20
val MyUsername = "myself"
val NumPartitions = 20
println("Using this ratingFile: " + ratingFile)
// first create an RDD out of the rating file
val rawTrainingRatings = sc.textFile(ratingFile).map { line =>
  val Array(userId, productId, scoreStr) = line.split(",")
  AmazonRating(userId, productId, scoreStr.toDouble)
}
// only keep users that have rated between MinRecommendationsPerUser and MaxRecommendationsPerUser products
val trainingRatings = rawTrainingRatings
  .groupBy(_.userId)
  .filter(r => MinRecommendationsPerUser <= r._2.size && r._2.size < MaxRecommendationsPerUser)
  .flatMap(_._2)
  .repartition(NumPartitions)
  .cache()
println(s"Parsed $ratingFile. Kept ${trainingRatings.count()} ratings out of ${rawTrainingRatings.count()}")
I am getting this message:
**06/01/2016 17:20:04 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources**
UPDATE
06/02/2016: I have increased the memory (8 GB) and available cores (4) of the VMware image, but the same warning as above still appears. The file that I am trying to load from HDFS is only 16 MB, so it cannot be a matter of available resources!
If I update the /etc/hosts file with this line:
192.168.30.139 quickstart.cloudera quickstart localhost localhost.domain
instead of
[cloudera@quickstart bin]$ cat /etc/hosts
127.0.0.1 quickstart.cloudera quickstart localhost localhost.domain
where 192.168.30.139 is the actual assigned IP, I get this exception:
Caused by: org.apache.spark.SparkException: Failed to get broadcast_1_piece0 of broadcast_1
However, if I run the exact same code within spark-shell, I get this message:
Parsed hdfs://192.168.30.139:8020/user/cloudera/ratings.csv. Kept 73279 ratings out of 568454
Why does it work fine within spark-shell but not when run programmatically in the VMware image?
UPDATE
I am running the code using the sbt-pack plugin to generate Unix commands, and I run them within the VMware image, which hosts the Spark pseudo-cluster.
This is the code I use to instantiate the SparkConf:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sparkConf = new SparkConf()
  .setAppName("AmazonKafkaConnector")
  .setMaster("spark://192.168.30.139:7077")
  .set("spark.driver.allowMultipleContexts", "true")
val sc = new SparkContext(sparkConf)
val sqlContext = new SQLContext(sc)
// reuse the existing SparkContext; passing sparkConf here would try to create a second one
val ssc = new StreamingContext(sc, Seconds(2))
// this checkpoint dir should come from a conf file; for now it is hardcoded!
val streamingCheckpointDir = "/home/cloudera/my-recommendation-spark-engine/checkpoint"
ssc.checkpoint(streamingCheckpointDir)
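For completeness, these are the standard standalone-mode properties for capping what an application asks the master for; a sketch with illustrative values only (cappedConf is just a name for the example), and note that such properties must be set before the SparkContext is created:

// Sketch: cap the resource request below what the single worker
// advertises (4 cores, 6.7 GB). The exact values are illustrative.
val cappedConf = new SparkConf()
  .setAppName("AmazonKafkaConnector")
  .setMaster("spark://192.168.30.139:7077")
  .set("spark.executor.memory", "1g")
  .set("spark.cores.max", "2")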
I think this must be a misconfiguration in a Cloudera configuration file, but which one?
UPDATE2 06/01/2016
OK, using the IP (192.168.30.139) instead of the fully qualified name (quickstart.cloudera) eliminates the previous exception, but now this warning arises:
**16/06/01 17:20:04 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources**
If I run the following commands:
[cloudera@quickstart awesome-recommendation-engine]$ sudo service spark-master status
Spark master is running [ OK ]
[cloudera@quickstart awesome-recommendation-engine]$ sudo service spark-worker status
Spark worker is running [ OK ]
I can see that spark-master and spark-worker are running, but when I check 192.168.30.139:18081, the web page that shows the Spark worker status, I see:
URL: spark://192.168.30.139:7077
REST URL: spark://192.168.30.139:6066 (cluster mode)
Alive Workers: 1
Cores in use: 4 Total, 0 Used
Memory in use: 6.7 GB Total, 0.0 B Used
Applications: 0 Running, 4 Completed
Drivers: 0 Running, 0 Completed
Status: ALIVE
Workers
Worker Id Address State Cores Memory
worker-20160602181029-192.168.30.139-7078
I don't know what else to do; I have given the VMware image as many resources as I can, and the same error happens...
16/06/02 18:32:23 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
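For reference, a minimal job like the following should be enough to tell whether the master/worker wiring itself works (a sketch; ClusterSmokeTest is an illustrative name, and the deliberately tiny resource request is an assumption, not something confirmed to fix the problem):

import org.apache.spark.{SparkConf, SparkContext}

// Sketch of a minimal job: if this count succeeds, master/worker wiring and
// resource allocation are fine, and the problem lies elsewhere in the app.
object ClusterSmokeTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("ClusterSmokeTest")
      .setMaster("spark://192.168.30.139:7077")
      .set("spark.executor.memory", "512m") // deliberately small request
      .set("spark.cores.max", "1")
    val sc = new SparkContext(conf)
    println(sc.parallelize(1 to 100).count()) // should print 100
    sc.stop()
  }
}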
Thank you very much for reading this far.