
I have a problem with the Cloudera VM and Spark. First of all, I'm completely new to Spark, and my boss asked me to run Spark with Scala in a virtual machine for some tests.

I downloaded the virtual machine for the VirtualBox environment, then opened Eclipse and created a new Maven project. Obviously, I had previously started the Cloudera environment and all its services, such as Spark, YARN, Hive and so on. All services work fine, and all the checks in the Cloudera services panel are green. I did some tests with Impala and it works perfectly.

With the Eclipse and Scala/Maven environment, things get worse. This is my very simple Scala code:

package org.test.spark

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

object TestSelectAlgorithm {

  def main(args: Array[String]) = {
    val conf = new SparkConf()
      .setAppName("TestSelectAlgorithm")
      .setMaster("local")
    val sc = new SparkContext(conf)

    val sqlContext = new SQLContext(sc)

    sqlContext.sql("SELECT * FROM products").show()
  }
}

The test is very simple, because the table "products" exists: if I copy and paste the same query into Impala, the query works fine!

In the Eclipse environment, however, I get this problem:

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/06/30 05:43:17 INFO SparkContext: Running Spark version 1.6.0
16/06/30 05:43:18 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/06/30 05:43:18 WARN Utils: Your hostname, quickstart.cloudera resolves to a loopback address: 127.0.0.1; using 10.0.2.15 instead (on interface eth0)
16/06/30 05:43:18 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
16/06/30 05:43:18 INFO SecurityManager: Changing view acls to: cloudera
16/06/30 05:43:18 INFO SecurityManager: Changing modify acls to: cloudera
16/06/30 05:43:18 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(cloudera); users with modify permissions: Set(cloudera)
16/06/30 05:43:19 INFO Utils: Successfully started service 'sparkDriver' on port 53730.
16/06/30 05:43:19 INFO Slf4jLogger: Slf4jLogger started
16/06/30 05:43:19 INFO Remoting: Starting remoting
16/06/30 05:43:19 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@10.0.2.15:39288]
16/06/30 05:43:19 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 39288.
16/06/30 05:43:19 INFO SparkEnv: Registering MapOutputTracker
16/06/30 05:43:19 INFO SparkEnv: Registering BlockManagerMaster
16/06/30 05:43:19 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-7d685fc0-ea88-423a-9335-42ca12db85da
16/06/30 05:43:19 INFO MemoryStore: MemoryStore started with capacity 1619.3 MB
16/06/30 05:43:20 INFO SparkEnv: Registering OutputCommitCoordinator
16/06/30 05:43:20 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/06/30 05:43:20 INFO SparkUI: Started SparkUI at http://10.0.2.15:4040
16/06/30 05:43:20 INFO Executor: Starting executor ID driver on host localhost
16/06/30 05:43:20 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 57294.
16/06/30 05:43:20 INFO NettyBlockTransferService: Server created on 57294
16/06/30 05:43:20 INFO BlockManagerMaster: Trying to register BlockManager
16/06/30 05:43:20 INFO BlockManagerMasterEndpoint: Registering block manager localhost:57294 with 1619.3 MB RAM, BlockManagerId(driver, localhost, 57294)
16/06/30 05:43:20 INFO BlockManagerMaster: Registered BlockManager
Exception in thread "main" org.apache.spark.sql.AnalysisException: Table not found: products;
    at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.getTable(Analyzer.scala:306)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$9.applyOrElse(Analyzer.scala:315)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$9.applyOrElse(Analyzer.scala:310)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:56)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$1.apply(LogicalPlan.scala:54)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$1.apply(LogicalPlan.scala:54)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:265)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
    at scala.collection.AbstractIterator.to(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
    at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:305)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:54)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:310)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:300)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:83)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:80)
    at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
    at scala.collection.immutable.List.foldLeft(List.scala:84)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:80)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:72)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:72)
    at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:36)
    at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:36)
    at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
    at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:133)
    at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
    at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817)
    at org.test.spark.TestSelectAlgorithm$.main(TestSelectAlgorithm.scala:18)
    at org.test.spark.TestSelectAlgorithm.main(TestSelectAlgorithm.scala)
16/06/30 05:43:22 INFO SparkContext: Invoking stop() from shutdown hook
16/06/30 05:43:22 INFO SparkUI: Stopped Spark web UI at http://10.0.2.15:4040
16/06/30 05:43:22 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/06/30 05:43:22 INFO MemoryStore: MemoryStore cleared
16/06/30 05:43:22 INFO BlockManager: BlockManager stopped
16/06/30 05:43:22 INFO BlockManagerMaster: BlockManagerMaster stopped
16/06/30 05:43:22 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/06/30 05:43:22 INFO SparkContext: Successfully stopped SparkContext
16/06/30 05:43:22 INFO ShutdownHookManager: Shutdown hook called
16/06/30 05:43:22 INFO ShutdownHookManager: Deleting directory /tmp/spark-29d381e9-b5e7-485c-92f2-55dc57ca7d25

The main error, for me, is:

Exception in thread "main" org.apache.spark.sql.AnalysisException: Table not found: products;

I searched other sites and the documentation, and I found that the problem is connected with the Hive table... but I don't use a Hive table, I use Spark SQL...

Can anyone help me, please? Thank you for any reply.

  • Where does this `products` table exist? In a relational DB? Or are you trying to read a file from HDFS? – Ronak Patel Jun 30 '16 at 13:50
  • From HDFS: in fact, I execute the same query on http://quickstart.cloudera:8888/impala/execute/query/8#query/results ==> IMPALA, in the virtual machine, and it works perfectly. – Alessandro Jun 30 '16 at 13:59
  • You will need to use a DataFrame, or `create a schema > register temp table > run query`; this code will give you some hints (see the sketch after these comments). For the text-file format: https://gist.github.com/InvisibleTech/c71cb88b2390eb2223a8; for the JSON-file format: http://www.tutorialspoint.com/spark_sql/spark_sql_dataframes.htm – Ronak Patel Jun 30 '16 at 14:08
  • If you show me a sample file format, I can give you the right solution. – Ronak Patel Jun 30 '16 at 14:16
  • I would like to create a simple table on HDFS with the result of my query, so that I can run another query on it, and so on... an RDD with the result would also work; I would like to build a nested query... – Alessandro Jun 30 '16 at 14:20
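
A minimal sketch of the "create a schema > register temp table > run query" approach suggested in the comments above, assuming a hypothetical comma-separated text file with made-up path and column names (sc and sqlContext as in the question):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

// Hypothetical two-column file (name,price); the path is an assumption
val lines = sc.textFile("hdfs:///user/cloudera/products.txt")

val schema = StructType(Seq(
  StructField("name", StringType),
  StructField("price", DoubleType)))

val rows = lines.map(_.split(",")).map(f => Row(f(0), f(1).toDouble))
val df = sqlContext.createDataFrame(rows, schema)

// Once registered as a temp table, the DataFrame can be queried by name,
// and the result of one query can feed the next (nested queries included)
df.registerTempTable("products")
sqlContext.sql("SELECT name FROM products WHERE price > 10").show()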

2 Answers


In Spark, there is no direct support for Impala the way there is for Hive, so you have to load the file yourself. If it is CSV, you can use spark-csv:

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")      // first line of the file holds column names
  .option("inferSchema", "true") // infer column types from the data
  .load("your .csv file location")

// Register the DataFrame under a name so it can be queried with SQL
df.registerTempTable("products")

sqlContext.sql("select * from products").show()

The pom dependency for spark-csv:

<!-- https://mvnrepository.com/artifact/com.databricks/spark-csv_2.10 -->
<dependency>
    <groupId>com.databricks</groupId>
    <artifactId>spark-csv_2.10</artifactId>
    <version>1.4.0</version>
</dependency>

For Avro there is spark-avro:

val sqlContext = new SQLContext(sc)

// Needed for the .avro convenience method on DataFrameReader
import com.databricks.spark.avro._

val df = sqlContext.read.avro("your .avro file location")

df.registerTempTable("products")

val result = sqlContext.sql("select * from products")
result.show()

// Write the query result back out in Avro format
result.write
  .format("com.databricks.spark.avro")
  .save("Your output location")

The pom dependency for spark-avro:

<!-- http://mvnrepository.com/artifact/com.databricks/spark-avro_2.10 -->
<dependency>
    <groupId>com.databricks</groupId>
    <artifactId>spark-avro_2.10</artifactId>
    <version>2.0.1</version>
</dependency>

And for Parquet, Spark has built-in support:

val sqlContext = new SQLContext(sc)
val parquetFile = sqlContext.read.parquet("your parquet file location")

parquetFile.registerTempTable("products")

sqlContext.sql("select * from products").show()

– maogautam
  • OK, first of all: thank you very much! Now I have to understand what kind of path I can put into "your file location"! The table "products" is found now, so that problem is resolved!!! ;-) But, if you want to help me one more time, can you tell me where I can find the path for the Avro-format output? – Alessandro Jun 30 '16 at 14:48
  • In this code we did not save the file to any location; for that you have to do some extra work. Please check the updated code. – maogautam Jun 30 '16 at 14:57
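
A side note, since the question itself mentions Hive: a table that Impala can query lives in the Hive metastore, and in Spark 1.6 such metastore tables can be read directly through a HiveContext instead of a plain SQLContext (a plain SQLContext only sees its own registered temp tables, which is why "products" was not found). A minimal sketch, assuming hive-site.xml is on the classpath and the spark-hive_2.10 artifact is on the build path:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object TestHiveSelect {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TestHiveSelect").setMaster("local")
    val sc = new SparkContext(conf)

    // HiveContext resolves table names against the Hive metastore,
    // so tables created through Hive or Impala are visible to Spark SQL
    val hiveContext = new HiveContext(sc)
    hiveContext.sql("SELECT * FROM products").show()

    sc.stop()
  }
}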

Can you check whether the /user/cloudera/.sparkStaging/stagingArea location exists, and whether it contains a .avro file? Also, please replace "Your output location" with an actual directory location.
Please check the spark-avro GitHub page for more detail: https://github.com/databricks/spark-avro
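
For illustration, the save call from the first answer with an explicit directory in place of the placeholder (the HDFS path here is a made-up example):

// spark-avro writes the result as part files inside this directory;
// the path is hypothetical, any writable HDFS or local directory works
result.write
  .format("com.databricks.spark.avro")
  .save("hdfs:///user/cloudera/products_avro")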

– maogautam