3

I want to show the data from HDInsight SPARK using tableau. I was following this video where they have described how to connect the two systems and expose the data.

currently my script itself is very simple as shown below:

 /* csvFile is an RDD of lists, each list representing a line in the CSV file */
val csvLines = sc.textFile("wasb://mycontainer@mysparkstorage.blob.core.windows.net/*/*/*/mydata__000000.csv")

// Define a schema
case class MyData(Timestamp: String, TimezoneOffset: String, SystemGuid: String, TagName: String, NumericValue: Double, StringValue: String)

// Map the values in the .csv file to the schema
val myData = csvLines.map(s => s.split(",")).filter(s => s(0) != "Timestamp").map(
    s => MyData(s(0),
            s(1),
            s(2),
            s(3),
            s(4).toDouble,
            s(5)         
    )
).toDF()
// Register as a temporary table called "processdata"
myData.registerTempTable("test_table")
myData.saveAsTable("test_table") 

unfortunately I run in to the following error

warning: there were 1 deprecation warning(s); re-run with -deprecation for details
org.apache.spark.sql.AnalysisException: Table `test_table` already exists.;
    at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:209)
    at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:198)

i have also tried to use the following code to overwrite the table if it exists

   import org.apache.spark.sql.SaveMode
    myData.saveAsTable("test_table", SaveMode.Overwrite) 

but still it gives me same error.

warning: there were 1 deprecation warning(s); re-run with -deprecation for details
java.lang.RuntimeException: Tables created with SQLContext must be TEMPORARY. Use a HiveContext instead.
    at scala.sys.package$.error(package.scala:27)
    at org.apache.spark.sql.execution.SparkStrategies$DDLStrategy$.apply(SparkStrategies.scala:416)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)

Can someone please help me fix this issue?

Kiran
  • 2,997
  • 6
  • 31
  • 62

1 Answers1

5

I know it was my mistake, but i'll leave it as an answer as it was not readily available in any of the blogs or forum answers. hopefully it will help someone like me starting with Spark

I figured out that .toDF() actually creates the sqlContext and not the hiveContext based DataFrame. so I have now updated my code as below

// Map the values in the .csv file to the schema
val myData = csvLines.map(s => s.split(",")).filter(s => s(0) != "Timestamp").map(
    s => MyData(s(0),
            s(1),
            s(2),
            s(3),
            s(4).toDouble,
            s(5)         
    )
)
// Register as a temporary table called "myData"
val myDataFrame = hiveContext.createDataFrame(myData)
myDataFrame.registerTempTable("mydata_stored")
myDataFrame.write.mode(SaveMode.Overwrite).saveAsTable("mydata_stored")

also make sure that the s(4) has proper double value else add try/catch to handle it. i did something like this:

def parseDouble(s: String): Double = try { s.toDouble } catch { case _ => 0.00 }
parseDouble(s(4))

Regards Kiran

Kiran
  • 2,997
  • 6
  • 31
  • 62