
A question about Spark Structured Streaming integration with a Hive table.

I have been trying out some Spark Structured Streaming examples.

Here is my example:

 val spark = SparkSession.builder().appName("StatsAnalyzer")
     .enableHiveSupport()
     .config("hive.exec.dynamic.partition", "true")
     .config("hive.exec.dynamic.partition.mode", "nonstrict")
     .config("spark.sql.streaming.checkpointLocation", "hdfs://pp/apps/hive/warehouse/ab.db")
     .getOrCreate()

 // Register the dataframe as a Hive table

 val userSchema = new StructType().add("name", "string").add("age", "integer")
 val csvDF = spark.readStream.option("sep", ",").schema(userSchema).csv("file:///home/su/testdelta") 
 csvDF.createOrReplaceTempView("updates")
 val query = spark.sql("insert into table_abcd select * from updates")

 query.writeStream.start()

As you can see in the last step, while writing the DataFrame to the HDFS location, the data is not getting inserted into the existing directory (my existing directory holds some old data partitioned by "age").

I am getting:

org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start()

Can you help me understand why I am not able to insert data into the existing directory at the HDFS location? Or is there any other way I can do an "insert into" operation on a Hive table?

Looking for a solution

BigD
  • OK, my issue is not with readStream... how do I insert that data into an existing Hive table? I need to do an insert into operation – BigD Dec 28 '18 at 21:15
  • Yes, I am getting spark.sql.AnalysisException: queries with streaming source must be executed with writeStream.start() – BigD Dec 28 '18 at 21:24
  • My question is how to do transformations like JOIN? – BigD Dec 28 '18 at 21:30
  • I want to join stream data from Kafka or CSV with static data from Hive... writing everything to Hive first doesn't work, as I need to perform all operations in a streaming manner... – BigD Dec 28 '18 at 21:36
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/185895/discussion-between-bigd-and-cricket-007). – BigD Dec 28 '18 at 22:19
  • If you want to "watch" for file append operations, use something like Fluent Bit or Filebeat. Then you can stream the CSV data to a Spark socket server, or to Kafka. From there, you can write to HDFS/Hive (a minimal sketch of this approach follows below). – OneCricketeer Dec 31 '18 at 01:56
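
A minimal sketch of that last suggestion, assuming a log shipper forwards appended CSV lines to a socket that Spark reads as a stream and writes back out to HDFS; the host, port, and HDFS paths below are placeholders, not values from this thread:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.split

val spark = SparkSession.builder().appName("SocketToHdfs").getOrCreate()
import spark.implicits._

// the socket source yields one string column named "value" per received line
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")   // placeholder host
  .option("port", "9999")        // placeholder port
  .load()

// split each CSV line into the two columns used in the question
val parsed = lines.select(
  split($"value", ",")(0).as("name"),
  split($"value", ",")(1).cast("int").as("age"))

parsed.writeStream
  .format("parquet")
  .option("path", "hdfs:///tmp/stream_out")                // placeholder output dir
  .option("checkpointLocation", "hdfs:///tmp/stream_chk")  // placeholder checkpoint dir
  .start()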

3 Answers


Spark Structured Streaming does not support writing the result of a streaming query to a Hive table.

scala> println(spark.version)
2.4.0

val sq = spark.readStream.format("rate").load
scala> :type sq
org.apache.spark.sql.DataFrame

scala> assert(sq.isStreaming)

scala> sq.writeStream.format("hive").start
org.apache.spark.sql.AnalysisException: Hive data source can only be used with tables, you can not write files of Hive data source directly.;
  at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:246)
  ... 49 elided

If a target system (aka sink) is not supported, you could use the foreach and foreachBatch operations (highlighting mine):

The foreach and foreachBatch operations allow you to apply arbitrary operations and writing logic on the output of a streaming query. They have slightly different use cases - while foreach allows custom write logic on every row, foreachBatch allows arbitrary operations and custom logic on the output of each micro-batch.

I think foreachBatch is your best bet.

import org.apache.spark.sql.DataFrame
sq.writeStream.foreachBatch { case (ds: DataFrame, batchId: Long) =>
  // do whatever you want with your input DataFrame
  // incl. writing to Hive
  // I simply decided to print out the rows to the console
  ds.show
}.start
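
Since foreachBatch hands you each micro-batch as a regular (non-streaming) DataFrame, an ordinary batch write can go where the ds.show is. Here is a minimal sketch, assuming the target table table_abcd from the question already exists with a matching schema and the SparkSession was created with enableHiveSupport():

import org.apache.spark.sql.DataFrame

// minimal sketch: append each micro-batch to an existing Hive table
// (table_abcd is the table name from the question; adjust the name and save mode as needed)
val writeBatchToHive = (batchDF: DataFrame, batchId: Long) => {
  batchDF.write
    .mode("append")
    .insertInto("table_abcd")
}

sq.writeStream.foreachBatch(writeBatchToHive).start()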

There is also the Apache Hive Warehouse Connector, which I have never worked with, but it seems it may be of some help.

Jacek Laskowski
  • Hello... one more query. Is it possible to stream a file, let's say a CSV file, with Spark streaming? If I append a line to that file, it should be captured by Spark streaming. Is that possible? Adding files to a directory and streaming them is possible, but is streaming a single file possible in Spark? – BigD Dec 30 '18 at 22:08
  • @BigD No, it's not possible to append lines to a CSV file and "catch" the updates. Only new files are going to be processed. See the docs at http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#input-sources – Jacek Laskowski Dec 30 '18 at 22:10
  • Hello, I have tried this but it throws an error. Can you please help? val csvDF = spark.readStream.option("sep", ",").schema(userSchema).csv("file:///home/sas/testdelta") followed by csvDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) => batchDF.show }.start() gives :56: error: value foreachBatch is not a member of org.apache.spark.sql.streaming.DataStreamWriter[org.apache.spark.sql.Row] – BigD Jan 02 '19 at 17:18
  • @BigD What Spark version do you use? `foreachBatch` is available as of 2.4.0. Use `foreach` instead (a minimal ForeachWriter sketch follows after these comments). – Jacek Laskowski Jan 02 '19 at 17:38
  • I am using Spark 2.3.0... I tried both. In my example csvDF is a DataFrame; how can I insert that into Hive using foreach? That is my issue. – BigD Jan 02 '19 at 18:51
  • @BigD Create a repo on GitHub and share the link with the code you have so far. Ping me offline for a code review. – Jacek Laskowski Jan 03 '19 at 12:16
  • https://stackoverflow.com/questions/54021043/how-to-use-foreachrdd-in-spark-streaming-for-csv-files/54021500#54021500 – BigD Jan 03 '19 at 13:49
  • We're talking about Spark Structured Streaming here, while you're using Spark Streaming in the link above. They are hugely different. – Jacek Laskowski Jan 03 '19 at 15:03
  • I am really not aware of how Spark Structured Streaming ingests data into Hive; that is the reason I tried Spark Streaming, and there I am failing as well. Would you be able to give some working examples of ingesting data into Hive? – BigD Jan 03 '19 at 15:06
  • I am not able to write results to an HDFS location after aggregating data in Spark Structured Streaming. Can you please tell me how to use the foreach method to achieve my goal? – BigD Jan 11 '19 at 17:46
  • Use gist.github.com to post the entire code and show the inputs. – Jacek Laskowski Jan 11 '19 at 20:03
  • @JacekLaskowski your code will never compile in Spark 2.4.0. Did you ever test it? – Viacheslav Rodionov Jan 25 '19 at 13:38
  • Hi, is it possible to use checkpointing with foreachBatch? Let's say that something fails; can I recover the data? – Ron F Apr 17 '19 at 06:13
  • @RonF Can you ask a separate question so it gets the proper attention of the Spark community? Merci. – Jacek Laskowski Apr 17 '19 at 13:56
  • Hi @JacekLaskowski, could you please give any valid examples for the foreachBatch sink? I have a use case to write some raw data to Hive tables in real time (Structured Streaming with a Kafka source). – Albin Chandy Oct 26 '21 at 15:20
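
For reference, a minimal sketch of the pre-2.4 foreach alternative mentioned in these comments; the ForeachWriter below just prints each row, and actually pushing rows into Hive from here would need something like a JDBC connection to HiveServer2 opened in open(), which is not shown:

import org.apache.spark.sql.{ForeachWriter, Row}

// skeleton ForeachWriter: process() is called once per row of every micro-batch
val rowWriter = new ForeachWriter[Row] {
  def open(partitionId: Long, version: Long): Boolean = true  // return false to skip this partition
  def process(value: Row): Unit = println(value)              // replace with real per-row write logic
  def close(errorOccurred: Throwable): Unit = ()
}

csvDF.writeStream.foreach(rowWriter).start()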

On HDP 3.1 with Spark 2.3.2 and Hive 3.1.0, we have used Hortonworks' spark-llap library to write a Structured Streaming DataFrame from Spark to Hive. On GitHub you will find some documentation on its usage.

The required library hive-warehouse-connector-assembly-1.0.0.3.1.0.0-78.jar is available on Maven and needs to be passed along in the spark-submit command. There are many more recent versions of that library, although I haven't had the chance to test them.

After creating the Hive table manually (e.g. through beeline or the Hive shell), you could apply the following code:

import com.hortonworks.hwc.HiveWarehouseSession

val csvDF = spark.readStream.[...].load()

val query = csvDF.writeStream
  .format(HiveWarehouseSession.STREAM_TO_STREAM)
  .option("database", "database_name")
  .option("table", "table_name")
  .option("metastoreUri", spark.conf.get("spark.datasource.hive.warehouse.metastoreUri"))
  .option("checkpointLocation", "/path/to/checkpoint/dir")
  .start()

query.awaitTermination()
Michael Heil

Just in case someone has actually tried the code from Jacek Laskowski: they will know that it does not really compile in Spark 2.4.0 (check my gist, tested on AWS EMR 5.20.0 and vanilla Spark). So I guess that was his idea of how it should work in some future Spark version. The code that really works is:

scala> import org.apache.spark.sql.Dataset
import org.apache.spark.sql.Dataset

scala> sq.writeStream.foreachBatch((batchDs: Dataset[_], batchId: Long) => batchDs.show).start
res0: org.apache.spark.sql.streaming.StreamingQuery = 
org.apache.spark.sql.execution.streaming.StreamingQueryWrapper@5ebc0bf5
Viacheslav Rodionov