
I have a custom reader for Spark Streaming that reads data from a WebSocket. Now I'm going to try Spark Structured Streaming.

How do I create a streaming data source in Spark Structured Streaming?


4 Answers


As Spark is moving to the V2 API, you now have to implement DataSourceV2, MicroBatchReadSupport, and DataSourceRegister.

This will involve creating your own implementation of Offset, MicroBatchReader, DataReader<Row>, and DataReaderFactory<Row>.
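To make those moving parts concrete, here is a rough structural sketch, assuming the Spark 2.3.x flavour of the V2 API; the class names (WebSocketMicroBatchProvider, WebSocketOffset, and so on) are made up, and the reader only emits placeholder rows instead of real WebSocket traffic:

    import java.util.Optional
    
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.sources.DataSourceRegister
    import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, MicroBatchReadSupport}
    import org.apache.spark.sql.sources.v2.reader.{DataReader, DataReaderFactory}
    import org.apache.spark.sql.sources.v2.reader.streaming.{MicroBatchReader, Offset}
    import org.apache.spark.sql.types.{StringType, StructField, StructType}
    
    // Entry point: registers the short name and hands out MicroBatchReaders.
    class WebSocketMicroBatchProvider extends DataSourceV2
        with MicroBatchReadSupport with DataSourceRegister {
    
      override def shortName(): String = "websocket"
    
      override def createMicroBatchReader(
          schema: Optional[StructType],
          checkpointLocation: String,
          options: DataSourceOptions): MicroBatchReader = new WebSocketMicroBatchReader(options)
    }
    
    // A counter-based offset; a real source would track something meaningful.
    class WebSocketOffset(val value: Long) extends Offset {
      override def json(): String = value.toString
    }
    
    class WebSocketMicroBatchReader(options: DataSourceOptions) extends MicroBatchReader {
      private var start = new WebSocketOffset(0L)
      private var end = new WebSocketOffset(0L)
    
      override def readSchema(): StructType = StructType(Seq(StructField("value", StringType)))
    
      override def setOffsetRange(startOpt: Optional[Offset], endOpt: Optional[Offset]): Unit = {
        start = startOpt.orElse(new WebSocketOffset(0L)).asInstanceOf[WebSocketOffset]
        // When no end is given, the reader decides how much data the next batch covers.
        end = endOpt.orElse(new WebSocketOffset(start.value + 1)).asInstanceOf[WebSocketOffset]
      }
    
      override def getStartOffset(): Offset = start
      override def getEndOffset(): Offset = end
      override def deserializeOffset(json: String): Offset = new WebSocketOffset(json.toLong)
      override def commit(end: Offset): Unit = ()   // drop buffered data up to `end` here
      override def stop(): Unit = ()                // close the WebSocket connection here
    
      override def createDataReaderFactories(): java.util.List[DataReaderFactory[Row]] =
        java.util.Collections.singletonList[DataReaderFactory[Row]](
          new WebSocketDataReaderFactory(start.value, end.value))
    }
    
    class WebSocketDataReaderFactory(start: Long, end: Long) extends DataReaderFactory[Row] {
      override def createDataReader(): DataReader[Row] = new WebSocketDataReader(start, end)
    }
    
    class WebSocketDataReader(start: Long, end: Long) extends DataReader[Row] {
      private var current = start - 1
      override def next(): Boolean = { current += 1; current < end }  // any rows left in this batch?
      override def get(): Row = Row(s"placeholder-$current")          // replace with buffered messages
      override def close(): Unit = ()
    }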

There are some examples of custom Structured Streaming sources online (in Scala) which were helpful to me when writing mine.

Once you've implemented your custom source, you can follow Jacek Laskowski's answer to register it.

Also, depending on the encoding of the messages you'll receive over the socket, you may be able to just use the default socket source and a custom map function to parse the information into whatever beans you'll be using. Note, though, that Spark says the default socket source shouldn't be used in production!
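For example, a minimal sketch of that map-over-the-socket-source approach (the host, port, and the `user|text` message layout are just assumptions here):

    import org.apache.spark.sql.SparkSession
    
    // Hypothetical message layout: "user|text", parsed into a simple bean-like case class.
    case class Message(user: String, text: String)
    
    object SocketToBeans {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("socket-to-beans").getOrCreate()
        import spark.implicits._
    
        val messages = spark.readStream
          .format("socket")               // built-in source; not meant for production
          .option("host", "localhost")
          .option("port", 9999)
          .load()
          .as[String]                     // the socket source exposes a single "value" column
          .map { line =>
            // Assumes every message contains the delimiter; add error handling as needed.
            val Array(user, text) = line.split("\\|", 2)
            Message(user, text)
          }
    
        messages.writeStream
          .format("console")
          .start()
          .awaitTermination()
      }
    }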

Hope this helps!

alz2

A streaming data source implements org.apache.spark.sql.execution.streaming.Source.

The scaladoc of org.apache.spark.sql.execution.streaming.Source should give you enough information to get started (just follow the types to develop a compilable Scala type).
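To give a feel for the shape of the trait, here is a hedged sketch (not from the scaladoc itself) of a made-up WebSocketSource that serves rows from an in-memory buffer. It assumes Spark 2.x internals and sits under the org.apache.spark.sql package so it can call the private[sql] internalCreateDataFrame, the same way the built-in TextSocketSource builds its streaming DataFrames:

    package org.apache.spark.sql.custom
    
    import org.apache.spark.sql.{DataFrame, SQLContext}
    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.execution.streaming.{LongOffset, Offset, SerializedOffset, Source}
    import org.apache.spark.sql.types.{StringType, StructField, StructType}
    import org.apache.spark.unsafe.types.UTF8String
    
    class WebSocketSource(sqlContext: SQLContext) extends Source {
    
      // Messages pushed in by some WebSocket client thread (not shown).
      private val buffer = scala.collection.mutable.ArrayBuffer.empty[String]
    
      override def schema: StructType = StructType(Seq(StructField("value", StringType)))
    
      // The highest offset with data available, or None if nothing has arrived yet.
      override def getOffset: Option[Offset] = synchronized {
        if (buffer.isEmpty) None else Some(LongOffset(buffer.size.toLong))
      }
    
      // Turn the buffered messages between `start` and `end` into a streaming DataFrame.
      override def getBatch(start: Option[Offset], end: Offset): DataFrame = synchronized {
        val from = start.map(toLong).getOrElse(0L).toInt
        val to = toLong(end).toInt
        val rows = buffer.slice(from, to).map(m => InternalRow(UTF8String.fromString(m)))
        val rdd = sqlContext.sparkContext.parallelize(rows)
        // isStreaming = true matters: the engine rejects non-streaming DataFrames returned by getBatch.
        sqlContext.internalCreateDataFrame(rdd, schema, isStreaming = true)
      }
    
      // After a restart, offsets come back as SerializedOffset, so handle both forms.
      private def toLong(offset: Offset): Long = offset match {
        case LongOffset(value)      => value
        case SerializedOffset(json) => json.toLong
      }
    
      override def commit(end: Offset): Unit = ()  // could drop data up to `end` here
      override def stop(): Unit = ()               // close the WebSocket connection here
    }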

Once you have the Source, you have to register it so you can use it in the format of a DataStreamReader. The trick to making a streaming source available under a short name in format is to create a DataSourceRegister for it. You can find examples in Spark's own META-INF/services/org.apache.spark.sql.sources.DataSourceRegister:

org.apache.spark.sql.execution.datasources.csv.CSVFileFormat
org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider
org.apache.spark.sql.execution.datasources.json.JsonFileFormat
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat
org.apache.spark.sql.execution.datasources.text.TextFileFormat
org.apache.spark.sql.execution.streaming.ConsoleSinkProvider
org.apache.spark.sql.execution.streaming.TextSocketSourceProvider
org.apache.spark.sql.execution.streaming.RateSourceProvider

That's the file that links the short name in format to the implementation.
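For the V1 API, the class you list there is typically a StreamSourceProvider mixed with DataSourceRegister (see also the comment thread below). A hedged sketch with made-up names, wiring in the WebSocketSource from the previous snippet:

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.custom.WebSocketSource  // the Source sketched above
    import org.apache.spark.sql.execution.streaming.Source
    import org.apache.spark.sql.sources.{DataSourceRegister, StreamSourceProvider}
    import org.apache.spark.sql.types.{StringType, StructField, StructType}
    
    // Listing this class in META-INF/services/org.apache.spark.sql.sources.DataSourceRegister
    // makes .format("websocket") resolve to it.
    class WebSocketSourceProvider extends StreamSourceProvider with DataSourceRegister {
    
      private val defaultSchema = StructType(Seq(StructField("value", StringType)))
    
      override def shortName(): String = "websocket"
    
      override def sourceSchema(
          sqlContext: SQLContext,
          schema: Option[StructType],
          providerName: String,
          parameters: Map[String, String]): (String, StructType) =
        (shortName(), schema.getOrElse(defaultSchema))
    
      override def createSource(
          sqlContext: SQLContext,
          metadataPath: String,
          schema: Option[StructType],
          providerName: String,
          parameters: Map[String, String]): Source =
        new WebSocketSource(sqlContext)
    }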

What I usually recommend people do during my Spark workshops is to start development from both sides:

  1. Write the streaming query (with format), e.g.

    val input = spark
      .readStream
      .format("yourCustomSource") // <-- your custom source here
      .load
    
  2. Implement the streaming Source and a corresponding DataSourceRegister (it could be the same class)

  3. (optional) Register the DataSourceRegister by writing the fully-qualified class name, say com.mycompany.spark.MyDataSourceRegister, to META-INF/services/org.apache.spark.sql.sources.DataSourceRegister:

    $ cat META-INF/services/org.apache.spark.sql.sources.DataSourceRegister
    com.mycompany.spark.MyDataSourceRegister
    

The last step, where you register the DataSourceRegister implementation for your custom Source, is optional; it only registers the data source alias that your end users use in the DataFrameReader.format method.

format(source: String): DataFrameReader Specifies the input data source format.

Review the code of org.apache.spark.sql.execution.streaming.RateSourceProvider for a good head start.

Jacek Laskowski
  • Where is `org.apache.spark.sql.execution.streaming.FileStreamSource` registered? – lfk Jul 12 '18 at 07:24
  • @lfk Added more explanation on the registration step. Let me know if it's still unclear. Thanks. – Jacek Laskowski Jul 12 '18 at 21:12
  • `TextFileFormat` for instance implements `DataSourceRegister`, but not `Source`. The actual `Source` is created by `DataSource.createSource` and it's `FileStreamSource` for all instances of `FileFormat` (DataSource.scala line 265, spark-sql_2.11). I can implement `DataSourceRegister`, but I still don't understand how I can use a custom `Source` instance. – lfk Jul 13 '18 at 06:12
  • Why do you think that `TextFileFormat` is responsible for streaming Datasets for text files? It is only used for batch/non-streaming Datasets. Look at `FileStreamSource` (and `DataSource.createSource` that registers it). Why don't you implement `Source` with `DataSourceRegister`? – Jacek Laskowski Jul 14 '18 at 21:04
  • I tried implementing `Source` and `DataSourceRegister`. The code fails with `java.lang.UnsupportedOperationException: Data source test does not support streamed reading`. Check `DataSource.scala:275`. It needs to implement `StreamSourceProvider` not `Source`. That is of course for V1. V2 data sources follow a different code path. – lfk Jul 16 '18 at 07:57

As Spark 3.0 introduced some major changes to the data source API, here is an updated version:

A class named DefaultSource extending TableProvider is the entry point for the API. The getTable method returns a table class extending SupportsRead. This class has to provide a ScanBuilder as well as define the source's capabilities, in this case TableCapability.MICRO_BATCH_READ.

The ScanBuilder creates a class extending Scan that has to implement the toMicroBatchStream method (for a non-streaming use case we would implement the toBatch method instead). toMicroBatchStream returns a class extending MicroBatchStream which implements the logic of what data is available and how to partition it (docs).

Now the only thing left is a PartitionReaderFactory that creates a PartitionReader responsible for actually reading a partition of the data with get returning the rows one by one. You can use InternalRow.fromSeq(List(1,2,3)) to convert the data to an InternalRow.
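To show how these pieces fit together, here is a hedged sketch against the Spark 3.x connector API. All names are invented and, instead of real data, it just emits a counter (ten numbers per micro-batch):

    package com.example.counter
    
    import java.util
    
    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.connector.catalog.{SupportsRead, Table, TableCapability, TableProvider}
    import org.apache.spark.sql.connector.expressions.Transform
    import org.apache.spark.sql.connector.read.{InputPartition, PartitionReader, PartitionReaderFactory, Scan, ScanBuilder}
    import org.apache.spark.sql.connector.read.streaming.{MicroBatchStream, Offset}
    import org.apache.spark.sql.types.{LongType, StructField, StructType}
    import org.apache.spark.sql.util.CaseInsensitiveStringMap
    
    // Entry point: .format("com.example.counter") resolves to this DefaultSource class.
    class DefaultSource extends TableProvider {
      override def inferSchema(options: CaseInsensitiveStringMap): StructType = CounterTable.schema
    
      override def getTable(
          schema: StructType,
          partitioning: Array[Transform],
          properties: util.Map[String, String]): Table = new CounterTable
    }
    
    object CounterTable {
      val schema: StructType = StructType(Seq(StructField("value", LongType)))
    }
    
    // The table declares MICRO_BATCH_READ and hands out a ScanBuilder.
    class CounterTable extends Table with SupportsRead {
      override def name(): String = "counter"
      override def schema(): StructType = CounterTable.schema
      override def capabilities(): util.Set[TableCapability] =
        util.EnumSet.of(TableCapability.MICRO_BATCH_READ)
    
      override def newScanBuilder(options: CaseInsensitiveStringMap): ScanBuilder = new ScanBuilder {
        override def build(): Scan = new CounterScan
      }
    }
    
    class CounterScan extends Scan {
      override def readSchema(): StructType = CounterTable.schema
      override def toMicroBatchStream(checkpointLocation: String): MicroBatchStream =
        new CounterMicroBatchStream
    }
    
    class CounterOffset(val value: Long) extends Offset {
      override def json(): String = value.toString
    }
    
    case class CounterPartition(start: Long, end: Long) extends InputPartition
    
    // Decides what data each micro-batch covers and how it is split into partitions.
    class CounterMicroBatchStream extends MicroBatchStream {
      private var latest = 0L
    
      override def initialOffset(): Offset = new CounterOffset(0L)
      override def latestOffset(): Offset = { latest += 10; new CounterOffset(latest) }
      override def deserializeOffset(json: String): Offset = new CounterOffset(json.toLong)
      override def commit(end: Offset): Unit = ()
      override def stop(): Unit = ()
    
      override def planInputPartitions(start: Offset, end: Offset): Array[InputPartition] =
        Array(CounterPartition(
          start.asInstanceOf[CounterOffset].value,
          end.asInstanceOf[CounterOffset].value))
    
      override def createReaderFactory(): PartitionReaderFactory = new CounterReaderFactory
    }
    
    class CounterReaderFactory extends PartitionReaderFactory {
      override def createReader(partition: InputPartition): PartitionReader[InternalRow] = {
        val p = partition.asInstanceOf[CounterPartition]
        new CounterPartitionReader(p.start, p.end)
      }
    }
    
    // Reads one partition, returning rows one by one via get().
    class CounterPartitionReader(start: Long, end: Long) extends PartitionReader[InternalRow] {
      private var current = start - 1
      override def next(): Boolean = { current += 1; current < end }
      override def get(): InternalRow = InternalRow.fromSeq(Seq(current))
      override def close(): Unit = ()
    }

With that on the classpath, spark.readStream.format("com.example.counter").load() should resolve the DefaultSource by the package-name convention mentioned in the comments below.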

I created a minimal example project: here

Derda
  • Do you have some comprehensive code example to share? Snippets are enough; I do not have a clue how to actually put this together. – stewenson May 27 '22 at 21:42
  • I added an example project to the answer – Derda May 28 '22 at 14:30
  • Great stuff, thanks a lot! How do you actually register it? I see that you did `.format(Some.class.getPackageName())`, but that does not make a lot of sense to me as there is no reference to it in the implementation. Isn't there any resource file to put that implementation in? How does this work? Differently than in Spark 2.x? – stewenson May 28 '22 at 18:14
  • Ok, I checked it in the Spark source code and it loads DefaultSource if the load method is fed a package name - it will load packageName + "DefaultSource". – stewenson May 28 '22 at 20:35
  • @Derda: would it make sense to implement such a thing around `alpakka-kafka` sources? For example `committablePartitionedSource` wraps Kafka anyway. See more at https://doc.akka.io/docs/alpakka-kafka/current/consumer.html. The rationale is to decouple spark-streaming from Kafka for more flexibility in the design of data flow orchestration. – SemanticBeeng Oct 25 '22 at 16:21

Also, here is a sample implementation of a custom WebSocket stream reader/writer which implements Offset, MicroBatchReader, DataReader<Row>, and DataReaderFactory<Row>.

dumitru