I have a custom reader for Spark Streaming that reads data from WebSocket. I'm going to try Spark Structured Streaming.
How to create a streaming data source in Spark Structured Streaming?
As Spark is moving to the V2 API, you now have to implement DataSourceV2, MicroBatchReadSupport, and DataSourceRegister.
This will involve creating your own implementations of Offset, MicroBatchReader, DataReader<Row>, and DataReaderFactory<Row>.
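A minimal sketch of what the entry point could look like against the Spark 2.3-era V2 API; the class name, the "websocket" alias, and the WebSocket specifics are my own placeholders, and the MicroBatchReader, DataReaderFactory[Row], and DataReader[Row] bodies are omitted:

import java.util.Optional

import org.apache.spark.sql.sources.DataSourceRegister
import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, MicroBatchReadSupport}
import org.apache.spark.sql.sources.v2.reader.streaming.MicroBatchReader
import org.apache.spark.sql.types.StructType

class WebSocketSourceProvider extends DataSourceV2
    with MicroBatchReadSupport with DataSourceRegister {

  // short name usable in spark.readStream.format("websocket")
  override def shortName(): String = "websocket"

  // Return your MicroBatchReader here: it tracks Offsets, plans
  // DataReaderFactory[Row] instances, and each factory creates a
  // DataReader[Row] that hands back rows via next()/get()/close().
  override def createMicroBatchReader(
      schema: Optional[StructType],
      checkpointLocation: String,
      options: DataSourceOptions): MicroBatchReader = ???  // your reader implementation
}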
There are some examples of custom Structured Streaming sources online (in Scala) which were helpful to me when writing mine.
Once you've implemented your custom source, you can follow Jacek Laskowski's answer for registering the source.
Also, depending on the encoding of the messages you'll receive from the socket, you may be able to just use the default socket source and a custom map function to parse the information into whatever beans you'll be using, as sketched below. Do note, though, that Spark says the default socket streaming source shouldn't be used in production!
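For example, assuming each socket message is a simple delimited string (the MessageBean case class and the comma-separated format are just assumptions for illustration):

import org.apache.spark.sql.SparkSession

case class MessageBean(id: String, payload: String)

val spark = SparkSession.builder.appName("socket-to-beans").getOrCreate()
import spark.implicits._

// The built-in socket source yields a single string column named "value".
val messages = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()
  .as[String]
  .map { line =>
    val Array(id, payload) = line.split(",", 2)  // assumes "id,payload" lines
    MessageBean(id, payload)
  }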
Hope this helps!
A streaming data source implements org.apache.spark.sql.execution.streaming.Source.
The scaladoc of org.apache.spark.sql.execution.streaming.Source should give you enough information to get started (just follow the types to develop a compilable Scala type).
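To make that concrete, here is a rough sketch of the Source contract, assuming Spark 2.x; the WebSocket plumbing, the single value: string column, and the LongOffset bookkeeping are illustrative assumptions. Recent 2.x versions expect the DataFrame returned by getBatch to be flagged as streaming, which is why this sketch lives under the org.apache.spark.sql package and uses the internal internalCreateDataFrame helper, similar to the built-in TextSocketSource:

// A custom package under org.apache.spark.sql to get access to the
// private[sql] internalCreateDataFrame helper.
package org.apache.spark.sql.websocket

import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.execution.streaming.{LongOffset, Offset, Source}
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.unsafe.types.UTF8String

class WebSocketSource(sqlContext: SQLContext) extends Source {

  // lines buffered from the WebSocket connection (client code elided)
  @volatile private var lines = Vector.empty[String]

  override def schema: StructType =
    StructType(StructField("value", StringType) :: Nil)

  // the latest offset the source can serve, or None before any data arrives
  override def getOffset: Option[Offset] =
    if (lines.isEmpty) None else Some(LongOffset(lines.size.toLong))

  // all lines in (start, end] as a DataFrame flagged with isStreaming = true
  // (offsets recovered from a checkpoint would need extra handling here)
  override def getBatch(start: Option[Offset], end: Offset): DataFrame = {
    val from = start.map(_.asInstanceOf[LongOffset].offset.toInt).getOrElse(0)
    val to = end.asInstanceOf[LongOffset].offset.toInt
    val rows = sqlContext.sparkContext.parallelize(
      lines.slice(from, to).map(l => InternalRow(UTF8String.fromString(l))))
    sqlContext.internalCreateDataFrame(rows, schema, isStreaming = true)
  }

  override def commit(end: Offset): Unit = ()  // e.g. drop lines up to `end`

  override def stop(): Unit = ()               // close the WebSocket connection
}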
Once you have the Source, you have to register it so you can use it in the format of a DataStreamReader. The trick to making the streaming source available for format is to create a DataSourceRegister for it. You can find examples in META-INF/services/org.apache.spark.sql.sources.DataSourceRegister:
org.apache.spark.sql.execution.datasources.csv.CSVFileFormat
org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider
org.apache.spark.sql.execution.datasources.json.JsonFileFormat
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat
org.apache.spark.sql.execution.datasources.text.TextFileFormat
org.apache.spark.sql.execution.streaming.ConsoleSinkProvider
org.apache.spark.sql.execution.streaming.TextSocketSourceProvider
org.apache.spark.sql.execution.streaming.RateSourceProvider
That's the file that links the short name used in format to the implementation.
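For the Source sketched above, the registration side could look roughly like this, assuming the V1 StreamSourceProvider API that goes with the Source trait; the class name and the "websocket" alias are placeholders, and the class sits in the same package as the Source sketch:

package org.apache.spark.sql.websocket

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.execution.streaming.Source
import org.apache.spark.sql.sources.{DataSourceRegister, StreamSourceProvider}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

class WebSocketSourceProvider extends DataSourceRegister with StreamSourceProvider {

  // the alias end users pass to format(...)
  override def shortName(): String = "websocket"

  // the schema reported before the source is created
  override def sourceSchema(
      sqlContext: SQLContext,
      schema: Option[StructType],
      providerName: String,
      parameters: Map[String, String]): (String, StructType) =
    (shortName(), schema.getOrElse(StructType(StructField("value", StringType) :: Nil)))

  override def createSource(
      sqlContext: SQLContext,
      metadataPath: String,
      schema: Option[StructType],
      providerName: String,
      parameters: Map[String, String]): Source =
    new WebSocketSource(sqlContext)
}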
What I usually recommend people do during my Spark workshops is to start development from both sides:
Write the streaming query (with format), e.g.
val input = spark
  .readStream
  .format("yourCustomSource") // <-- your custom source here
  .load
Implement the streaming Source and a corresponding DataSourceRegister (it could be the same class)
(optional) Register the DataSourceRegister by writing the fully-qualified class name, say com.mycompany.spark.MyDataSourceRegister, to META-INF/services/org.apache.spark.sql.sources.DataSourceRegister:
$ cat META-INF/services/org.apache.spark.sql.sources.DataSourceRegister
com.mycompany.spark.MyDataSourceRegister
The last step, where you register the DataSourceRegister implementation for your custom Source, is optional and only registers the data source alias that your end users use in the DataFrameReader.format method.
format(source: String): DataFrameReader Specifies the input data source format.
Review the code of org.apache.spark.sql.execution.streaming.RateSourceProvider for a good head start.
As Spark 3.0 introduced some major changes to the data source API, here is an updated version:
A class named DefaultSource extending TableProvider is the entry point for the API. The getTable method returns a table class extending SupportsRead. This class has to provide a ScanBuilder as well as define the source's capabilities, in this case TableCapability.MICRO_BATCH_READ.
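A rough sketch of that entry point, assuming Spark 3.x and a fixed value: string schema; the package name, the WebSocketTable class, and the WebSocket specifics are placeholders (with this package name, the query side would use .format("com.example.websocket")):

package com.example.websocket

import java.util

import org.apache.spark.sql.connector.catalog.{SupportsRead, Table, TableCapability, TableProvider}
import org.apache.spark.sql.connector.expressions.Transform
import org.apache.spark.sql.connector.read.ScanBuilder
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.sql.util.CaseInsensitiveStringMap

class DefaultSource extends TableProvider {

  override def inferSchema(options: CaseInsensitiveStringMap): StructType =
    StructType(StructField("value", StringType) :: Nil)

  override def getTable(
      schema: StructType,
      partitioning: Array[Transform],
      properties: util.Map[String, String]): Table =
    new WebSocketTable(schema)
}

class WebSocketTable(tableSchema: StructType) extends Table with SupportsRead {

  override def name(): String = "websocket"
  override def schema(): StructType = tableSchema
  override def capabilities(): util.Set[TableCapability] =
    util.Collections.singleton(TableCapability.MICRO_BATCH_READ)

  // WebSocketScanBuilder is defined in the next sketch.
  override def newScanBuilder(options: CaseInsensitiveStringMap): ScanBuilder =
    new WebSocketScanBuilder(tableSchema)
}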
The ScanBuilder creates a class extending Scan that has to implement the toMicroBatchStream method (for a non-streaming use case we would implement the toBatch method instead). toMicroBatchStream now returns a class extending MicroBatchStream, which implements the logic of what data is available and how to partition it (docs).
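Continuing the sketch (same placeholder package as above); the count-based offset and the single in-memory partition are illustrative assumptions:

package com.example.websocket

import org.apache.spark.sql.connector.read.{InputPartition, PartitionReaderFactory, Scan, ScanBuilder}
import org.apache.spark.sql.connector.read.streaming.{MicroBatchStream, Offset}
import org.apache.spark.sql.types.StructType

class WebSocketScanBuilder(schema: StructType) extends ScanBuilder {
  override def build(): Scan = new WebSocketScan(schema)
}

class WebSocketScan(schema: StructType) extends Scan {
  override def readSchema(): StructType = schema

  // For a batch source we would override toBatch instead.
  override def toMicroBatchStream(checkpointLocation: String): MicroBatchStream =
    new WebSocketMicroBatchStream
}

// A simple offset carrying a running record count.
case class CountOffset(count: Long) extends Offset {
  override def json(): String = count.toString
}

// One partition holding the slice of buffered lines between two offsets.
case class WebSocketInputPartition(rows: Seq[String]) extends InputPartition

class WebSocketMicroBatchStream extends MicroBatchStream {

  // lines buffered from the WebSocket connection (client code elided)
  @volatile private var received = Vector.empty[String]

  override def initialOffset(): Offset = CountOffset(0L)
  override def latestOffset(): Offset = CountOffset(received.size.toLong)
  override def deserializeOffset(json: String): Offset = CountOffset(json.toLong)

  override def planInputPartitions(start: Offset, end: Offset): Array[InputPartition] = {
    val from = start.asInstanceOf[CountOffset].count.toInt
    val to = end.asInstanceOf[CountOffset].count.toInt
    Array[InputPartition](WebSocketInputPartition(received.slice(from, to)))
  }

  // WebSocketPartitionReaderFactory is defined in the next sketch.
  override def createReaderFactory(): PartitionReaderFactory =
    new WebSocketPartitionReaderFactory

  override def commit(end: Offset): Unit = ()  // e.g. drop lines up to `end`
  override def stop(): Unit = ()               // close the WebSocket connection
}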
Now the only thing left is a PartitionReaderFactory that creates a PartitionReader responsible for actually reading a partition of the data, with get returning the rows one by one. You can use InternalRow.fromSeq(List(1,2,3)) to convert the data to an InternalRow.
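The last piece of the sketch, again under the same placeholder package; since the assumed schema is a single string column, the values are wrapped as UTF8String before building the InternalRow:

package com.example.websocket

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.read.{InputPartition, PartitionReader, PartitionReaderFactory}
import org.apache.spark.unsafe.types.UTF8String

class WebSocketPartitionReaderFactory extends PartitionReaderFactory {
  override def createReader(partition: InputPartition): PartitionReader[InternalRow] =
    new WebSocketPartitionReader(partition.asInstanceOf[WebSocketInputPartition].rows)
}

class WebSocketPartitionReader(rows: Seq[String]) extends PartitionReader[InternalRow] {

  private val iterator = rows.iterator
  private var current: String = _

  // Advance to the next row, returning false when the partition is exhausted.
  override def next(): Boolean = {
    val hasNext = iterator.hasNext
    if (hasNext) current = iterator.next()
    hasNext
  }

  // Emit the current row; string values go in as UTF8String.
  override def get(): InternalRow = InternalRow.fromSeq(Seq(UTF8String.fromString(current)))

  override def close(): Unit = ()
}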
I created a minimal example project: here