
I use Spark 1.6 and Kafka 0.8.2.1.

I am trying to fetch some data from Kafka using Spark Streaming and do some operations on that data.

For that, I need to know the schema of the fetched data. Is there some way to get it, or can we access values from the stream by field name?


1 Answer


TL;DR It's not possible directly (esp. with the old Spark 1.6), but not impossible either.

Kafka stores bytes, and bytes are what Spark Streaming receives. To recover a schema, you'd have to pass some extra information alongside the payload (possibly as a JSON-encoded string) and use it to decode the other field. It is not available out of the box, but it is certainly doable.


As a suggestion, I'd send messages whose value is always a two-field data structure: the schema (of the value field) and the value itself, both in JSON format.
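
For illustration, here's a rough sketch (untested) of what a producer could build under that convention. The field names schema and payload are just a convention I'm making up here, not anything Kafka or Spark prescribes; StructType.json does the schema-to-JSON rendering, and json4s (bundled with Spark) takes care of the escaping:

    import org.apache.spark.sql.types._
    import org.json4s.JsonDSL._
    import org.json4s.jackson.JsonMethods.{compact, parse, render}

    // The payload's schema, defined once on the producer side.
    val payloadSchema = StructType(Seq(
      StructField("id", LongType, nullable = false),
      StructField("name", StringType, nullable = true)))

    // StructType.json renders the schema as a JSON-encoded string; ship it
    // next to the payload so consumers can rebuild the schema.
    val messageValue: String = compact(render(
      ("schema" -> payloadSchema.json) ~
      ("payload" -> parse("""{"id": 1, "name": "jacek"}"""))))
    // {"schema":"{\"type\":\"struct\",...}","payload":{"id":1,"name":"jacek"}}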

You could then use one of the from_json functions:

from_json(e: Column, schema: StructType): Column

Parses a column containing a JSON string into a StructType with the specified schema.

Given that from_json was only added in Spark 2.1.0, on Spark 1.6 you'd have to register your own custom user-defined function (UDF) that deserializes the string value into the corresponding structure (see how from_json does it and copy the approach).
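
Alternatively, on Spark 1.6 you can skip the UDF entirely and apply the schema with DataFrameReader.schema(...).json(...), which accepts an RDD[String] of JSON documents. A rough sketch (untested), assuming the envelope format above and a Kafka 0.8 direct stream; the topic name, broker address and batch interval are made up:

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.types.{DataType, StructType}
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils
    import org.json4s._
    import org.json4s.jackson.JsonMethods.{compact, parse, render}

    val conf = new SparkConf().setAppName("SchemaAwareStream")
    val ssc = new StreamingContext(conf, Seconds(10))
    val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("events"))

    // Values only; keys are unused in this sketch.
    stream.map(_._2).foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        // The schema travels inside every envelope; read it off the first
        // record (assumes all records in a batch share the same schema).
        implicit val formats = DefaultFormats
        val schemaJson = (parse(rdd.first()) \ "schema").extract[String]
        val schema = DataType.fromJson(schemaJson).asInstanceOf[StructType]
        // Re-render each payload as a standalone JSON document.
        val payloads = rdd.map(record => compact(render(parse(record) \ "payload")))
        val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
        // Apply the known schema instead of letting Spark infer one.
        val events = sqlContext.read.schema(schema).json(payloads)
        events.show()
      }
    }

    ssc.start()
    ssc.awaitTermination()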

Note that the DataType object comes with a fromJson method that can "map" a JSON-encoded string into a DataType describing your schema.

fromJson(json: String): DataType
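
For example, the JSON produced by StructType.json round-trips through DataType.fromJson (a small sketch; the field names are illustrative):

    import org.apache.spark.sql.types.{DataType, StructType}

    val schemaJson =
      """{"type":"struct","fields":[
        |  {"name":"id","type":"long","nullable":false,"metadata":{}},
        |  {"name":"name","type":"string","nullable":true,"metadata":{}}]}""".stripMargin

    // fromJson returns the general DataType; a schema is a StructType underneath.
    val schema = DataType.fromJson(schemaJson).asInstanceOf[StructType]
    // StructType(StructField(id,LongType,false), StructField(name,StringType,true))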
