
I'm looking for how to read Avro messages with a complex structure from Kafka using Spark Structured Streaming.

I then want to parse these messages, compare them with reference values in HBase, and save the outcome to HDFS or another HBase table.
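For context, the read side based on the linked sample looks roughly like this (a minimal sketch; the broker address and topic name are placeholders, and the Avro payload arrives as raw bytes in the `value` column; the results would later be written out with `writeStream`):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    SparkSession sparkSession = SparkSession.builder()
        .appName("structured-avro-demo") // placeholder name
        .getOrCreate();

    // The Kafka source delivers each Avro message as raw bytes in the "value" column.
    Dataset<Row> kafkaStream = sparkSession
        .readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092") // placeholder
        .option("subscribe", "my-topic")                     // placeholder
        .load()
        .selectExpr("value");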

I started with this sample code: https://github.com/Neuw84/spark-continuous-streaming/blob/master/src/main/java/es/aconde/structured/StructuredDemo.java

Avro message schema:

struct[mTimeSeries:
  struct[cName:string,
         eIpAddr:string,
         pIpAddr:string,
         pTime:string,
         mtrcs:array[struct[mName:string,
                            xValues:array[bigint],
                            yValues:array[string],
                            rName:string]]]]
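
Expressed as a Spark SQL schema, that structure would look something like this (a sketch; `bigint` maps to `LongType`, and field nullability is assumed). The top-level `type` here is what gets passed when registering the UDF below:

    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructField;
    import org.apache.spark.sql.types.StructType;

    // Element type of the mtrcs array.
    StructType metricType = DataTypes.createStructType(new StructField[]{
        DataTypes.createStructField("mName", DataTypes.StringType, true),
        DataTypes.createStructField("xValues",
            DataTypes.createArrayType(DataTypes.LongType), true),
        DataTypes.createStructField("yValues",
            DataTypes.createArrayType(DataTypes.StringType), true),
        DataTypes.createStructField("rName", DataTypes.StringType, true)
    });

    StructType mTimeSeriesType = DataTypes.createStructType(new StructField[]{
        DataTypes.createStructField("cName", DataTypes.StringType, true),
        DataTypes.createStructField("eIpAddr", DataTypes.StringType, true),
        DataTypes.createStructField("pIpAddr", DataTypes.StringType, true),
        DataTypes.createStructField("pTime", DataTypes.StringType, true),
        DataTypes.createStructField("mtrcs",
            DataTypes.createArrayType(metricType), true)
    });

    // Top-level schema: struct[mTimeSeries: struct[...]]
    StructType type = DataTypes.createStructType(new StructField[]{
        DataTypes.createStructField("mTimeSeries", mTimeSeriesType, true)
    });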

I am struggling to create a row using RowFactory.create for this schema. Do I need to iterate through the array fields? I understand that, once a Dataset with this structure exists, we can use explode functions to denormalize it or access the inner fields of the struct array, as I do in Hive. So I would like to create a Row exactly as the Avro message looks, and then use SQL functions to transform it further.

    sparkSession.udf().register("deserialize", (byte[] data) -> {
        GenericRecord record = recordInjection.invert(data).get();
        return RowFactory.create(record.get("machine").toString(), record.get("sensor").toString(), record.get("data"), record.get("eventTime"));
    }, DataTypes.createStructType(type.fields()));
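
What I am after, I think, is something along these lines: since `mtrcs` is an array of structs, each array element would be mapped to a nested Row by hand (a sketch, assuming `recordInjection` from the linked sample and the `type` schema sketched above; null handling omitted):

    import java.util.List;
    import java.util.stream.Collectors;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.RowFactory;

    sparkSession.udf().register("deserialize", (byte[] data) -> {
        GenericRecord record = recordInjection.invert(data).get();
        GenericRecord ts = (GenericRecord) record.get("mTimeSeries");

        // Each element of mtrcs is itself a record, so build one nested Row per element.
        List<Row> metricRows = ((List<?>) ts.get("mtrcs")).stream()
            .map(m -> {
                GenericRecord metric = (GenericRecord) m;
                return RowFactory.create(
                    metric.get("mName").toString(),
                    metric.get("xValues"),                     // List<Long> for array[bigint]
                    ((List<?>) metric.get("yValues")).stream() // Avro Utf8 -> String
                        .map(Object::toString)
                        .collect(Collectors.toList()),
                    metric.get("rName").toString());
            })
            .collect(Collectors.toList());

        // The outer Row mirrors struct[mTimeSeries: struct[...]].
        return RowFactory.create(RowFactory.create(
            ts.get("cName").toString(),
            ts.get("eIpAddr").toString(),
            ts.get("pIpAddr").toString(),
            ts.get("pTime").toString(),
            metricRows));
    }, type);

With the Dataset in that shape, explode should then work as in Hive, e.g.:

    Dataset<Row> metrics = kafkaStream
        .selectExpr("deserialize(value) AS msg")
        .selectExpr("msg.mTimeSeries.cName AS cName",
                    "explode(msg.mTimeSeries.mtrcs) AS metric");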
  • Asking for off-site resources is off-topic for Stackoverflow. See [help]. – OneCricketeer Jan 25 '19 at 22:26
  • Start here https://github.com/AbsaOSS/ABRiS/ – OneCricketeer Jan 25 '19 at 22:26
  • We have Spark 2.1.1, and `from_avro` is only available in the latest Spark, so I'm not sure if I can use ABRiS. – Kim Jan 28 '19 at 14:46
  • Neither `spark-avro` nor the `from_avro` function is made to work with the Schema Registry. – OneCricketeer Jan 28 '19 at 17:37
  • Also see previous questions https://stackoverflow.com/questions/tagged/confluent-schema-registry+apache-spark?sort=votes&pageSize=15 – OneCricketeer Jan 28 '19 at 17:38
  • Thank you @cricket_007. I have edited my post down to a specific question. Please help! – Kim Feb 04 '19 at 14:00
  • Sorry, I haven't used Spark in years. If all you have is Avro data (with the Confluent Schema Registry) in Kafka, then HDFS Kafka Connect (or NiFi) would be easier for getting data into Hadoop. – OneCricketeer Feb 04 '19 at 17:11
  • Hmm, we are using NiFi to push from an external Kafka into our own Kafka, and from there a streaming process should pick the data up and do some transformation and other steps. So we are trying to build a streaming application using Spark (still undecided between Spark Structured Streaming and Spark Streaming). The expectation is that the Spark streaming application deals with the Avro format, since NiFi is not expected to change the format of the data; it is used as just an ingestion tool. – Kim Feb 04 '19 at 17:51
  • NiFi can use the Confluent (or Hortonworks) Avro schema registries just fine, at least in recent versions. And I've converted JSON to Avro once, I think. – OneCricketeer Feb 04 '19 at 17:57
  • Yes, we use the Confluent Schema Registry to read from the external Kafka, but while publishing to our own Kafka we do not use the Schema Registry; instead we use the schema text option. Per the architecture setup, only NiFi should talk to Confluent, not our streaming app. – Kim Feb 04 '19 at 18:18
  • So you're putting the Avro schema into each and every message? Why would you do that? And you can use [ConvertAvroToJSON](https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-avro-nar/1.5.0/org.apache.nifi.processors.avro.ConvertAvroToJSON/index.html) if you need to... My point here is that Spark seems really unnecessary if you have NiFi – OneCricketeer Feb 04 '19 at 19:18
  • Unfortunately, the standard here is not to use NiFi for any processing; it must be used as an ingestion-only tool. Also, NiFi is not supposed to convert to JSON. It must pass messages as-is from the source it gets data from. – Kim Feb 04 '19 at 20:04

0 Answers