
I just set up the DataTorrent RTS (Apache Apex) platform and ran the pi demo. I want to consume "avro" messages from Kafka, then aggregate the data and store it in HDFS. Can I get example code for this, or for the Kafka part?

Vic

2 Answers


Here is code for a complete working application that uses the new Kafka input operator and the file output operator from Apex Malhar. It converts the byte arrays to Strings and writes them to HDFS using rolling files with a bounded size (1K in this example); until the file size reaches the bound, the file has a temporary name with a .tmp extension. You can interpose additional operators between these two, as suggested by DevT in https://stackoverflow.com/a/36666388:

package com.example.myapexapp;

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

import org.apache.apex.malhar.kafka.AbstractKafkaInputOperator;
import org.apache.apex.malhar.kafka.KafkaSinglePortInputOperator;
import org.apache.hadoop.conf.Configuration;

import com.datatorrent.api.annotation.ApplicationAnnotation;
import com.datatorrent.api.StreamingApplication;
import com.datatorrent.api.DAG;

import com.datatorrent.lib.io.fs.AbstractFileOutputOperator;

@ApplicationAnnotation(name="MyFirstApplication")
public class KafkaApp implements StreamingApplication
{

  @Override
  public void populateDAG(DAG dag, Configuration conf)
  {
    KafkaSinglePortInputOperator in = dag.addOperator("in", new KafkaSinglePortInputOperator());
    in.setInitialPartitionCount(1);
    in.setTopics("test");
    in.setInitialOffset(AbstractKafkaInputOperator.InitialOffset.EARLIEST.name());
    //in.setClusters("localhost:2181");
    in.setClusters("localhost:9092");   // NOTE: need broker address, not zookeeper

    LineOutputOperator out = dag.addOperator("out", new LineOutputOperator());
    out.setFilePath("/tmp/FromKafka");
    out.setFileName("test");
    out.setMaxLength(1024);        // max size of rolling output file

    // create stream connecting input adapter to output adapter
    dag.addStream("data", in.outputPort, out.input);
  }
}

/**
 * Converts each tuple to a string and writes it as a new line to the output file
 */
class LineOutputOperator extends AbstractFileOutputOperator<byte[]>
{
  private static final String NL = System.lineSeparator();
  private static final Charset CS = StandardCharsets.UTF_8;
  private String fileName;

  @Override
  public byte[] getBytesForTuple(byte[] t) { return (new String(t, CS) + NL).getBytes(CS); }

  @Override
  protected String getFileName(byte[] tuple) { return fileName; }

  public String getFileName() { return fileName; }
  public void setFileName(final String v) { fileName = v; }
}
Permutation

At a high level, your application would look like this:

KafkaSinglePortStringInputOperator -> AvroToPojo -> Dimensions Aggregator -> Implementation of AbstractFileOutputOperator

KafkaSinglePortStringInputOperator - if you are working with another data type, you can use KafkaSinglePortByteArrayInputOperator or write a custom implementation.

AvroToPojo - https://github.com/apache/incubator-apex-malhar/blob/5075ce0ef75afccdff2edf4c044465340176a148/contrib/src/main/java/com/datatorrent/contrib/avro/AvroToPojo.java.

This operator converts a GenericRecord to a user-given POJO. The user needs to supply the POJO class that should be emitted; otherwise reflection is used. Currently this operator is used to read GenericRecords from container files, and only primitive types are supported. For reading from Kafka, you can model your operator along similar lines and add a Schema object to parse the incoming records; something like the following in the processTuple method should work:

Schema schema = new Schema.Parser().parse(...);
GenericDatumReader<GenericRecord> reader = new GenericDatumReader<GenericRecord>(schema);
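
For example, here is a minimal sketch of such an operator, assuming the Kafka tuples are Avro-serialized byte arrays. The class name AvroBytesToRecordOperator and the embedded schema are hypothetical, for illustration only:

package com.example.myapexapp;

import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DecoderFactory;

import com.datatorrent.api.Context.OperatorContext;
import com.datatorrent.api.DefaultInputPort;
import com.datatorrent.api.DefaultOutputPort;
import com.datatorrent.common.util.BaseOperator;

public class AvroBytesToRecordOperator extends BaseOperator
{
  // hypothetical schema for illustration; substitute your own
  private static final String USER_SCHEMA =
      "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
      + "{\"name\":\"name\",\"type\":\"string\"},"
      + "{\"name\":\"count\",\"type\":\"int\"}]}";

  private transient Schema schema;
  private transient GenericDatumReader<GenericRecord> reader;

  public final transient DefaultOutputPort<GenericRecord> output = new DefaultOutputPort<GenericRecord>();

  public final transient DefaultInputPort<byte[]> input = new DefaultInputPort<byte[]>()
  {
    @Override
    public void process(byte[] bytes)
    {
      try {
        // decode the Avro binary payload into a GenericRecord and emit it
        BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(bytes, null);
        output.emit(reader.read(null, decoder));
      } catch (IOException e) {
        throw new RuntimeException("failed to decode Avro record", e);
      }
    }
  };

  @Override
  public void setup(OperatorContext context)
  {
    schema = new Schema.Parser().parse(USER_SCHEMA);
    reader = new GenericDatumReader<GenericRecord>(schema);
  }
}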

Dimensions Aggregator - you can pick one of the aggregators given here - https://github.com/apache/incubator-apex-malhar/tree/5075ce0ef75afccdff2edf4c044465340176a148/library/src/main/java/org/apache/apex/malhar/lib/dimensions or write a custom aggregator along the same lines.
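
For illustration, here is a minimal sketch of a custom per-window aggregator, assuming the upstream operator emits GenericRecords with a string "name" field and an integer "count" field (hypothetical fields; adapt them to your schema). It sums counts per name within each streaming window and emits "name,total" strings in endWindow():

package com.example.myapexapp;

import java.util.HashMap;
import java.util.Map;

import org.apache.avro.generic.GenericRecord;

import com.datatorrent.api.DefaultInputPort;
import com.datatorrent.api.DefaultOutputPort;
import com.datatorrent.common.util.BaseOperator;

public class CountByNameAggregator extends BaseOperator
{
  private final Map<String, Long> sums = new HashMap<String, Long>();

  public final transient DefaultOutputPort<String> output = new DefaultOutputPort<String>();

  public final transient DefaultInputPort<GenericRecord> input = new DefaultInputPort<GenericRecord>()
  {
    @Override
    public void process(GenericRecord record)
    {
      // accumulate the "count" field keyed by the "name" field
      String key = record.get("name").toString();
      long value = ((Number)record.get("count")).longValue();
      Long prev = sums.get(key);
      sums.put(key, prev == null ? value : prev + value);
    }
  };

  @Override
  public void endWindow()
  {
    // emit one "name,total" line per key at the end of each window, then reset
    for (Map.Entry<String, Long> e : sums.entrySet()) {
      output.emit(e.getKey() + "," + e.getValue());
    }
    sums.clear();
  }
}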

FileWriter - from the example in the post above.
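
Putting it together, here is a minimal sketch of the DAG wiring under the assumptions above. AvroBytesToRecordOperator and CountByNameAggregator are the hypothetical operators sketched earlier, StringLineOutputOperator is a String variant of the LineOutputOperator from the other answer, and the application name is arbitrary:

package com.example.myapexapp;

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;

import com.datatorrent.api.DAG;
import com.datatorrent.api.StreamingApplication;
import com.datatorrent.api.annotation.ApplicationAnnotation;
import com.datatorrent.contrib.kafka.KafkaSinglePortByteArrayInputOperator;
import com.datatorrent.lib.io.fs.AbstractFileOutputOperator;

@ApplicationAnnotation(name="KafkaAvroToHdfs")
public class KafkaAvroApp implements StreamingApplication
{
  @Override
  public void populateDAG(DAG dag, Configuration conf)
  {
    // Kafka consumer settings (brokers/zookeeper, topic) omitted here;
    // configure them via operator properties as in the other answer
    KafkaSinglePortByteArrayInputOperator in =
        dag.addOperator("kafkaIn", new KafkaSinglePortByteArrayInputOperator());

    AvroBytesToRecordOperator decode = dag.addOperator("decode", new AvroBytesToRecordOperator());
    CountByNameAggregator agg = dag.addOperator("aggregate", new CountByNameAggregator());

    StringLineOutputOperator out = dag.addOperator("fileOut", new StringLineOutputOperator());
    out.setFilePath("/tmp/FromKafka");
    out.setFileName("aggregates");

    dag.addStream("raw", in.outputPort, decode.input);
    dag.addStream("records", decode.output, agg.input);
    dag.addStream("totals", agg.output, out.input);
  }
}

/**
 * String variant of the LineOutputOperator from the other answer:
 * writes each tuple as a new line to a rolling output file
 */
class StringLineOutputOperator extends AbstractFileOutputOperator<String>
{
  private String fileName;

  @Override
  public byte[] getBytesForTuple(String t) { return (t + System.lineSeparator()).getBytes(StandardCharsets.UTF_8); }

  @Override
  protected String getFileName(String tuple) { return fileName; }

  public String getFileName() { return fileName; }
  public void setFileName(final String v) { fileName = v; }
}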

DevT