I am trying to read data from a Kafka topic and write it to the HDFS filesystem. I built my project using the Apex Malhar Kafka example from https://github.com/apache/apex-malhar/tree/master/examples/kafka. Unfortunately, after setting up the Kafka properties and the Hadoop configuration, no data is created in my HDFS 2.6.0 system. PS: the console doesn't show any errors and everything seems to work fine.

Here is the code I am using for my app:

public class TestConsumer {
    public static void main(String[] args) {
        // start a consumer thread on the configured topic
        Consumer consumerThread = new Consumer(KafkaProperties.TOPIC);
        consumerThread.start();

        // run the Apex application test directly
        ApplicationTest a = new ApplicationTest();
        try {
            a.testApplication();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Here is the ApplicationTest class, based on the example from Apex Malhar:

package org.apache.apex.examples.kafka.kafka2hdfs;

import org.apache.log4j.Logger;
import javax.validation.ConstraintViolationException;

import org.junit.Rule;

import org.apache.apex.malhar.kafka.AbstractKafkaInputOperator;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.net.NetUtils;

import com.datatorrent.api.LocalMode;

import info.batey.kafka.unit.KafkaUnitRule;

/**
 * Test the DAG declaration in local mode.
 */
public class ApplicationTest
{
  private static final Logger LOG = Logger.getLogger(ApplicationTest.class);
  private static final String TOPIC = "kafka2hdfs";

  private static final int zkPort = NetUtils.getFreeSocketPort();
  private static final int brokerPort = NetUtils.getFreeSocketPort();
  private static final String BROKER = "localhost:" + brokerPort;
  private static final String FILE_NAME = "test";
  private static final String FILE_DIR = "./target/tmp/FromKafka";

  // broker port must match properties.xml
  @Rule
  private static KafkaUnitRule kafkaUnitRule = new KafkaUnitRule(zkPort, brokerPort);

  public void testApplication() throws Exception
  {
    try {
      // run app asynchronously; terminate after results are checked
      LocalMode.Controller lc = asyncRun();
      lc.shutdown();
    } catch (ConstraintViolationException e) {
      LOG.error("constraint violations: " + e.getConstraintViolations());
    }
  }

  private Configuration getConfig()
  {
    Configuration conf = new Configuration(false);
    String pre = "dt.operator.kafkaIn.prop.";
    conf.setEnum(pre + "initialOffset", AbstractKafkaInputOperator.InitialOffset.EARLIEST);
    conf.setInt(pre + "initialPartitionCount", 1);
    conf.set(pre + "topics", TOPIC);
    conf.set(pre + "clusters", BROKER);

    pre = "dt.operator.fileOut.prop.";
    conf.set(pre + "filePath", FILE_DIR);
    conf.set(pre + "baseName", FILE_NAME);
    conf.setInt(pre + "maxLength", 40);
    conf.setInt(pre + "rotationWindows", 3);

    return conf;
  }

  private LocalMode.Controller asyncRun() throws Exception
  {
    Configuration conf = getConfig();
    LocalMode lma = LocalMode.newInstance();
    lma.prepareDAG(new KafkaApp(), conf);
    LocalMode.Controller lc = lma.getController();
    lc.runAsync();
    return lc;
  }
}
  • If I had to guess, you're doing `runAsync`, then immediately calling the shutdown method of the controller – OneCricketeer Jul 15 '18 at 12:33
  • In any case, if you have Kafka, then you can use Kafka Connect to write messages to HDFS (it doesn't require a Confluent installation) – OneCricketeer Jul 15 '18 at 12:34
  • Thanks for the reply, but I think the part where it writes data into HDFS is defined in the "lma.prepareDAG(new KafkaApp(), conf);" line inside the asyncRun() method. PS: this is the first time I have worked with it; can you make it more explicit for me? – AMLOCO Jul 16 '18 at 09:04
  • I do not know Apex. I'm here for the Kafka tag. See the documentation from Confluent https://docs.confluent.io/current/connect/connect-hdfs/docs/index.html and a blog https://engineering.pandora.com/creating-a-data-pipeline-with-the-kafka-connect-api-from-architecture-to-operations-56715080ac55 My point here is that you shouldn't need to write any more code than some config files (a config sketch follows these comments) – OneCricketeer Jul 16 '18 at 11:59
  • Regarding what you're trying, though, this example doesn't look like your code... https://github.com/apache/apex-malhar/tree/master/examples/kafka/src/main/java/org/apache/apex/examples/kafka/kafka2hdfs – OneCricketeer Jul 16 '18 at 12:10
  • OK, I will check the Confluent documentation. PS: my code is from https://github.com/apache/apex-malhar/blob/master/examples/kafka/src/test/java/org/apache/apex/examples/kafka/kafka2hdfs/ApplicationTest.java – AMLOCO Jul 16 '18 at 18:13
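
A minimal Kafka Connect HDFS sink configuration along the lines OneCricketeer suggests might look like the sketch below. This is a hedged example: the property names come from the Confluent HDFS sink connector documentation linked above, while the connector name, topic, HDFS URL, and flush size are assumptions standing in for this question's setup.

# hypothetical standalone connector config; adjust topic and URL to your cluster
name=kafka2hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics=kafka2hdfs
hdfs.url=hdfs://localhost:8020
flush.size=3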

1 Answer

After runAsync and before shutdown, you need to wait for the expected results; otherwise the DAG will exit immediately. That's actually what the example itself does.

– Thomas
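
A minimal sketch of that wait, reusing FILE_DIR and asyncRun() from the ApplicationTest above (the polling loop and the 60-second timeout are illustrative assumptions, not the exact check the Malhar example performs):

// inside testApplication(): wait for output before shutting down
LocalMode.Controller lc = asyncRun();

// illustrative wait: give the DAG up to 60 seconds to produce part files
java.io.File outDir = new java.io.File(FILE_DIR);
long deadline = System.currentTimeMillis() + 60_000;
while (System.currentTimeMillis() < deadline) {
  String[] files = outDir.list();
  if (files != null && files.length > 0) {
    break;                    // output has started to appear
  }
  Thread.sleep(1000);         // back off before checking again
}

lc.shutdown();

Polling against a deadline keeps the test from hanging forever if the DAG never produces output.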
  • Unfortunately, this doesn't solve the problem; I still can't find my data stored in the HDFS filesystem – AMLOCO Jul 20 '18 at 09:01