I am trying to modify the hello-samza tutorial to:

(1) Read from a Kafka topic on a remote broker (i.e. not localhost)
(2) Write the message to a file

I modified the WikipediaFeedStreamTask.java to look like the following:

public class WikipediaFeedStreamTask implements StreamTask {
  private static final SystemStream OUTPUT_STREAM = new SystemStream("kafka", "wikipedia-raw");

  @Override
  public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) {
    //System.out.println("Message Received!");
    //System.out.println(envelope.getMessage());
    try {
      PrintWriter writer = new PrintWriter("test.txt", "UTF-8");
      writer.println(envelope.getMessage());
      writer.println("The second line");
      writer.close();
    } catch (IOException e) {
    }
    Map<String, Object> outgoingMap = WikipediaFeedEvent.toMap((WikipediaFeedEvent) envelope.getMessage());
    collector.send(new OutgoingMessageEnvelope(OUTPUT_STREAM, outgoingMap));
  }
}

This is just the standard file, with an addition that writes each message to a file.

And I modified the properties file to look like:

# Job
job.factory.class=org.apache.samza.job.yarn.YarnJobFactory
job.name=wikipedia-feed

# YARN
yarn.package.path=file://${basedir}/target/${project.artifactId}-${pom.version}-dist.tar.gz

# Task
task.class=samza.examples.wikipedia.task.WikipediaFeedStreamTask
task.inputs=wikipedia.#en.wikipedia,wikipedia.#en.wiktionary,wikipedia.#en.wikinews

# Serializers
serializers.registry.json.class=org.apache.samza.serializers.JsonSerdeFactory

# Wikipedia System
systems.wikipedia.samza.factory=samza.examples.wikipedia.system.WikipediaSystemFactory
systems.wikipedia.host=irc.wikimedia.org
systems.wikipedia.port=6667

# Kafka System
systems.kafka.samza.factory=org.apache.samza.system.kafka.KafkaSystemFactory
systems.kafka.samza.msg.serde=json
systems.kafka.consumer.zookeeper.connect=REMOTE-ZOOKEEPER-IP:2181/
systems.kafka.producer.bootstrap.servers=REMOTE-BROKER-IP:9092

# Job Coordinator
job.coordinator.system=kafka

When I run the job, test.txt still contains data from the Wikipedia stream. I was clearly wrong to assume that simply changing the Kafka consumer value in the .properties file would force Samza to read from that broker. So what do I need to change?

Where do I specify which topic Samza should listen to?

Mohammad Ahmad

1 Answer


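I see that you have modified the connection strings of the kafka system. However, your StreamTask's input still refers to the streams of the wikipedia system:

task.inputs=wikipedia.#en.wikipedia,wikipedia.#en.wiktionary,wikipedia.#en.wikinews

task.inputs is where you specify what the job reads, in the form system.stream. Change its value to "kafka.$yourInputStreamName", where "kafka" is the system name used in your systems.kafka.* settings and $yourInputStreamName is the topic you want to consume. Please give it a try; I think that should fix your issue.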
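As a rough sketch, the relevant part of the properties file might then look like the fragment below. The topic name "test" is only a placeholder for your own topic, and the commented-out offset setting is an optional assumption on my part, not something the tutorial sets. Because systems.kafka.samza.msg.serde=json applies to every stream of the kafka system, this also assumes the messages on the input topic are valid JSON.

```properties
# Task
task.class=samza.examples.wikipedia.task.WikipediaFeedStreamTask
# Format: <system-name>.<stream-name>. "kafka" matches the systems.kafka.* block below;
# "test" is a placeholder topic name.
task.inputs=kafka.test

# Kafka System
systems.kafka.samza.factory=org.apache.samza.system.kafka.KafkaSystemFactory
systems.kafka.samza.msg.serde=json
systems.kafka.consumer.zookeeper.connect=REMOTE-ZOOKEEPER-IP:2181/
systems.kafka.producer.bootstrap.servers=REMOTE-BROKER-IP:9092
# Optional (assumption): start from the oldest available message instead of
# only messages produced after the job starts.
# systems.kafka.samza.offset.default=oldest
```

With the input coming from the kafka system, the systems.wikipedia.* block is no longer referenced by task.inputs, so the job should not need it.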

  • Thank you. I changed task.inputs to read 'kafka.' but I didn't see the file 'test.txt' being written as is expected in my process function. I then changed the task.inputs to read 'kafka.test' (since test is the name of the topic I want to consume), however I still didn't see the file being written. I verified the IP:Port addresses are all correct, but it seems Samza isn't reading from the kafka stream. I am able to verify that these messages are being written to the broker. – Mohammad Ahmad Jul 11 '17 at 16:41
  • Oh, my bad. I had a typo in my answer and have fixed it. It should be `kafka.test`. I'm not sure why it is still not working. A couple of things to verify: 1. Make sure the config file under "deploy/samza" actually contains the updated config. 2. Check some of the SamzaContainerMetrics and/or KafkaSystemConsumerMetrics to ensure that the process-calls or messages-read counts are non-zero. Read more about metrics here: https://samza.apache.org/learn/documentation/0.10/container/metrics.html 3. Is there anything specific about your environment that could prevent your process from communicating with the broker? – Navina Ramesh Jul 11 '17 at 18:52