
I'm trying to continuously send data (packets sniffed with tshark) to a Kafka broker/consumer.

Here are the steps I followed:

1. Started zookeeper:

kafka/bin/zookeeper-server-start.sh ../kafka/config/zookeeper.properties

2. Started kafka server:

kafka/bin/kafka-server-start.sh ../kafka/config/server.properties

3. Started kafka consumer:

kafka/bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic 'my-topic' --from-beginning

4. Wrote the following Python script to send the sniffed data to the consumer:

from kafka import KafkaProducer
import subprocess
producer = KafkaProducer(bootstrap_servers='localhost:9092')
producer.send('my-topic', subprocess.check_output(['tshark','-i','wlan0']))

but this just stays in the producer terminal and outputs:

Capturing on 'wlan0'
605
^C

Nothing gets transferred to the consumer. I suspect the problem is that check_output() blocks until tshark exits, which never happens for a live capture, so send() never receives any data.
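
If check_output() really is the blocker, then reading tshark's stdout line by line seems closer to what I want. This is only an untested sketch, not working code (the -l flag makes tshark flush its output after each packet):

from kafka import KafkaProducer
import subprocess

producer = KafkaProducer(bootstrap_servers='localhost:9092')

# -l flushes tshark's stdout after every captured packet
proc = subprocess.Popen(['tshark', '-l', '-i', 'wlan0'],
                        stdout=subprocess.PIPE)

# one summary line per captured packet -> one Kafka message
for line in iter(proc.stdout.readline, b''):
    producer.send('my-topic', line.rstrip())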

I also know I can use pyshark to drive tshark from Python:

import pyshark
capture = pyshark.LiveCapture(interface='eth0')
capture.sniff(timeout=5)
capture1 = capture[0]
print(capture1)

But I don't know how to continuously send the captured packets from the producer to the consumer (a rough sketch of what I have in mind follows). Any advice?
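
Here is roughly what I mean, assuming (from the pyshark docs) that sniff_continuously() yields packets as they arrive; again an untested sketch, not working code:

import pyshark
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers='localhost:9092')
capture = pyshark.LiveCapture(interface='wlan0')

# assumption: sniff_continuously() yields each packet as it is captured
for packet in capture.sniff_continuously():
    # one packet summary per Kafka message
    producer.send('my-topic', str(packet).encode('utf-8'))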

Thank you!

HackCode
  • That's also my question: what dissatisfied you about your previous answer/question that you needed to ask a totally different one? Or why is this question different enough? – Morgan Kenyon Mar 09 '16 at 18:25
  • The previous question was more generic, involving the already-available producer scripts, but here I'm trying to implement it in Python. Also, in this question I'm being MORE SPECIFIC about which tools and technologies I've tried. – HackCode Mar 10 '16 at 08:18

1 Answer


Check the following link:

http://zdatainc.com/2014/07/real-time-streaming-apache-storm-apache-kafka/

Implementing the Kafka Producer

Here, the main portions of the code for the Kafka producer that was used to test our cluster are defined. In the main class, we set up the data pipes and threads:

LOGGER.debug("Setting up streams");
PipedInputStream send = new PipedInputStream(BUFFER_LEN);
PipedOutputStream input = new PipedOutputStream(send);

LOGGER.debug("Setting up connections");
LOGGER.debug("Setting up file reader");
BufferedFileReader reader = new BufferedFileReader(filename, input);
LOGGER.debug("Setting up kafka producer");
KafkaProducer kafkaProducer = new KafkaProducer(topic, send);

LOGGER.debug("Spinning up threads");
Thread source = new Thread(reader);
Thread kafka = new Thread(kafkaProducer);

source.start();
kafka.start();

LOGGER.debug("Joining");
kafka.join();

The BufferedFileReader, running in its own thread, reads the data off disk:

rd = new BufferedReader(new FileReader(this.fileToRead));
wd = new BufferedWriter(new OutputStreamWriter(this.outputStream, ENC));
int b = -1;
while ((b = rd.read()) != -1)
{
    wd.write(b);
}

Finally, the KafkaProducer sends asynchronous messages to the Kafka cluster:

rd = new BufferedReader(new InputStreamReader(this.inputStream, ENC));
String line = null;
producer = new Producer<Integer, String>(conf);
while ((line = rd.readLine()) != null)
{
    producer.send(new KeyedMessage<Integer, String>(this.topic, line));
}

Doing these operations on separate threads keeps disk reads from blocking the Kafka producer (and vice versa), with performance tunable via the size of the buffer.
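
The same two-thread pattern maps onto your Python/tshark case. Here is a rough, untested sketch using kafka-python, with a queue standing in for the piped streams (the tshark command, topic name, and function names are mine, not from the article):

import subprocess
import threading
from queue import Queue  # 'Queue' on Python 2
from kafka import KafkaProducer

buf = Queue(maxsize=10000)  # bounded buffer between reader and sender

def read_packets():
    # source thread: one summary line per packet from tshark's stdout
    proc = subprocess.Popen(['tshark', '-l', '-i', 'wlan0'],
                            stdout=subprocess.PIPE)
    for line in iter(proc.stdout.readline, b''):
        buf.put(line.rstrip())

def send_packets():
    # sink thread: drain the buffer and publish to Kafka
    producer = KafkaProducer(bootstrap_servers='localhost:9092')
    while True:
        producer.send('my-topic', buf.get())

threading.Thread(target=read_packets).start()
threading.Thread(target=send_packets).start()

As in the article, a slow send never blocks the capture, and backpressure is bounded by the queue size.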

Implementing the Storm Topology

Topology Definition

Moving on to Storm, here we define the topology and how the bolts talk to each other:
TopologyBuilder topology = new TopologyBuilder();

topology.setSpout("kafka_spout", new KafkaSpout(kafkaConf), 4);

topology.setBolt("twitter_filter", new TwitterFilterBolt(), 4)
        .shuffleGrouping("kafka_spout");

topology.setBolt("text_filter", new TextFilterBolt(), 4)
        .shuffleGrouping("twitter_filter");

topology.setBolt("stemming", new StemmingBolt(), 4)
        .shuffleGrouping("text_filter");

topology.setBolt("positive", new PositiveSentimentBolt(), 4)
        .shuffleGrouping("stemming");
topology.setBolt("negative", new NegativeSentimentBolt(), 4)
        .shuffleGrouping("stemming");

topology.setBolt("join", new JoinSentimentsBolt(), 4)
        .fieldsGrouping("positive", new Fields("tweet_id"))
        .fieldsGrouping("negative", new Fields("tweet_id"));

topology.setBolt("score", new SentimentScoringBolt(), 4)
        .shuffleGrouping("join");

topology.setBolt("hdfs", new HDFSBolt(), 4)
        .shuffleGrouping("score");
topology.setBolt("nodejs", new NodeNotifierBolt(), 4)
        .shuffleGrouping("score");

Notably, the data is shuffled to each bolt except when joining, as it's very important that the same tweets are given to the same instance of the joining bolt.

Biswajit Karmakar