I am trying to continuously consume events from Kafka. The same application also uses the consumed data to perform some analysis and update a database in n-second intervals (assume n = 60 seconds).

In the same application, let process1 = the Kafka consumer and process2 = the data analysis and database update logic.

process1 is to be run continuously
process2 is to be executed once every n=60 seconds 

process2 is concerned with computation and a database update, so it will take 5-10 seconds to execute, and I do not want process1 to stall while process2 is running. To get that concurrency I am using the multiprocessing module: process1 and process2 would be thread1 and thread2 if I were using the threading module, but from what I have read the GIL prevents the threading module from leveraging a multi-core architecture, so I decided to go with multiprocessing instead. (If my understanding of the GIL or the threading module's limitations is incorrect, my apologies, and please feel free to correct me.)

The interaction between the two processes is fairly simple: process1 just fills the queue with all the messages it receives during a 60-second window and, at the end of those 60 seconds, hands them all over to process2.

I am having trouble with this transfer logic. How do I transfer the contents of the Queue from process1 to process2 at the end of the 60 seconds, and then clear the queue so it starts fresh on the next iteration? (A related question: should process2 be the main process or yet another process, i.e. should I instantiate 2 processes in addition to the main process?)

So far I have the following:

import sys
from kafka.client import KafkaClient
from kafka import SimpleConsumer
import time
from multiprocessing import Process,Queue
from Queue import Empty  #exception raised by q.get() on timeout (Python 2 stdlib)

def kafka_init():
    client=KafkaClient('kafka1.wpit.nile.works')
    consumer=SimpleConsumer(client, "druidkafkaconsumer", "personalization.targeting.clickstream.prod")
    return consumer

def consumeMessages(q):
    print "thread started"
    while not q.empty():
        try:
            print q.get(True,1)
        except Empty:
            break
    print "thread ended"
if __name__=="__main__":
    starttime=time.time()
    timeout=starttime+ 10 #timeout of read in seconds
    consumer=kafka_init()
    q=Queue()
    p=Process(target=consumeMessages,args=(q,))
    while(True):
        q.put(consumer.get_message())
        if time.time()>timeout:
            #transfer logic from process1 to main process here.
            print "Start time",starttime
            print "End time",time.time()
            p.start()
            p.join()
            break

Any help would be much appreciated.

anonuser0428

1 Answer

The problem you are dealing with is not Kafka-specific, so I'm going to use generic "messages" which are simply ints.

The main problem, it seems to me, is that on the one hand you want to process messages as soon as they are produced, and on the other hand only want to update the database every 60 seconds.

If you use q.get(), by default the call blocks until a message is available in the queue. That could take longer than 60 seconds, which would delay the database update too long. So we can't use an indefinitely blocking q.get(). Instead, call q.get() with a short timeout, so control returns to the loop at least once a second and it can check whether 60 seconds have elapsed:

import time
import multiprocessing as mp
import random
import Queue

def process_messages(q):
    messages = []
    start = time.time()
    while True:
        try:
            # wait at most 1 second for a message so the loop can check the clock
            message = q.get(timeout=1)
        except Queue.Empty:
            pass
        else:
            messages.append(message)
            print('Doing data analysis on {}'.format(message))
        end = time.time()
        if end - start > 60:
            # 60 seconds have elapsed; flush the accumulated batch to the database
            print('Updating database: {}'.format(messages))
            start = end
            messages = []

def get_messages(q):
    # stand-in for the Kafka consumer: emit a random int every 0-5 seconds
    while True:
        time.sleep(random.uniform(0,5))
        message = random.randrange(100)
        q.put(message)

if __name__ == "__main__":
    q = mp.Queue()

    proc1 = mp.Process(target=get_messages, args=[q])
    proc1.start()

    proc2 = mp.Process(target=process_messages, args=[q])
    proc2.start()

    proc1.join()
    proc2.join()

Running this produces output such as:

Doing data analysis on 38
Doing data analysis on 8
Doing data analysis on 8
Doing data analysis on 66
Doing data analysis on 37
Updating database: [38, 8, 8, 66, 37]
Doing data analysis on 27
Doing data analysis on 47
Doing data analysis on 57
Updating database: [27, 47, 57]
Doing data analysis on 85
Doing data analysis on 90
Doing data analysis on 86
Doing data analysis on 22
Updating database: [85, 90, 86, 22]
Doing data analysis on 8
Doing data analysis on 92
Doing data analysis on 59
Doing data analysis on 40
Updating database: [8, 92, 59, 40]
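
If you want to hook this up to the Kafka setup from the question, only get_messages needs to change: instead of generating random ints it would pull from the SimpleConsumer and push each message onto the queue. A minimal sketch, assuming the KafkaClient/SimpleConsumer setup and the consumer.get_message() call exactly as they appear in the question (get_message() can return None when no message is waiting, so that case is skipped):

from kafka.client import KafkaClient
from kafka import SimpleConsumer

def get_messages(q):
    # producer process: consume from Kafka continuously and push every
    # message onto the shared queue; broker, group and topic are copied
    # from the question
    client = KafkaClient('kafka1.wpit.nile.works')
    consumer = SimpleConsumer(client, "druidkafkaconsumer",
                              "personalization.targeting.clickstream.prod")
    while True:
        message = consumer.get_message()
        if message is not None:
            q.put(message)

The rest of the program (process_messages and the __main__ block) stays the same.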
unutbu