
I am using Spark in batch mode to process logs that come from Kafka. In each cycle my code should get whatever has reached the Kafka consumer. However, I want to put a restriction on the amount of data fetched from Kafka per cycle, say 5 GB or 500,000 log lines.

from pyspark.streaming.kafka import KafkaUtils, OffsetRange

offsetRanges = []

def storeOffsetRanges(rdd):
    global offsetRanges
    offsetRanges = rdd.offsetRanges()
    # WRITE OFFSETS TO DISK
    return rdd

while True:
    host = "localhost:9092"
    offset = OffsetRange(topic, 0, fromOffset, untilOffset)
    kafka_content = KafkaUtils.createRDD(sc, {"metadata.broker.list": host}, [offset])
    kafka_content = storeOffsetRanges(kafka_content)  # plain RDDs have no transform(); call the function directly
    # RDD TRANSFORMATIONS..

I will store the offsets in memory and on disk in case of driver failure. But how can I use these Kafka offsets to enforce a maximum amount of data per cycle? And what are the units of Kafka offset ranges?

Thanks in advance!

João

1 Answer


Kafka offset units are messages. At each cycle you will get at most untilOffset - fromOffset messages from Kafka. However, that OffsetRange reads from only one topic partition, so if your topic has more partitions, the application will miss some log lines.
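To cover every partition while still capping the cycle size, you can build one OffsetRange per partition. A minimal sketch (not from the question's code; the topic name, partition offsets, and the next_offset_ranges helper are illustrative assumptions, and it presumes the requested offsets actually exist in Kafka):

from pyspark.streaming.kafka import KafkaUtils, OffsetRange

def next_offset_ranges(topic, saved_offsets, max_messages):
    # saved_offsets: dict of partition -> next offset to read,
    # e.g. loaded from the offsets you persisted to disk
    per_partition = max_messages // len(saved_offsets)  # even split of the cap
    return [OffsetRange(topic, p, start, start + per_partition)
            for p, start in saved_offsets.items()]

# cap one cycle at 500000 messages across 3 partitions of a hypothetical "logs" topic
ranges = next_offset_ranges("logs", {0: 0, 1: 0, 2: 0}, 500000)
rdd = KafkaUtils.createRDD(sc, {"metadata.broker.list": "localhost:9092"}, ranges)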

As an alternative, you can try Spark Streaming with the Kafka direct approach. With this method you get rid of the while True loop and work with log lines in time-based microbatches (rather than fixed offset ranges), with an optional backpressure mechanism. You can then skip keeping offsets in memory (streaming handles that), but saving them to disk is still necessary to survive a driver restart (see fromOffsets in KafkaUtils.createDirectStream).
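As a rough illustration of that setup (the topic name, batch interval, rate values, and starting offsets below are placeholders, not recommendations):

from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils, TopicAndPartition

conf = (SparkConf()
        .set("spark.streaming.backpressure.enabled", "true")         # adaptive rate control
        .set("spark.streaming.kafka.maxRatePerPartition", "10000"))  # hard cap: msgs/partition/sec
sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 3600)  # e.g. hourly microbatches

# resume from the offsets previously written to disk
fromOffsets = {TopicAndPartition("logs", 0): 0, TopicAndPartition("logs", 1): 0}
stream = KafkaUtils.createDirectStream(
    ssc, ["logs"], {"metadata.broker.list": "localhost:9092"},
    fromOffsets=fromOffsets)

With backpressure plus maxRatePerPartition, each batch is bounded at roughly rate × partitions × batch interval messages, which gives you the per-cycle cap from the question.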

Mariusz
  • I considered using Spark Streaming, but I think using Spark in batch mode is better in my case. I need to compute metrics hourly, and I started out using streaming with 1-hour windows. The problem is: in streaming, if the driver dies during processing, the data already consumed from Kafka is lost. With batch, the data stays on disk until all the processing is done, and only then do I delete it. Moreover, I found streaming very difficult to debug. – João Jan 27 '17 at 22:03