I am using Spark in batch mode to process logs that come from Kafka. In each cycle my code should consume whatever has reached the Kafka consumer, but I want to put a restriction on the amount of data fetched from Kafka per cycle, say 5 GB or 500000 log lines. My current code looks like this:
from pyspark.streaming.kafka import KafkaUtils, OffsetRange

offsetRanges = []

def storeOffsetRanges(rdd):
    global offsetRanges
    offsetRanges = rdd.offsetRanges()
    # WRITE OFFSETS TO DISK
    return rdd

while True:
    host = "localhost:9092"
    # topic, fromOffset and untilOffset are tracked across cycles
    offset = OffsetRange(topic, 0, fromOffset, untilOffset)
    kafka_content = KafkaUtils.createRDD(sc, {"metadata.broker.list": host}, [offset])
    storeOffsetRanges(kafka_content)  # plain RDDs have no transform(), so call the helper directly
    # RDD TRANSFORMATIONS..
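For the "WRITE OFFSETS TO DISK" step, this is roughly what I am planning; it is only a minimal sketch, and the file path and JSON layout are placeholders I made up:

import json

# Persist the range that was just processed so the driver can resume
# from untilOffset after a crash. Path and format are placeholders.
def writeOffsetsToDisk(topic, partition, fromOffset, untilOffset,
                       path="/tmp/kafka_offsets.json"):
    with open(path, "w") as f:
        json.dump({"topic": topic, "partition": partition,
                   "fromOffset": fromOffset, "untilOffset": untilOffset}, f)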
I will store the offsets in memory and on disk in case of driver failure. But how can I use these Kafka offsets to enforce a maximum amount of data per cycle? And what are the units of a Kafka offset range?
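For reference, the kind of cap I have in mind looks like the sketch below. It assumes each offset corresponds to one message, so a 500000-line cap maps directly to an offset delta, while a byte cap like 5 GB would only be approximated; latestOffset is a hypothetical value I would still need to fetch from the broker:

MAX_MESSAGES = 500000  # per-cycle cap, assuming one offset == one log line

# latestOffset would come from querying the broker for the current end
# of the partition; how to obtain it is part of my question.
untilOffset = min(fromOffset + MAX_MESSAGES, latestOffset)
offset = OffsetRange(topic, 0, fromOffset, untilOffset)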
Thanks in advance!