Some of my Kafka consumers (but not all) show an interesting pattern regarding their lag.

The following image shows two good examples:

enter image description here

enter image description here

dark-blue:

  • about 200 messages per second in topic
  • 32 partitions
  • 1 consumer in group (Python client, running on Kubernetes)

light-blue (same topic as dark-blue):

  • so also about 200 messages per second in topic
  • so also 32 partitions
  • 1 consumer in group (also a Python client, running on Kubernetes)

brown:

  • about 1500 messages per second in topic
  • 40 partitions
  • 2 consumers in group (Java/Spring client, running on Kubernetes)

Both sawtoothy clients can handle much higher throughput than that (tested by pausing them, resuming them and letting them catch up), so they are not running at their limit.

Rebalancing does happen sometimes (according to the logs), but much less often than the jumps in the diagram, and the few rebalancing events don't correlate in time with the jumps either.

The messages also do not come in batches. Here is the additional information for one of the affected topics:

enter image description here

enter image description here

enter image description here

Where could this pattern come from?

Tobias Hermann
  • Well, the clients need to round robin all partitions, blocking while they consume from ones. If you add more consumers to the group, do you see the same? – OneCricketeer Sep 25 '18 at 12:05
  • @cricket_007 Good idea! But I just looked at the other consumers and found a client that is a counterexample. I added it to the question. – Tobias Hermann Sep 25 '18 at 16:04
  • At first glance, it looks like the consumer group is rebalancing (maybe consumer threads are dying and restarting periodically). But otherwise, unless you send in batches of 3k records at once, that brown line looks strange – OneCricketeer Sep 25 '18 at 17:37
  • @cricket_007 There are rebalancing entries in Sentry, but they are much less frequent than the jumps in the diagram, and even the few that exist don't correlate in time with the jumps. Also, the messages are not arriving in batches. I've just added the graphs needed to show this to my question. – Tobias Hermann Sep 27 '18 at 11:50
  • @cricket_007 I [found the explanation](https://stackoverflow.com/a/52541503/1866775). Thanks for your help in investigating. :) – Tobias Hermann Sep 27 '18 at 16:35

1 Answer

Just found out that the low-frequency sawtooth pattern is not real. And the explanation is quite interesting. ;)

When I check the consumer lag using the command line (kafka-consumer-groups --bootstrap-server=[...] --group [...] --describe), I see that the total consumer lag (the sum of the lags per partition) fluctuates very quickly. At one point it's around 6000, 2 seconds later it's around 1000, and another 2 seconds later it might be 9000.
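As a minimal sketch of what "total consumer lag" means here: the lag of a partition is its log end offset minus the group's committed offset, and the total is the sum over all partitions. The function name `total_lag`, the plain-dict inputs and the offset values in the usage note are made up for illustration; they are not output of any Kafka tool or client.

```python
from typing import Dict


def total_lag(log_end_offsets: Dict[int, int],
              committed_offsets: Dict[int, int]) -> int:
    """Sum the per-partition lag of one consumer group on one topic.

    Both arguments map partition number -> offset (hypothetical inputs).
    """
    total = 0
    for partition, end_offset in log_end_offsets.items():
        # Assumption for this sketch: a partition without a committed
        # offset is treated as committed at offset 0.
        committed = committed_offsets.get(partition, 0)
        # Clamp at 0 so a slightly stale end offset cannot yield negative lag.
        total += max(end_offset - committed, 0)
    return total
```

For example, `total_lag({0: 1500, 1: 2300}, {0: 1200, 1: 2000})` adds a lag of 300 on each of the two partitions and returns 600.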

The graph shown, however, seems to be based on samples taken at a much lower frequency than these fluctuations, so the sampling does not satisfy the Nyquist–Shannon sampling theorem. The averaging therefore does not work, and we see aliasing, i.e. a Moiré pattern.

Conclusion: The sawtooth pattern is just an illusion.
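To put a rough number on the illusion, the standard aliasing formula gives the apparent period of an undersampled periodic signal. The sketch below assumes perfectly regular sampling with no jitter; the helper names `is_undersampled` and `apparent_period` are hypothetical, and the figures in the usage note are illustrative rather than measured from the real Kafka setup.

```python
def is_undersampled(signal_period: float, sample_interval: float) -> bool:
    """True if there are fewer than two samples per signal period."""
    return sample_interval > signal_period / 2


def apparent_period(signal_period: float, sample_interval: float) -> float:
    """Period of the alias seen when sampling a periodic signal.

    Assumes strictly regular sampling (no jitter).
    """
    signal_freq = 1.0 / signal_period
    sample_freq = 1.0 / sample_interval
    # The alias frequency is the distance from the signal frequency
    # to the nearest integer multiple of the sampling frequency.
    nearest_multiple = round(signal_freq / sample_freq) * sample_freq
    alias_freq = abs(signal_freq - nearest_multiple)
    if alias_freq == 0.0:
        # Every sample hits the same phase: the graph would be a flat line.
        return float('inf')
    return 1.0 / alias_freq
```

With a signal period of 100 time units and one sample every 97 units, `apparent_period(100, 97)` is 9700/3, roughly 3233 time units: a slow sawtooth about 33 times longer than the real one.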


For completeness, here is a simulation depicting the effect:

#!/usr/bin/env python3
"""Simulate moire effect of Kafka-consumer-lag graph.
"""

import random

import matplotlib.pyplot as plt


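def x_noise_sampling() -> int:
    # Constant offset plus random jitter applied to each sampling time.
    return 31 + random.randint(-6, 6)


def main() -> None:
    max_x = 7000
    sample_rate = 97  # distance between samples, close to the signal period of 100
    xs = list(range(max_x))
    ys = [x % 100 for x in xs]  # true signal: fast sawtooth with period 100
    # Take one (jittered) sample roughly every 97 steps, which is far too
    # sparse to capture a signal with period 100.
    xs2 = [x + x_noise_sampling() for x in range(0, max_x - 100, sample_rate)]
    ys2 = [ys[x2] for x2 in xs2]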

    plt.figure(figsize=(16, 9))
    plt.xlabel('Time')
    plt.xticks([])
    plt.yticks([])
    plt.ylabel('Consumer lag')
    signal, = plt.plot(xs, ys, '-')
    samples, = plt.plot(xs2, ys2, 'bo')
    interpolated, = plt.plot(xs2, ys2, '-')
    plt.legend([signal, samples, interpolated], ['Signal', 'Samples', 'Interpolated samples'])
    plt.savefig('sawtooth_moire.png', dpi=100)
    plt.show()


if __name__ == '__main__':
    main()

enter image description here

Tobias Hermann
  • Curious - what led you down this path of investigation? And what library were you using to do monitoring anyway? FWIW, I recently stumbled on this https://github.com/zalando-incubator/remora – OneCricketeer Sep 27 '18 at 21:11
  • @cricket_007 It was literally a showerthought, so mainly luck. ;) The tool is Prometheus. – Tobias Hermann Sep 28 '18 at 06:55