
We are trying to integrate Kafka KSQL into our system and would like to share some problems we could not solve during the process. We have 3 Kafka nodes in our cluster; each server has:

8 CORE  
50G+ RAM  
100G ssd  

On each server we run ZooKeeper to manage the cluster. All the OS limits are increased so each node can use more resources than it needs:

Xmx: 10G  
Xms: 10G  
nofiles: 500000

For now, the traffic to the cluster from the producer is minor (~10 messages per second). Right now we have only one producer, and the message format is:

{"user_id": <id|INT>, "action_id": <id|INT>, "amount": <amount|FLOAT>}
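For reference, a producer emitting messages in this format could be sketched like this (a minimal sketch; the broker address, topic name, and use of the kafka-python client are illustrative assumptions, not part of our setup):

```python
import json

def encode_event(user_id: int, action_id: int, amount: float) -> bytes:
    """Serialize one event in the topic's JSON wire format."""
    return json.dumps(
        {"user_id": user_id, "action_id": action_id, "amount": amount}
    ).encode("utf-8")

# With the kafka-python client the message could then be published as:
#
#   from kafka import KafkaProducer
#   producer = KafkaProducer(bootstrap_servers="localhost:9092")
#   producer.send("some_topic", encode_event(42, 7, 19.99))
#   producer.flush()
```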

The topic in Kafka is divided into 6 partitions with a replication factor of 1:

Topic:<some_topic>   PartitionCount:6        ReplicationFactor:1     Configs:
        Topic: <some_topic>  Partition: 0    Leader: 0       Replicas: 0     Isr: 0
        Topic: <some_topic>  Partition: 1    Leader: 1       Replicas: 1     Isr: 1
        Topic: <some_topic>  Partition: 2    Leader: 2       Replicas: 2     Isr: 2
        Topic: <some_topic>  Partition: 3    Leader: 0       Replicas: 0     Isr: 0
        Topic: <some_topic>  Partition: 4    Leader: 1       Replicas: 1     Isr: 1
        Topic: <some_topic>  Partition: 5    Leader: 2       Replicas: 2     Isr: 2

Now, of course, the nodes are underutilized, and on the Kafka side everything is more than OK )

We would like to use KSQL on top of Kafka to be able to filter the data coming into our system with SQL. Here are the KSQL server's resources:

32 CORE
100G+ RAM
50G+ ssd

We have only one table:

 Field   | Type                      
-------------------------------------
 ROWTIME   | BIGINT           (system) 
 ROWKEY    | VARCHAR(STRING)  (system) 
 ACTION_ID | INTEGER                   
 USER_ID   | INTEGER                   
 AMOUNT    | DOUBLE         

Here is the command the table was created with:

CREATE TABLE <some_table> (action_id INT, user_id INT, amount DOUBLE) WITH (KAFKA_TOPIC='<some_topic>', VALUE_FORMAT='JSON', KEY='user_id');

In our application, we need to subscribe to the table by user_id, like this:

SELECT * FROM <some_table> WHERE USER_ID=<some_user_id>;

For the production KSQL server configuration, we use the official recommendations from Confluent: https://docs.confluent.io/current/ksql/docs/installation/server-config/config-reference.html#recommended-ksql-production-settings

The OS and software limits are also increased for the KSQL server:

Xmx: 10G  (we have tried up to 50G)
Xms: 10G  (we have tried up to 50G)
nofiles: 500000

With only one subscription, we don't get any issues (everything is fine in that case).
But we need more than 200,000 subscriptions overall, and when we try to run 100-200 parallel subscriptions, we start getting read timeouts in our client. On the server, we do not see any abnormal load that could affect KSQL.
We suspect the issue lies with KSQL itself, because when we run a second KSQL server on a different machine at the same time, that second server works fine and can handle another 1-20 subscriptions.

I could not find any KSQL benchmark on the internet, and the documentation does not mention KSQL's intended use cases either. Maybe it is designed to serve only a few connections with huge data volumes, or maybe our system is misconfigured and we should fix it so the software works for our goals.
Any suggestion would be helpful.
Thanks in advance )

Matthias J. Sax
Rafik Avtoyan
    Can you provide a bit more context on the use case and why you need to have 200K queries. – Hojjat Aug 27 '18 at 18:05
  • I will give a very simple case that we need to implement in our system, we can have more than 200000 active users in our system at peak times. So each user should subscribe e.g. to his balance updates. Also, I would like to note, that we use autoscaler in the cloud, and one instance(server) can serve N number of users. So each instance needs to get only the balance updates for its users, and not for all the users. So for us one user, for now, is one query(subscription). – Rafik Avtoyan Aug 28 '18 at 08:26
  • I would like to add, that there is another service who will fill(produce) to kafka all the balance updates for all the users. So it should be great if I could subscribe from my application by some filter. – Rafik Avtoyan Aug 28 '18 at 08:28
  • I have a simple diagram here that will help to understand the problem we have: https://drive.google.com/file/d/1rLp64LjMO8zeIxZ7CrHXV3y3j68T7LXl/view?usp=sharing – Rafik Avtoyan Aug 28 '18 at 08:50
  • I'm also not sure why you'd have 200K unique queries given that your data is so simple. Perhaps you mean you'll have 200K consumers of this topic? And you're using KSQL for consuming this data rather than the plain Kafka Consumer libraries? – OneCricketeer Nov 03 '18 at 05:19
  • Thanks for your reply, but the goal is that we need many small, separate queues so we can split the traffic per user update; that way a specific node would receive only the traffic it needs. But as I understand it, Kafka is not designed to have many partitions (or topics); it is designed to handle communication between applications, and you cannot change the partition count dynamically: it only offers a CLI API for that. – Rafik Avtoyan Nov 03 '18 at 08:21
  • For each topic (partition) Kafka needs to create a file (directory) structure, and when we tried to separate our traffic by partitions, it took too long with Kafka. So we moved to Redis, and Redis provides really great pub/sub for our case: we do not need guaranteed delivery, we need to change the channel count dynamically with high performance for our messages. With Redis we can create a new channel at runtime for each active user, and the higher-level application gets only the traffic directed to the users its node is responsible for. – Rafik Avtoyan Nov 03 '18 at 08:21
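The per-user channel pattern described in the comments above could be sketched like this (a sketch only; the channel naming scheme, host, and payload shape are assumptions, not details from the comments):

```python
import json

def channel_for(user_id: int) -> str:
    """Per-user pub/sub channel name; the naming scheme is illustrative."""
    return f"balance_updates:{user_id}"

# With the redis-py client, the producing service publishes each balance
# update to its user's channel, and an application node subscribes only
# to the channels of the users it is responsible for:
#
#   import redis
#   r = redis.Redis(host="localhost", port=6379)
#   r.publish(channel_for(42), json.dumps({"user_id": 42, "amount": 19.99}))
#
#   p = r.pubsub()
#   p.subscribe(channel_for(42))
#   for message in p.listen():
#       ...  # handle only this user's updates
```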

1 Answer


The reason you're running into scalability issues with ksqlDB is that you're using push queries in a way they were not designed to be used.... yet!

The push query:

SELECT * FROM <some_table> WHERE USER_ID=<some_user_id>;

which you're using to subscribe to updates for a specific user, seems a totally sensible thing to do.

However, in the version of ksql you're using, such push queries were only intended for use by humans executing commands at the CLI. Each such query will, internally, consume a chunk of server resources and consume ALL rows from the source topic.

Basically, push queries do not scale.

The ksqlDB team is actively working on enhancing ksql to support this exact style of use case, as we recognise this is a common thing to want to do. (See https://github.com/confluentinc/ksql/issues/5517.)

In the meantime, the way to achieve this is to consume the data directly from Kafka using your own consumers and do the filtering locally.
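That workaround could be sketched as follows (a sketch under assumptions: the kafka-python client, broker address, topic name, and the `handle` callback are all illustrative, not part of this answer):

```python
import json

def wanted(event: dict, my_user_ids: set) -> bool:
    """Keep only events for the users this node is responsible for."""
    return event.get("user_id") in my_user_ids

# With the kafka-python client, each application node consumes the whole
# topic and applies the filter locally:
#
#   from kafka import KafkaConsumer
#   consumer = KafkaConsumer(
#       "some_topic",
#       bootstrap_servers="localhost:9092",
#       value_deserializer=lambda v: json.loads(v.decode("utf-8")),
#   )
#   my_user_ids = {42, 43}
#   for msg in consumer:
#       if wanted(msg.value, my_user_ids):
#           handle(msg.value)  # application-specific processing
```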

Andrew Coates