6

Are there any alerting options for scenarios where a Kafka Connect Connector or a Connector task fails or experiences errors?

We have Kafka Connect running and it generally runs well, but errors have to be traced and discovered manually. Often a connector has been in an error state for a week before anyone notices the problem.

clay

6 Answers

3

(I still can't comment so to respond to clay's answer...)

NOTE: There is a bug in the JMX metrics for task/connector status (at time of posting: 5/11/2020)

1) When a task fails, its status metrics disappear. This is a known issue and a fix is in progress. A Jira can be found here and a PR can be found here.

2) Don't use the connector metric to monitor the status of the tasks. The connector can show up as running fine while its tasks are in a failed state, so you need to monitor the tasks directly. This is mentioned in Confluent's connector monitoring tips, which say:

In most cases, connector and task states will match, though they may be different for short periods of time when changes are occurring or if tasks have failed. For example, when a connector is first started, there may be a noticeable delay before the connector and its tasks have all transitioned to the RUNNING state. States will also diverge when tasks fail since Connect does not automatically restart failed tasks.
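
As a rough illustration of that point (a sketch, not something taken from Confluent's documentation), the check below treats a connector as healthy only when the connector itself and every one of its tasks report RUNNING. The field names follow the JSON returned by GET /connectors/<name>/status; the function names and the sample payload are hypothetical.

```python
def non_running_tasks(status):
    """Return (task_id, state) for every task that is not RUNNING."""
    return [
        (task["id"], task["state"])
        for task in status.get("tasks", [])
        if task["state"] != "RUNNING"
    ]


def is_healthy(status):
    """Healthy only if the connector AND all of its tasks are RUNNING."""
    connector_running = status["connector"]["state"] == "RUNNING"
    return connector_running and not non_running_tasks(status)


# Hypothetical payload: the connector reports RUNNING while one task has failed.
sample_status = {
    "name": "example-connector",
    "connector": {"state": "RUNNING"},
    "tasks": [
        {"id": 0, "state": "RUNNING"},
        {"id": 1, "state": "FAILED"},
    ],
}

problems = non_running_tasks(sample_status)
healthy = is_healthy(sample_status)
```

Note that this sketch also flags PAUSED or UNASSIGNED tasks, which may or may not be what you want to alert on.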

bmoe24x
2

Building on what Randall says, this shell script uses the Confluent CLI to show the state of all connectors and tasks. You could use that as the basis of alerting:

Robin@asgard02 ~/c/confluent-3.3.0> ./bin/confluent status connectors| \
                                    jq '.[]'| \
                                    xargs -I{connector} ./bin/confluent status {connector}| \
                                    jq -c -M '[.name,.connector.state,.tasks[].state]|join(":|:")'| \
                                    column -s : -t| \
                                    sed 's/\"//g'| \
                                    sort

file-sink-mysql-foobar       |  RUNNING  |  RUNNING
jdbc_source_mysql_foobar_01  |  RUNNING  |  RUNNING
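
To turn that output into an alert, one possible approach is to parse the table and report every connector with a column that is not RUNNING. The following is only a sketch under that assumption: the function names are made up for illustration, and the third sample row is hypothetical (it is not from the output above).

```python
def parse_status_table(text):
    """Map connector name -> list of states (connector state first, then tasks)."""
    rows = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # skip blank lines or anything that is not a table row
        name, *states = [cell.strip() for cell in line.split("|")]
        rows[name] = states
    return rows


def connectors_needing_attention(text):
    """Names of connectors where the connector or any task is not RUNNING."""
    return sorted(
        name
        for name, states in parse_status_table(text).items()
        if any(state != "RUNNING" for state in states)
    )


# First two rows are copied from the output above; the last row is hypothetical.
SAMPLE_OUTPUT = """\
file-sink-mysql-foobar       |  RUNNING  |  RUNNING
jdbc_source_mysql_foobar_01  |  RUNNING  |  RUNNING
hypothetical-failed-sink     |  RUNNING  |  FAILED
"""

to_alert = connectors_needing_attention(SAMPLE_OUTPUT)
```

In practice you would feed the script's real output into `connectors_needing_attention` from a cron job or similar and send a notification when the result is non-empty.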
Robin Moffatt
1

One option is to use Kafka Connect's REST API to check the health of the worker and the status of the connectors. This approach is easy to automate with simple scripts or most monitoring systems. It works with both standalone and distributed workers; in the latter case you can send requests to any Kafka Connect worker in the cluster.

If you want to check the health of all the connectors, the first step is to get the list of deployed connectors:

GET /connectors

That returns a JSON array of connector names. For each of those, issue a request to check the status of the named connector:

GET /connectors/(string: name)/status

The response will include status information about the connector and its tasks. For example, the following shows a connector that is running two tasks, with one of those tasks still running and the other having failed with an error:

HTTP/1.1 200 OK

{
    "name": "hdfs-sink-connector",
    "connector": {
        "state": "RUNNING",
        "worker_id": "fakehost:8083"
    },
    "tasks":
    [
        {
            "id": 0,
            "state": "RUNNING",
            "worker_id": "fakehost:8083"
        },
        {
            "id": 1,
            "state": "FAILED",
            "worker_id": "fakehost:8083",
            "trace": "org.apache.kafka.common.errors.RecordTooLargeException\n"
        }
    ]
}

This is just a sample of what the REST API allows you to do.
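
The two requests above could be combined into a small polling check along these lines. This is a minimal sketch using only the Python standard library: the function names are illustrative, the base URL is whatever address your worker listens on, and the `fetch` parameter exists only so the logic can be exercised without a live cluster.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def fetch_json(url, timeout=10):
    """GET a URL and decode the JSON body."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def find_failures(base_url, fetch=fetch_json):
    """Return (connector, component, trace) for every FAILED connector or task."""
    failures = []
    # Step 1: GET /connectors returns a JSON array of connector names.
    for name in fetch(f"{base_url}/connectors"):
        # Step 2: GET /connectors/<name>/status for each connector.
        status = fetch(f"{base_url}/connectors/{quote(name, safe='')}/status")
        connector = status["connector"]
        if connector["state"] == "FAILED":
            failures.append((name, "connector", connector.get("trace", "")))
        # Check tasks separately: they can fail while the connector is RUNNING.
        for task in status.get("tasks", []):
            if task["state"] == "FAILED":
                failures.append((name, f"task {task['id']}", task.get("trace", "")))
    return failures


# Example call (assumes a worker reachable at this address):
#   find_failures("http://localhost:8083")
```

Run from a scheduler, a non-empty result can be forwarded to email, chat, or a paging system of your choice.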

Randall Hauch
Comment: This was the best answer when it was written, but newer versions of Confluent offer native metrics which are better for automated monitoring + alerting. – clay May 05 '20 at 14:51
1

I know this is a really old question, but we ran into a similar issue. We use Kafka Connect heavily, and it is very difficult to monitor each connector individually, especially when you are managing more than 150 connectors.

So we wrote a small Kotlin-based application that accepts a config.json in which you specify the cluster config. If an SMTP config is specified, it polls the cluster at the configured interval and sends email alerts.

If it fits your use case, please use it, and raise an issue if you run into any problems.

The link to the repo is as follows: https://github.com/gunjdesai/kafka-connect-monit

The image is also published on Docker Hub, and you can run it directly with the following command:

docker run -d -v <location-of-your-config-file.json>:/home/code/config.json gunjdesai/kafka-connect-monit

Hope this is helpful to you.

gunj_desai
0

Since this post was written and answered, Kafka Connect has begun providing its own official metrics. Apache Kafka Connect exposes these metrics in the legacy JMX format.

If you use the Confluent Kafka Connect Helm Charts (https://github.com/confluentinc/cp-helm-charts/tree/master/charts/cp-kafka-connect), they include a Prometheus metrics exporter.

I monitor and alert on cp_kafka_connect_connect_connector_metrics{status="running"}, which comes from the Prometheus exporter in the Confluent Helm chart, but there are many variations on that.

Using the official Kafka Connect metrics is generally preferable for any automated monitoring + alerting setup. This option wasn't available back when this post was written + answered.

FYI, Kafka still doesn't expose lag metrics, so you still need third-party options to monitor and alert on lag.

clay
0

I know it's a bit late, but this might complement what people have suggested here. One way to improve your Kafka Connect cluster monitoring is to use this Kafka Connect REST extension: https://github.com/LoObp4ck/kafka-connect-healthchecks

Then have a periodic monitoring job check this endpoint to ensure that all connector tasks are running fine. We use it in production and it does the job.

The jar is also available on Maven Central:

<dependency>
  <groupId>net.loobpack.kafka-connect-healthchecks</groupId>
  <artifactId>kafka-connect-healthcheck-extension</artifactId>
  <version>1.0.0</version>
</dependency>
Yannick