
I am trying to attach Sentry to our Flink cluster to track job execution. Sentry acts as a logger that captures messages and sends them to a central server. By default it captures all messages at level WARN or higher.

To get Sentry to catch all problems, I need to write a WARN or ERROR log message whenever an operator raises an uncaught exception. If the restart strategy fails, the execute() method of the Execution Environment throws the final exception, which I can log appropriately. But I have not yet found a way to log exceptions that cause the job to restart. Flink logs them as INFO messages, which makes them difficult to filter from the rest.
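
For illustration, here is roughly what I mean (a minimal sketch; the class and job names are made up): once the restart strategy is exhausted, execute() throws the final exception and I can log it at ERROR, which Sentry then picks up.

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class JobRunner {

    private static final Logger LOG = LoggerFactory.getLogger(JobRunner.class);

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // ... build the pipeline here ...

        try {
            env.execute("my-job");
        } catch (Exception e) {
            // Reached only after the restart strategy gives up; exceptions that
            // merely trigger a restart never surface here.
            LOG.error("Job failed permanently", e);
            throw e;
        }
    }
}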

What is the appropriate way to handle uncaught exceptions in Flink jobs?

Tzanko Matev

2 Answers


From Flink's perspective, user code errors are expected and, hence, Flink does not log them at WARN or ERROR. WARN and ERROR are reserved for logging statements that indicate something is wrong with Flink itself.

The best option to capture task failures is to grep for <TASK_NAME> switched from RUNNING to FAILED. That way you will be notified whenever <TASK_NAME> fails. Note, however, that the exact wording of this logging statement is not guaranteed to stay the same across Flink versions.

David Anderson
Till Rohrmann
  • Thank you for the explanation. Don't you think that this policy should change now that we can run a single job per job manager? From the user's perspective there is little difference between Flink failing and the job failing. If job errors were logged at WARN level, that would simplify monitoring quite a lot. – Tzanko Matev Feb 04 '20 at 08:06
  • I don't think that a recoverable user code error warrants a WARN-level logging statement if errors are expected, not even in per-job mode. In general, I don't think that logging statements are the best way to monitor Flink. I would rather suggest using Flink's metrics for this. There are also metrics that tell you the number of restarts, which might be helpful (see the sketch after this comment): https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/metrics.html#availability – Till Rohrmann Feb 04 '20 at 16:11
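
To illustrate the metrics suggestion in the comment above, here is a rough sketch of polling the fullRestarts availability metric through the JobManager's REST API. The host, the default port 8081, and the placeholder job id are assumptions rather than something from this thread; a monitoring script could alert (e.g. via Sentry) whenever the value increases between polls.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class RestartMonitor {

    public static void main(String[] args) throws Exception {
        String jobId = "replace-with-your-job-id"; // placeholder, not a real id
        URL url = new URL("http://localhost:8081/jobs/" + jobId + "/metrics?get=fullRestarts");

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");

        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            // Expected shape of the response: [{"id":"fullRestarts","value":"3"}]
            String body = reader.readLine();
            System.out.println(body);
            // Parse the value and raise an alert when it grows between polls.
        } finally {
            conn.disconnect();
        }
    }
}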

Disclaimer: I am not (yet) a big expert in Flink.

Both the general logs and the exceptions can be found; here is how.

General logs

Configure Flink logging (do not forget to add logback.xml to the resources folder), and do not forget to set the necessary log settings in the Flink application as well.

After this, I was able to see log messages down to INFO level. To log something "inside" the JDBC sink, consider:

// Requires the flink-connector-jdbc dependency on the classpath.
JdbcSink.sink(
    "insert into my_table (id) values (?);",
    (statement, event) -> {
      // Resolve the enclosing class for the logger, since `this` is not usable in a lambda.
      Logger logger = LoggerFactory.getLogger(new Object(){}.getClass().getEnclosingClass());
      logger.info("LoggingWorks!!!!");
      try {
        statement.setString(1, UUID.randomUUID().toString());
      } catch (Exception ex) {
        throw new RuntimeException(ex);
      }
    },
    JdbcExecutionOptions.builder()
        // omitted
        .build(),
    new JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
        // omitted
        .build()
);

LoggerFactory.getLogger(new Object(){}.getClass().getEnclosingClass()) is not a perfect solution, but it makes debugging inside the lambda possible.

Exception logs

I cannot provide a general solution, but in AWS there is an Apache Flink dashboard button where you can see all Flink jobs running inside the deployed application. Just click on a job name in the list and you will see the Exceptions page with the errors.

Yes, this is for AWS, but I believe the same dashboard exists for on-premise clusters as well.

Cherry