
Good day,

I am running a Flink (v1.7.1) streaming job on AWS EMR 5.20, and I would like to have all of my job's TaskManager and JobManager logs in S3. Logback is used, as recommended by the Flink team. Since it is a long-running job, I want the logs to be:

  1. Copied to S3 periodically
  2. Rolled by time, size, or both (as there might be a huge volume of logs)
  3. Cleaned from the local disks of the EMR nodes (otherwise the disks will fill up)

What I have tried so far:

  1. Enabled logging to S3 when creating the EMR cluster
  2. Configured YARN rolling log aggregation with: yarn.log-aggregation-enable, yarn.nodemanager.remote-app-log-dir, yarn.log-aggregation.retain-seconds, yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds (see the configuration sketch after this list)
  3. Configured rolling logs in logback.xml:
    <appender name="ROLLING" class="ch.qos.logback.core.rolling.RollingFileAppender">
        <file>${log.file}</file>
        <rollingPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedRollingPolicy">
            <fileNamePattern>%d{yyyy-MM-dd HH}.%i.log</fileNamePattern>
            <maxFileSize>30MB</maxFileSize>    
            <maxHistory>3</maxHistory>
            <totalSizeCap>50MB</totalSizeCap>
        </rollingPolicy>
        <encoder>
            <pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %-5level %logger{60} %X{sourceThread} - %msg%n</pattern>
        </encoder>
    </appender>
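
For reference, the yarn-site properties in (2) can be supplied at cluster creation through an EMR configuration classification like the one below (the bucket name and values are illustrative placeholders, not necessarily what I used):

    [
      {
        "Classification": "yarn-site",
        "Properties": {
          "yarn.log-aggregation-enable": "true",
          "yarn.nodemanager.remote-app-log-dir": "s3://my-log-bucket/yarn-logs",
          "yarn.log-aggregation.retain-seconds": "86400",
          "yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds": "3600"
        }
      }
    ]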

What I have observed so far:

  • (1) did help with periodically copying the log files to S3
  • (2) has seemed useless so far: logs are only aggregated when the streaming job ends, and no rolling has been observed.
  • (3) yielded some results, but is not close to the requirements yet:
    • the rolled log files are there, but only in a cache folder (/mnt/yarn/usercache/hadoop/appcache/application_1549236419773_0002/container_1549236419773_0002_01_000002)
    • only the latest rolled log file is available in the usual YARN logs folder (/mnt/var/log/hadoop-yarn/containers/application_1549236419773_0002/container_1549236419773_0002_01_000002)
    • only the latest rolled log file is available in S3

In short, out of my three requirements, I could satisfy either (1) alone or (2) and (3) together, but not all three at once.

Could you please help me with this?

Thanks and best regards,

Averell

1 Answer


From what I know, the automatic backup of logs to S3 that EMR supports only works at the end of the job, since it's based on the background log-loader that AWS originally implemented for batch jobs. Maybe there's a way to get it to work for rolling logs; I've just never heard of one.

I haven't tried this myself, but if I had to then I'd probably try the following:

  1. Mount S3 on your EC2 instances via S3fs.
  2. Set up logrotate (or equivalent) to automatically copy and clean up the log files.

You can use a bootstrap action to automatically set up all of the above.
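
A sketch of what that bootstrap action might look like (untested; the bucket name, mount point, and rotation settings are placeholders, and it assumes s3fs-fuse is installable from EPEL on your AMI):

    #!/bin/bash
    # Sketch of an EMR bootstrap action (untested): mount an S3 bucket with
    # s3fs and rotate YARN container logs into it. The bucket name, mount
    # point, and rotation settings below are placeholders.
    set -euo pipefail

    BUCKET=my-log-bucket          # placeholder bucket name
    MOUNT=/mnt/s3logs

    # Install s3fs-fuse (assumes it is available via EPEL; otherwise build it).
    sudo yum install -y epel-release || true
    sudo yum install -y s3fs-fuse

    # Mount the bucket, using the instance profile's IAM role for credentials.
    sudo mkdir -p "$MOUNT"
    sudo s3fs "$BUCKET" "$MOUNT" -o iam_role=auto -o allow_other

    # Rotate container logs by size, compress old files, and move the
    # archives onto the S3 mount so the local disk stays small.
    # copytruncate lets the still-running JVM keep its open file handle.
    sudo tee /etc/logrotate.d/yarn-containers >/dev/null <<'EOF'
    /mnt/var/log/hadoop-yarn/containers/*/*/*.log {
        size 30M
        rotate 3
        compress
        missingok
        copytruncate
        olddir /mnt/s3logs
    }
    EOF

    # logrotate only runs daily by default; run this config more often via cron.
    echo '*/15 * * * * root /usr/sbin/logrotate /etc/logrotate.d/yarn-containers' \
        | sudo tee /etc/cron.d/yarn-logrotate >/dev/null

One caveat: with a single olddir, rotated files from different containers could collide on file names, so in practice you'd probably derive a per-container destination in a wrapper script.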

If S3fs gives you problems, then you can do a bit more scripting and use the aws s3 command directly to sync the logs, removing them once they've been copied.
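
That version could be as simple as a cron script like this (again untested; the bucket and paths are placeholders):

    #!/bin/bash
    # Sketch (untested): push YARN container logs to S3 with `aws s3 sync`,
    # then delete rotated files old enough to have been uploaded already.
    LOG_DIR=/mnt/var/log/hadoop-yarn/containers
    DEST="s3://my-log-bucket/flink-logs/$(hostname)"   # placeholder bucket

    aws s3 sync "$LOG_DIR" "$DEST"

    # Only remove rotated archives (e.g. *.log.1), never the live *.log
    # files, and only once they are over an hour old.
    find "$LOG_DIR" -type f -name '*.log.*' -mmin +60 -delete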

kkrugler
  • Thank you @kkrugler. As per this page https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-debugging.html, logs in EMR are copied to S3 every 5 minutes, but it seems that only some specific folders are monitored. I am looking for a place to configure this, and will try your suggested solution as well. – Averell Feb 09 '19 at 22:56
  • @Averell Did you find a solution? I'm in the same spot. – Jack Mar 14 '19 at 13:26
  • @JackTuck: I haven't. I will update the post if I find one. – Averell Mar 19 '19 at 11:10
  • @Averell I have set up CloudWatch in the interim. It does the trick, but it's not ideal: I just have a config which picks up log files from all the YARN node/resource managers, TaskManagers, and JobManagers, with the TTL set to 1 day. – Jack Mar 19 '19 at 12:00