
Good day,

I am running a Flink (v1.7.1) streaming job on AWS EMR 5.20, and I would like to have all of my job's TaskManager and JobManager logs in S3. Logback is used, as recommended by the Flink team. Since it is a long-running job, I want the logs to be:

  1. Copied to S3 periodically
  2. Rolled by time, size, or both (as there might be a huge volume of logs)
  3. Cleaned from the local disks of the EMR nodes (otherwise the disks will fill up)

What I have tried so far:

  1. Enabled logging to S3 when creating the EMR cluster
  2. Configured YARN rolling log aggregation with: yarn.log-aggregation-enable, yarn.nodemanager.remote-app-log-dir, yarn.log-aggregation.retain-seconds, yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds (see the configuration sketch after this list)
  3. Configured rolling logs in logback.xml:
    <appender name="ROLLING" class="ch.qos.logback.core.rolling.RollingFileAppender">
        <file>${log.file}</file>
        <rollingPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedRollingPolicy">
            <fileNamePattern>%d{yyyy-MM-dd HH}.%i.log</fileNamePattern>
            <maxFileSize>30MB</maxFileSize>    
            <maxHistory>3</maxHistory>
            <totalSizeCap>50MB</totalSizeCap>
        </rollingPolicy>
        <encoder>
            <pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %-5level %logger{60} %X{sourceThread} - %msg%n</pattern>
        </encoder>
    </appender>
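
For reference, the yarn-site properties in (2) can be supplied at cluster creation through an EMR configuration classification like the one below (the bucket name and values are illustrative placeholders, not necessarily what I used):

    [
      {
        "Classification": "yarn-site",
        "Properties": {
          "yarn.log-aggregation-enable": "true",
          "yarn.nodemanager.remote-app-log-dir": "s3://my-log-bucket/yarn-logs",
          "yarn.log-aggregation.retain-seconds": "86400",
          "yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds": "3600"
        }
      }
    ]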

What I have observed so far:

  • (1) did help with periodically copying the log files to S3
  • (2) has seemed useless so far: logs are only aggregated when the streaming job ends, and no rolling has been observed.
  • (3) yielded some results, but is not close to the requirements yet:
    • the rolled log files are there, but only in a cache folder (/mnt/yarn/usercache/hadoop/appcache/application_1549236419773_0002/container_1549236419773_0002_01_000002)
    • only the latest rolled log file is available in the usual YARN logs folder (/mnt/var/log/hadoop-yarn/containers/application_1549236419773_0002/container_1549236419773_0002_01_000002)
    • only the latest rolled log file is available in S3

In short, out of my three requirements, I could satisfy either (1) alone or (2) and (3) together, but not all three at once.

Could you please help me with this?

Thanks and best regards,

Averell

1 Answer


From what I know, the automatic backup of logs to S3 that EMR supports only works at the end of the job, since it's based on the background log-loader that AWS originally implemented for batch jobs. Maybe there's a way to get it to work for rolling logs; I've just never heard of one.

I haven't tried this myself, but if I had to then I'd probably try the following:

  1. Mount S3 on your EC2 instances via S3fs.
  2. Set up logrotate (or equivalent) to automatically copy and clean up the log files.

You can use a bootstrap action to automatically set up all of the above.
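
A sketch of what that bootstrap action might look like (untested; the bucket name, mount point, and rotation settings are placeholders, and it assumes s3fs-fuse is installable from EPEL on your AMI):

    #!/bin/bash
    # Sketch of an EMR bootstrap action (untested): mount an S3 bucket with
    # s3fs and rotate YARN container logs into it. The bucket name, mount
    # point, and rotation settings below are placeholders.
    set -euo pipefail

    BUCKET=my-log-bucket          # placeholder bucket name
    MOUNT=/mnt/s3logs

    # Install s3fs-fuse (assumes it is available via EPEL; otherwise build it).
    sudo yum install -y epel-release || true
    sudo yum install -y s3fs-fuse

    # Mount the bucket, using the instance profile's IAM role for credentials.
    sudo mkdir -p "$MOUNT"
    sudo s3fs "$BUCKET" "$MOUNT" -o iam_role=auto -o allow_other

    # Rotate container logs by size, compress old files, and move the
    # archives onto the S3 mount so the local disk stays small.
    # copytruncate lets the still-running JVM keep its open file handle.
    sudo tee /etc/logrotate.d/yarn-containers >/dev/null <<'EOF'
    /mnt/var/log/hadoop-yarn/containers/*/*/*.log {
        size 30M
        rotate 3
        compress
        missingok
        copytruncate
        olddir /mnt/s3logs
    }
    EOF

    # logrotate only runs daily by default; run this config more often via cron.
    echo '*/15 * * * * root /usr/sbin/logrotate /etc/logrotate.d/yarn-containers' \
        | sudo tee /etc/cron.d/yarn-logrotate >/dev/null

One caveat: with a single olddir, rotated files from different containers could collide on file names, so in practice you'd probably derive a per-container destination in a wrapper script.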

If S3fs gives you problems, then you can do a bit more scripting and use the aws s3 command directly to sync the logs, removing them once they've been copied.
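
That version could be as simple as a cron script like this (again untested; the bucket and paths are placeholders):

    #!/bin/bash
    # Sketch (untested): push YARN container logs to S3 with `aws s3 sync`,
    # then delete rotated files old enough to have been uploaded already.
    LOG_DIR=/mnt/var/log/hadoop-yarn/containers
    DEST="s3://my-log-bucket/flink-logs/$(hostname)"   # placeholder bucket

    aws s3 sync "$LOG_DIR" "$DEST"

    # Only remove rotated archives (e.g. *.log.1), never the live *.log
    # files, and only once they are over an hour old.
    find "$LOG_DIR" -type f -name '*.log.*' -mmin +60 -delete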

kkrugler
  • Thank you @kkrugler. As per this page https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-debugging.html, logs in EMR are copied to S3 every 5 minutes, but it seems that only some specific folders are monitored. I am looking for a place to configure this, and will try your suggested solution as well. – Averell Feb 09 '19 at 22:56
  • @Averell Did you find a solution? I'm in the same spot. – Jack Mar 14 '19 at 13:26
  • @JackTuck: I haven't. I will update the post if I find one. – Averell Mar 19 '19 at 11:10
  • @Averell I have set up CloudWatch in the interim. It does the trick, but it's not ideal: I just have a config which picks up log files from all the YARN node/resource managers, TaskManagers, and JobManagers, with the TTL set to 1 day. – Jack Mar 19 '19 at 12:00