
I am running a one-off Fargate task that runs a small Python script. The task definition is configured to use the awslogs driver to send logs to CloudWatch, but I am facing a very strange intermittent issue.

Logs will sometimes appear in the newly created CloudWatch log stream and sometimes they won't. I have tried removing parts of my code, and for now, here's what I have.

When I remove the asyncio/aiohttp fetching logic, the print statements appear normally in CloudWatch Logs, though since the issue is intermittent, I can't be 100% sure this will always be the case.

However, with the fetching logic included, I sometimes get log streams that are completely empty after the Fargate task exits: no "Job starting", "Job complete", or "Putting file into ..." logs, and no error logs either. Despite this, when I check the S3 bucket, the file with the corresponding timestamp was created, which indicates the script did run to completion. I can't fathom how this is possible.

dostuff.py

#!/usr/bin/env python3.6

import asyncio
import datetime
import time

from aiohttp import ClientSession
import boto3


def s3_put(bucket, key, body):
    try:
        print(f"Putting file into {bucket}/{key}")
        client = boto3.client("s3")
        client.put_object(Bucket=bucket, Key=key, Body=body)
    except Exception:
        print(f"Error putting object into S3 Bucket: {bucket}/{key}")
        raise


async def fetch(session, number):
    url = f'https://jsonplaceholder.typicode.com/todos/{number}'
    try:
        async with session.get(url) as response:
            return await response.json()
    except Exception as e:
        print(f"Failed to fetch {url}")
        print(e)
        return None


async def fetch_all():
    tasks = []
    async with ClientSession() as session:
        # Queue each of the 199 URLs five times, for roughly 1,000 requests in total.
        for _ in range(1, 6):
            for number in range(1, 200):
                task = asyncio.ensure_future(fetch(session=session, number=number))
                tasks.append(task)
        responses = await asyncio.gather(*tasks)
    return responses


def main():
    try:
        loop = asyncio.get_event_loop()
        future = asyncio.ensure_future(fetch_all())
        # Drop the None entries returned by failed fetches.
        responses = list(filter(None, loop.run_until_complete(future)))
    except Exception:
        print("uh oh")
        raise

    # do stuff with responses

    body = "whatever"
    key = f"{datetime.datetime.fromtimestamp(time.time()).strftime('%Y-%m-%d_%H-%M-%S')}_test"
    s3_put(bucket="my-s3-bucket", key=key, body=body)


if __name__ == "__main__":
    print("Job starting")
    main()
    print("Job complete")

Dockerfile

FROM python:3.6-alpine
COPY docker/test_fargate_logging/requirements.txt /
COPY docker/test_fargate_logging/dostuff.py /
WORKDIR /
RUN pip install --upgrade pip && \
    pip install -r requirements.txt
ENTRYPOINT python dostuff.py
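
One thing that may be worth ruling out (an assumption on my part, not something confirmed by the question): when stdout is not attached to a TTY, Python block-buffers it, so prints can sit in the interpreter's buffer instead of reaching the awslogs driver right away. A minimal variant of the Dockerfile above that disables buffering and switches ENTRYPOINT to exec form, so python runs as PID 1 and receives SIGTERM directly when the task stops:

FROM python:3.6-alpine
COPY docker/test_fargate_logging/requirements.txt /
COPY docker/test_fargate_logging/dostuff.py /
WORKDIR /
RUN pip install --upgrade pip && \
    pip install -r requirements.txt
# Assumption to rule out buffering: force unbuffered stdout/stderr.
ENV PYTHONUNBUFFERED=1
# Exec form: python becomes PID 1 and gets signals directly on task stop.
ENTRYPOINT ["python", "dostuff.py"]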

Task Definition

{
    "ipcMode": null,
    "executionRoleArn": "arn:aws:iam::xxxxxxxxxxxx:role/ecsInstanceRole",
    "containerDefinitions": [
        {
            "dnsSearchDomains": null,
            "logConfiguration": {
                "logDriver": "awslogs",
                "options": {
                    "awslogs-group": "test-fargate-logging-stg-log-group",
                    "awslogs-region": "ap-northeast-1",
                    "awslogs-stream-prefix": "ecs"
                }
            },
            "entryPoint": null,
            "portMappings": [],
            "command": null,
            "linuxParameters": null,
            "cpu": 256,
            "environment": [],
            "ulimits": null,
            "dnsServers": null,
            "mountPoints": [],
            "workingDirectory": null,
            "secrets": null,
            "dockerSecurityOptions": null,
            "memory": 512,
            "memoryReservation": null,
            "volumesFrom": [],
            "image": "xxxxxxxxxxxx.dkr.ecr.ap-northeast-1.amazonaws.com/test-fargate-logging-stg-ecr-repository:xxxxxxxxxxxxxxxxxxxxxxxxxxxx",
            "disableNetworking": null,
            "interactive": null,
            "healthCheck": null,
            "essential": true,
            "links": null,
            "hostname": null,
            "extraHosts": null,
            "pseudoTerminal": null,
            "user": null,
            "readonlyRootFilesystem": null,
            "dockerLabels": null,
            "systemControls": null,
            "privileged": null,
            "name": "test_fargate_logging"
        }
    ],
    "placementConstraints": [],
    "memory": "512",
    "taskRoleArn": "arn:aws:iam::xxxxxxxxxxxx:role/ecsInstanceRole",
    "compatibilities": [
        "EC2",
        "FARGATE"
    ],
    "taskDefinitionArn": "arn:aws:ecs:ap-northeast-1:xxxxxxxxxxxx:task-definition/test-fargate-logging-stg-task-definition:2",
    "family": "test-fargate-logging-stg-task-definition",
    "requiresAttributes": [
        {
            "targetId": null,
            "targetType": null,
            "value": null,
            "name": "ecs.capability.execution-role-ecr-pull"
        },
        {
            "targetId": null,
            "targetType": null,
            "value": null,
            "name": "com.amazonaws.ecs.capability.docker-remote-api.1.18"
        },
        {
            "targetId": null,
            "targetType": null,
            "value": null,
            "name": "ecs.capability.task-eni"
        },
        {
            "targetId": null,
            "targetType": null,
            "value": null,
            "name": "com.amazonaws.ecs.capability.ecr-auth"
        },
        {
            "targetId": null,
            "targetType": null,
            "value": null,
            "name": "com.amazonaws.ecs.capability.task-iam-role"
        },
        {
            "targetId": null,
            "targetType": null,
            "value": null,
            "name": "ecs.capability.execution-role-awslogs"
        },
        {
            "targetId": null,
            "targetType": null,
            "value": null,
            "name": "com.amazonaws.ecs.capability.logging-driver.awslogs"
        },
        {
            "targetId": null,
            "targetType": null,
            "value": null,
            "name": "com.amazonaws.ecs.capability.docker-remote-api.1.19"
        }
    ],
    "pidMode": null,
    "requiresCompatibilities": [
        "FARGATE"
    ],
    "networkMode": "awsvpc",
    "cpu": "256",
    "revision": 2,
    "status": "ACTIVE",
    "volumes": []
}

Observations

  • When I decrease the number of tasks (URLs to fetch) to, say, 10 instead of ~1000, the logs seem to appear most or all(?) of the time. Again, the issue is intermittent, so I can't be 100% sure.
  • My original script had additional logic for retrying failed fetches, plus parsing logic, both of which I removed while troubleshooting. Back then, the logging behavior at least included the "Job starting" log and the logs emitted during the asynchronous aiohttp requests; however, the logs for writing to S3 and the final "Job complete" log appeared only intermittently. With the simplified script above, I seem to get either all of the logs or none at all.
  • The issue was happening with Python's logging library as well; I switched to print to rule out problems with the logging setup.
  • I'm experiencing the same issue. Did you find anything to work around this? I'll let you know if I do. – Laurent Jalbert Simard Jan 18 '19 at 20:36
  • @LaurentJalbertSimard For the time being, I have lowered the number of concurrent requests (see the capped-concurrency sketch below), but since I have some time, I may spin up some test infra to test some more. I have also posted the issue on the AWS forums and noticed some other logging issues which may or may not be related: https://forums.aws.amazon.com/forum.jspa?forumID=187&start=0 –  Jan 21 '19 at 02:52
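
Following up on the workaround mentioned in the comment above, here is a minimal sketch of capping concurrency with asyncio.Semaphore. fetch_limited, fetch_all_limited, and the CONCURRENCY value are names and numbers introduced purely for illustration; this is a sketch of the workaround, not a confirmed fix:

import asyncio

from aiohttp import ClientSession

CONCURRENCY = 20  # hypothetical cap; tune for your workload


async def fetch_limited(session, semaphore, number):
    url = f'https://jsonplaceholder.typicode.com/todos/{number}'
    async with semaphore:  # at most CONCURRENCY requests in flight at once
        async with session.get(url) as response:
            return await response.json()


async def fetch_all_limited():
    semaphore = asyncio.Semaphore(CONCURRENCY)
    async with ClientSession() as session:
        tasks = [asyncio.ensure_future(fetch_limited(session, semaphore, n))
                 for n in range(1, 200)]
        return await asyncio.gather(*tasks)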

2 Answers


The Problem

I have been experiencing the same issue: intermittently losing the logs of ECS Fargate tasks in CloudWatch.

While I cannot answer as to why this is occurring, I can offer a workaround which I just tested out.

What worked for me:

Upgrading to Python 3.7 (I see you are using 3.6, as was I when experiencing the same issue).
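
Assuming the same Dockerfile as in the question, the only change this workaround needs is the base image tag:

# Only the base image changes; the rest of the Dockerfile stays the same.
FROM python:3.7-alpine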

I am now seeing all my logs and am benefiting from the latest version of Python.

I hope this helps you as it helped me.

Captain
  • I'm also having this issue on Ruby 2.5, so this workaround unfortunately does not help me. Have you discovered anything else? – Kevin May 24 '19 at 02:04
  • Also had this issue on Python as well as on Node.js containers – tanvi Aug 19 '19 at 18:46

This issue seems to be solved now, according to this AWS Forums link. I had a similar issue, and there are some useful workarounds in the answers to this question: Missing log lines when writing to cloudwatch from ECS Docker containers

You should not be seeing this issue anymore. If you are, try deploying a new revision of your task definition; that should fix it.
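
If you need to force a new revision, re-registering the task definition should be enough. A sketch with the AWS CLI, assuming the task definition JSON is saved locally as task-definition.json (note that read-only fields from the describe-task-definition output, such as taskDefinitionArn, revision, status, compatibilities, and requiresAttributes, may need to be stripped first):

# Register a new revision of the task definition from a local JSON file.
aws ecs register-task-definition --cli-input-json file://task-definition.json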

tanvi