
I'm trying to get an ML job to run on AWS Batch. The job runs in a Docker container, using credentials generated for a task IAM role.

I use DVC to manage the large data files needed for the task, which are hosted in an S3 bucket. However, when the task tries to pull the data files, it gets an access denied error.

I can verify that the role has permission to access the bucket, because I can access the exact same files if I run an aws s3 cp command (as shown in the example below). But I need to do it through DVC so that it downloads the right version of each file and puts it in the expected place.

I've been able to trace the problem down to s3fs, which DVC uses to integrate with S3. As I demonstrate in the example below, I get an access denied error even when I use s3fs by itself, passing in the credentials explicitly. It seems to fail on this line, where it falls back to listing the path as a prefix after failing to find the object via a head_object call.
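
For reference, the same two requests can be reproduced with plain boto3, using the same credentials from the environment, which should help narrow down whether the denial is specific to s3fs or also happens for the underlying S3 calls. This is just a diagnostic sketch (the bucket and key are the same ones used in the example below), not part of the actual job:

import boto3
from botocore.exceptions import ClientError

# Diagnostic sketch only: issue the same requests s3fs makes for exists(),
# using the task-role credentials already exported into the environment.
s3 = boto3.client("s3", region_name="us-east-1")
bucket = "company-dvc"
key = "repo/00/0e4343c163bd70df0a6f9d81e1b4d2"

try:
    s3.head_object(Bucket=bucket, Key=key)
    print("HeadObject: ok")
except ClientError as e:
    print("HeadObject failed:", e.response["Error"]["Code"])

try:
    s3.list_objects_v2(Bucket=bucket, Prefix=key, MaxKeys=1)
    print("ListObjectsV2: ok")
except ClientError as e:
    print("ListObjectsV2 failed:", e.response["Error"]["Code"])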

I suspect there may be a bug in s3fs, or in the particular combination of boto, http, and s3 libraries. Can anyone help me figure out how to fix this?

Here is a minimal reproducible example:

Shell script for the job:

#!/bin/bash

# Fetch the task role credentials from the ECS container credentials endpoint.
AWS_CREDENTIALS=$(curl http://169.254.170.2$AWS_CONTAINER_CREDENTIALS_RELATIVE_URI)

export AWS_DEFAULT_REGION=us-east-1
export AWS_ACCESS_KEY_ID=$(echo "$AWS_CREDENTIALS" | jq .AccessKeyId -r)
export AWS_SECRET_ACCESS_KEY=$(echo "$AWS_CREDENTIALS" | jq .SecretAccessKey -r)
export AWS_SESSION_TOKEN=$(echo "$AWS_CREDENTIALS" | jq .Token -r)

# Print the key ID in full and only the first/last few characters of the secrets.
echo "AWS_ACCESS_KEY_ID=<$AWS_ACCESS_KEY_ID>"
echo "AWS_SECRET_ACCESS_KEY=<$(cat <(echo "$AWS_SECRET_ACCESS_KEY" | head -c 6) <(echo -n "...") <(echo "$AWS_SECRET_ACCESS_KEY" | tail -c 6))>"
echo "AWS_SESSION_TOKEN=<$(cat <(echo "$AWS_SESSION_TOKEN" | head -c 6) <(echo -n "...") <(echo "$AWS_SESSION_TOKEN" | tail -c 6))>"

dvc doctor

# Succeeds!
aws s3 ls s3://company-dvc/repo/

# Succeeds!
aws s3 cp s3://company-dvc/repo/00/0e4343c163bd70df0a6f9d81e1b4d2 mycopy.txt

# Fails!
python3 download_via_s3fs.py

download_via_s3fs.py:

import os

import s3fs

# Just to make sure we're reading the credentials correctly.
print(os.environ["AWS_ACCESS_KEY_ID"])
print(os.environ["AWS_SECRET_ACCESS_KEY"])
print(os.environ["AWS_SESSION_TOKEN"])

print("running with credentials")
fs = s3fs.S3FileSystem(
    key=os.environ["AWS_ACCESS_KEY_ID"],
    secret=os.environ["AWS_SECRET_ACCESS_KEY"],
    token=os.environ["AWS_SESSION_TOKEN"],
    client_kwargs={"region_name": "us-east-1"}
)

# Fails with "access denied" on ListObjectV2
print(fs.exists("company-dvc/repo/00/0e4343c163bd70df0a6f9d81e1b4d2"))
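
When exists() fails like this, the exception alone doesn't show which of the two requests was rejected. One way to check (again just a diagnostic sketch, not part of the job) is to turn on debug logging before constructing the filesystem, so that s3fs and the underlying (aio)botocore client log the requests they send:

import logging
import os

import s3fs

# Diagnostic sketch: with root-level DEBUG logging, s3fs and the underlying
# (aio)botocore client log the individual requests, which shows whether the
# 403 comes back on HeadObject or on ListObjectsV2.
logging.basicConfig(level=logging.DEBUG)

fs = s3fs.S3FileSystem(
    key=os.environ["AWS_ACCESS_KEY_ID"],
    secret=os.environ["AWS_SECRET_ACCESS_KEY"],
    token=os.environ["AWS_SESSION_TOKEN"],
    client_kwargs={"region_name": "us-east-1"},
)

print(fs.exists("company-dvc/repo/00/0e4343c163bd70df0a6f9d81e1b4d2"))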

Terraform for IAM role:

data "aws_iam_policy_document" "standard-batch-job-role" {
  # S3 read access to related buckets
  statement {
    actions = [
      "s3:Get*",
      "s3:List*",
    ]
    resources = [
      data.aws_s3_bucket.company-dvc.arn,
      "${data.aws_s3_bucket.company-dvc.arn}/*",
    ]
    effect = "Allow"
  }
}
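
For what it's worth, the two calls map onto different IAM actions: HeadObject needs s3:GetObject on the object ARN, while ListObjectsV2 needs s3:ListBucket on the bucket ARN itself, and both appear to be covered by this policy. One way to double-check what the role is actually allowed to do is the IAM policy simulator; the sketch below uses a placeholder role ARN and assumes the caller has iam:SimulatePrincipalPolicy:

import boto3

# Sketch only: simulate the two S3 actions that the failing exists() call needs
# against the task role's policies. ROLE_ARN is a placeholder, and the caller
# needs iam:SimulatePrincipalPolicy for this to work.
iam = boto3.client("iam")
ROLE_ARN = "arn:aws:iam::123456789012:role/standard-batch-job-role"  # placeholder

checks = [
    ("s3:GetObject", "arn:aws:s3:::company-dvc/repo/00/0e4343c163bd70df0a6f9d81e1b4d2"),
    ("s3:ListBucket", "arn:aws:s3:::company-dvc"),
]

for action, resource in checks:
    response = iam.simulate_principal_policy(
        PolicySourceArn=ROLE_ARN,
        ActionNames=[action],
        ResourceArns=[resource],
    )
    for result in response["EvaluationResults"]:
        print(action, resource, result["EvalDecision"])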

Environment

  • OS: Ubuntu 20.04
  • Python: 3.10
  • s3fs: 2023.1.0
  • boto3: 1.24.59
  • Since you use DVC, take a look at its Python API. May be easier for your task than s3fs. See https://dvc.org/doc/api-reference – Jorge Orpinel Pérez Feb 04 '23 at 18:34
  • DVC uses s3fs to download files from S3. My original script was just attempting to run a `dvc pull`, but it fails with the same access denied error, in exactly the same place in the s3fs library. My example is just a more direct way to show that the problem lies with `s3fs`, and not with DVC itself. – Kevin Yancey Feb 05 '23 at 02:48
  • May be worth opening a ticket in https://github.com/iterative/dvc.org and/or https://github.com/fsspec/s3fs . – Jorge Orpinel Pérez Feb 08 '23 at 18:49
  • More discussion is also happening here: https://discuss.dvc.org/t/running-dvc-on-aws-batch/1481/9 – Shcheklein Feb 09 '23 at 01:04
