
The aws s3 sync command in the CLI can download a large collection of files very quickly, and I cannot achieve the same performance with the AWS Go SDK. I have millions of files in the bucket, so this is critical for me. I also need to list objects page by page so that I can add a prefix, which is not supported well by the sync CLI command.

I have tried using multiple goroutines (from 10 up to 1000) to make requests to the server, but it is still much slower than the CLI. The Go GetObject call takes about 100 ms per file, which is unacceptable for the number of files that I have. I know that the AWS CLI also uses the Python SDK in the backend, so how does it get so much better performance? (I tried my script with boto as well as Go.)

I am using ListObjectsV2Pages and GetObject. My client region is the same as the bucket's region.

    logMtx := &sync.Mutex{}
    logBuf := bytes.NewBuffer(make([]byte, 0, 100000000))

    err = s3c.ListObjectsV2Pages(
        &s3.ListObjectsV2Input{
            Bucket:  bucket,
            Prefix:  aws.String("2019-07-21-01"),
            MaxKeys: aws.Int64(1000),
        },
        func(page *s3.ListObjectsV2Output, lastPage bool) bool {
            fmt.Println("Received", len(page.Contents), "objects in page")
            worker := make(chan bool, 10) // token semaphore: at most 10 concurrent GetObject calls
            for i := 0; i < cap(worker); i++ {
                worker <- true
            }
            wg := &sync.WaitGroup{}
            wg.Add(len(page.Contents))
            objIdx := 0
            objIdxMtx := sync.Mutex{}
            for {
                <-worker // block until a download slot is free
                objIdxMtx.Lock()
                if objIdx == len(page.Contents) {
                    break
                }
                go func(idx int, obj *s3.Object) {
                    gs := time.Now()
                    resp, err := s3c.GetObject(&s3.GetObjectInput{
                        Bucket: bucket,
                        Key:    obj.Key,
                    })
                    check(err)
                    fmt.Println("Get: ", time.Since(gs))

                    rs := time.Now()
                    logMtx.Lock()
                    // Append the object body to the shared in-memory buffer.
                    _, err = logBuf.ReadFrom(resp.Body)
                    check(err)
                    logMtx.Unlock()
                    fmt.Println("Read: ", time.Since(rs))

                    err = resp.Body.Close()
                    check(err)
                    worker <- true // return the slot to the pool
                    wg.Done()
                }(objIdx, page.Contents[objIdx])
                objIdx += 1
                objIdxMtx.Unlock()
            }
            fmt.Println("ok")
            wg.Wait()
            return true
        },
    )
    check(err)

Many results look like:

Get:  153.380727ms
Read:  51.562µs
quintin
  • Is your Go code running on AWS compute (EC2, Lambda)? If so, what are the EC2 instance types and specs? Or is your Go code running on a computer outside of the AWS boundary? – Taterhead Jul 31 '19 at 08:54
  • The code is running on a computer outside of AWS - my laptop. The `sync` command is definitely faster in this environment. I also ran it on a `c5.xlarge`. My script was slightly slower than the `sync` command but it got to a point where I just went with the longer running time. In other words, the time delta was better on the EC2 instance. – quintin Jul 31 '19 at 17:51
  • 1
    Another advantage running it inside the AWS boundary is cost. Because you are billed for all data transferee outside the AWS boundary. – Taterhead Jul 31 '19 at 18:23
  • Good point. I was only really testing a subset of data on my laptop, then did the big transfer (~3 million files) on the EC2 instance. – quintin Jul 31 '19 at 18:44

2 Answers


Have you tried using https://docs.aws.amazon.com/sdk-for-go/api/service/s3/s3manager/?

    // ctx, client (an *s3.S3), bucket, prefix, dstdir, and the session s are
    // assumed to be set up elsewhere.
    iter := new(s3manager.DownloadObjectsIterator)
    var files []*os.File
    defer func() {
        for _, f := range files {
            f.Close()
        }
    }()

    // Build the batch of downloads while paging through the listing.
    err := client.ListObjectsV2PagesWithContext(ctx, &s3.ListObjectsV2Input{
        Bucket: aws.String(bucket),
        Prefix: aws.String(prefix),
    }, func(output *s3.ListObjectsV2Output, last bool) bool {
        for _, object := range output.Contents {
            nm := filepath.Join(dstdir, *object.Key)
            err := os.MkdirAll(filepath.Dir(nm), 0755)
            if err != nil {
                panic(err)
            }

            f, err := os.Create(nm)
            if err != nil {
                panic(err)
            }

            log.Println("downloading", *object.Key, "to", nm)

            iter.Objects = append(iter.Objects, s3manager.BatchDownloadObject{
                Object: &s3.GetObjectInput{
                    Bucket: aws.String(bucket),
                    Key:    object.Key,
                },
                Writer: f,
            })
            files = append(files, f)
        }

        return true
    })
    if err != nil {
        panic(err)
    }

    // Let the download manager fetch everything collected above.
    downloader := s3manager.NewDownloader(s)
    err = downloader.DownloadWithIterator(ctx, iter)
    if err != nil {
        panic(err)
    }
Caleb
  • The only difference I can tell is using the `s3manager.DownloadObjectsIterator`? I tried that initially but I did not end up using it because it is practically the same process, except it uses multiple goroutines to download one file instead of multiple goroutines to download multiple files. There are a lot of small files, so I figured only one goroutine per file was necessary, and that I should shoot off multiple requests at once for multiple files. I think the real speed advantage is those concurrent GET requests. – quintin Jul 31 '19 at 17:54
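For what it's worth, the two approaches can be combined: keep a single request per object (since the objects are small) and fan the s3manager Download calls out across a bounded pool of goroutines. The sketch below is illustrative only; `downloadAll`, `workers`, and the `keys` slice are made-up names, and an existing `*session.Session` and bucket name are assumed.

    import (
        "fmt"
        "sync"

        "github.com/aws/aws-sdk-go/aws"
        "github.com/aws/aws-sdk-go/aws/session"
        "github.com/aws/aws-sdk-go/service/s3"
        "github.com/aws/aws-sdk-go/service/s3/s3manager"
    )

    // downloadAll fetches every key into memory, issuing at most `workers`
    // concurrent GETs, one goroutine (and one request) per object.
    func downloadAll(sess *session.Session, bucket string, keys []string, workers int) error {
        downloader := s3manager.NewDownloader(sess, func(d *s3manager.Downloader) {
            d.Concurrency = 1 // small objects fit in a single part, so one GET each
        })

        sem := make(chan struct{}, workers) // bounds the number of in-flight requests
        var wg sync.WaitGroup
        errs := make(chan error, len(keys))

        for _, key := range keys {
            wg.Add(1)
            sem <- struct{}{}
            go func(key string) {
                defer wg.Done()
                defer func() { <-sem }()
                buf := aws.NewWriteAtBuffer(nil) // in-memory io.WriterAt target
                _, err := downloader.Download(buf, &s3.GetObjectInput{
                    Bucket: aws.String(bucket),
                    Key:    aws.String(key),
                })
                if err != nil {
                    errs <- fmt.Errorf("%s: %v", key, err)
                }
            }(key)
        }
        wg.Wait()
        close(errs)
        for err := range errs {
            return err // surface the first failure, if any
        }
        return nil
    }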

I ended up settling for my script in the initial post. I tried 20 goroutines and that seemed to work pretty well. On my laptop (i7, 8 threads, 16 GB RAM, NVMe SSD), the initial script was definitely slower than the CLI. However, on the EC2 instance the difference was small enough that it was not worth my time to optimize further. I used a c5.xlarge instance in the same region as the S3 bucket.
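For context, the concurrency level in the original script is just the capacity of the worker channel in the page handler, so moving from 10 to 20 goroutines is a one-line change (shown here against the code from the question):

    worker := make(chan bool, 20) // allow up to 20 concurrent GetObject calls per page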

quintin