
Using the AWS S3 C++ SDK, we are trying to read from a bucket with the code below. We specify a small range using

Aws::S3::Model::GetObjectRequest object_request;
object_request.SetRange(std::to_string(position) + "-" + std::to_string(position + nbytes));

So something like 0 for the start position and 4 for the end position. We find that the read operation actually reads more bytes than we allocated in our buffer. For example, we have a file that is 69 bytes long. If we try to read the first 4 bytes from it, the result that comes back from

auto results = this->s3Client->GetObject(object_request);

we find that the size of the actual read from the server was 69 bytes, the entire size of the file. Is there a minimum number of bytes that the SDK will attempt to read when you specify very small ranges? Is this value documented somewhere?

Below is the actual function that is trying to read data from S3.

arrow::Status S3ReadableFile::Read(int64_t nbytes, int64_t* bytesRead, uint8_t* buffer) {
    Aws::S3::Model::GetObjectRequest object_request;

    object_request.SetBucket(bucketName);
    object_request.SetKey(key);
    object_request.SetRange(std::to_string(position) + "-" + std::to_string(position + nbytes));

    auto results = this->s3Client->GetObject(object_request);

    if (!results.IsSuccess()) {
        //TODO: Make bad arrow status here
        *bytesRead = 0;
        return arrow::Status::IOError("Unable to fetch object from s3 bucket.");
    } else {
        //bytes read should always be full amount
        *bytesRead = nbytes; //should almost always be nbytes
        memcpy(buffer, results.GetResult().GetBody().rdbuf(), *bytesRead);
        position += *bytesRead;
        return arrow::Status::OK();
    }
}

These are the private members of the S3ReadableFile class:

    std::shared_ptr<Aws::S3::S3Client> s3Client;
    std::string bucketName;
    std::string key;
    size_t position;
    bool valid;
  • Sounds like it. Have you tried with a larger file (10MB or even 1GB)? Also I'd be wary of reading 4 bytes at a time seeing as you get charged per request. – user253751 Feb 08 '18 at 22:59
  • hooo!!! good freaking point. It might just be smarter to buffer small files to avoid that issue. I am curious if anyone does know what the minimum read size is. If I can't get a response on that I will just write up some code to test and find out. – flips Feb 08 '18 at 23:02
  • Also it is more likely to be a service limitation than an SDK limitation. – user253751 Feb 08 '18 at 23:03
  • I am pretty sure it is too. I did want to include HOW I was connecting on the off chance it was not. – flips Feb 08 '18 at 23:04

2 Answers


The value of Range should be "bytes=0-4". See: https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.35
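
For example, a minimal sketch using the question's position and nbytes (note that both ends of an HTTP byte range are inclusive, so the first 4 bytes of an object are bytes=0-3):

Aws::S3::Model::GetObjectRequest object_request;
object_request.SetBucket(bucketName);
object_request.SetKey(key);

// HTTP Range header syntax: "bytes=<first>-<last>", both indices inclusive.
// e.g. position = 0, nbytes = 4  ->  "bytes=0-3"
std::string range = "bytes=" + std::to_string(position) + "-" + std::to_string(position + nbytes - 1);
object_request.SetRange(range.c_str());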

user2706071

Make sure you don't do the following:

// wrong way of using ss
std::stringstream ss("bytes=");
ss << beg << '-' << end;
object_request.SetRange(ss.str().c_str());

Assume beg is 0 and end is 10.

This won't work: the stream's write position starts at the beginning, so the initial "bytes=" gets overwritten and the SDK receives something like 0-10s= instead of bytes=0-10. Since the value doesn't comply with https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.35, i.e. bytes=0-10, it will download all bytes.

The correct way is:

std::stringstream ss;
ss << "bytes=" << beg << '-' << end;
object_request.SetRange(ss.str().c_str());

It took me a long time to figure this out!
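
Combining both answers with the Read function from the question, a rough sketch might look like the following (assuming <sstream> is included; copying the body with std::istream::read instead of memcpy on rdbuf() is my own addition, not part of either answer, since rdbuf() returns a pointer to the stream buffer object rather than the bytes themselves):

arrow::Status S3ReadableFile::Read(int64_t nbytes, int64_t* bytesRead, uint8_t* buffer) {
    Aws::S3::Model::GetObjectRequest object_request;
    object_request.SetBucket(bucketName);
    object_request.SetKey(key);

    // Inclusive byte range, e.g. "bytes=0-3" for the first 4 bytes.
    std::stringstream ss;
    ss << "bytes=" << position << '-' << (position + nbytes - 1);
    object_request.SetRange(ss.str().c_str());

    auto results = this->s3Client->GetObject(object_request);
    if (!results.IsSuccess()) {
        *bytesRead = 0;
        return arrow::Status::IOError("Unable to fetch object from s3 bucket.");
    }

    // Read the body stream into the caller's buffer; gcount() reports how many
    // bytes were actually returned (may be fewer near the end of the object).
    auto& body = results.GetResult().GetBody();
    body.read(reinterpret_cast<char*>(buffer), nbytes);
    *bytesRead = body.gcount();
    position += *bytesRead;
    return arrow::Status::OK();
}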

Izana