
I have a Google App Engine PHP 5.5 service that periodically checks a public website and downloads a file. This file is typically small (<1MB). My simple app is based on the following:

<?php
$strSource = 'https://example.com/file.zip';

$strBucket = 'bucket-1234';
$strDirectory = '/path/to/file/'; // Google Cloud Storage directory
$strName = 'file.zip';
$strDestination = 'gs://' . $strBucket . '.appspot.com' . $strDirectory . $strName;

copy($strSource, $strDestination); // stream the download straight into the bucket via the gs:// wrapper
?>

I have found that this file is occasionally larger (over the 32MB response size limit). How do I write this script so that it handles the file whether it is 1MB or 100MB?

I see people recommend "Blobstore," which I have no experience with. Even if I understood that solution (which seems to be focused on a very different use case), it does not appear to be available for PHP at all. Am I missing something?

J. Doe
  • Blobstore is currently not available in PHP. I've found [this post](https://stackoverflow.com/questions/53105488/app-engine-fails-to-download-greater-than-33mb), where a user recommended serving URLs from Cloud Storage directly, and it worked for them. – ericcco Oct 21 '19 at 14:27
  • Thank you for the comment, @eespinola ... I think we might be talking about two different things. I want to be clear that I am looking for a solution that would allow my Google App Engine service to download a file from a public web server. I think you are talking about a local machine being able to download a file from storage associated with a Google App Engine service (although probably not the simple Cloud Storage that I have been using). Am I correct in understanding the difference between what I am asking and what you are commenting on? Should I make that clearer in my question? – J. Doe Oct 21 '19 at 21:21
  • Sorry if I misunderstood your question, and thanks for clarifying your point. As you mentioned in your post, and as [documented](https://cloud.google.com/appengine/docs/standard/php/outbound-requests#quotas_and_limits_for_url_fetch), the limit on the response size in GAE is ~32MB. As a workaround, I would suggest using a Compute Engine instance that regularly checks whether new files are available; if there are any, it downloads them and uploads them to GCS. I know that you would like to use GAE, but since the response size is a hard limit, this is the only way to do it. – ericcco Oct 22 '19 at 08:41
  • Using Compute Engine as a middle man to temporarily download files and then upload them to Cloud Storage is an interesting suggestion. How would that work exactly? – J. Doe Oct 22 '19 at 18:27
  • Would I need to move the portion of my App Engine service that currently downloads the file directly to my Cloud Storage bucket from App Engine into a simple Cron job on Compute Engine? Then, as a post-process, I would need to transfer the file from my Compute Engine disk to the Cloud Storage bucket (then delete the file from my Compute Engine disk)? Is this [superuser post](https://superuser.com/questions/969666/copy-files-from-google-compute-engine-instance-to-google-cloud-storage-bucket) the best bet? If this is what you are thinking, and you post with details, I will mark as a solution. – J. Doe Oct 22 '19 at 18:30

2 Answers


I would recommend using a Compute Engine instance, since GAE has the 32MB limit on response size. I found this post, where the user checks whether new files are available and, if there are, uploads them directly to GCS.

In order to do this, and as specified in the documentation, you should create an instance in GCE, and install and configure the client library for the language that you are going to use (since you mentioned in your post that you were using PHP, all the links will refer to this language, but keep in mind that you can also choose another language such as C++, Java, Python...).

Here is an example in PHP of how to upload an object to GCS:

<?php
// Requires the Cloud Storage client library: composer require google/cloud-storage
require __DIR__ . '/vendor/autoload.php';

use Google\Cloud\Storage\StorageClient;

function upload_object($bucketName, $objectName, $source)
{
    $storage = new StorageClient();
    $file = fopen($source, 'r'); // stream the local file
    $bucket = $storage->bucket($bucketName);
    $object = $bucket->upload($file, [
        'name' => $objectName
    ]);
    printf('Uploaded %s to gs://%s/%s' . PHP_EOL, basename($source), $bucketName, $objectName);
}
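
On the GCE instance, this could be driven by a simple cron job: download the remote file to local disk first (outside App Engine, the 32MB response limit does not apply), then hand the local copy to the function above. Here is a minimal sketch, reusing the placeholder URL and bucket names from the question:

// Hypothetical cron-driven script for the GCE instance.
$source = 'https://example.com/file.zip'; // placeholder URL from the question
$local  = '/tmp/file.zip';

// Download to local disk; the GAE response limit does not apply here.
if (!copy($source, $local)) {
    exit('Download failed' . PHP_EOL);
}

// Reuse the upload function above, then clean up the temporary copy.
upload_object('bucket-1234', 'path/to/file/file.zip', $local);
unlink($local);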

You can also find other samples in the GitHub repository from Google Cloud Platform.

Hope this helps!

ericcco
  • Thank you! This is very interesting. Obviously, using a GCE VM to temporarily download files seems far less elegant and efficient than using the serverless GAE solution, but I will explore this and the other solutions described in [the post you referenced](https://stackoverflow.com/q/45061496/9666113) and let you know if I can implement it successfully. I see another suggestion in the other post is to use Cloud Functions. However, am I interpreting [the limits correctly that a CF solution would fail for files >10MB](https://cloud.google.com/functions/quotas)? – J. Doe Oct 23 '19 at 19:38
  • There is yet another suggestion: to use the [Cloud Storage Transfer Service](https://cloud.google.com/storage-transfer/docs/reference/rest/v1/TransferSpec#httpdata), but that requires the file size and MD5 checksum of the remote file. When I run `curl -I 'https://example.com/file.zip'`, it does not provide the MD5 hash information, so I think this option is off the table. Right? – J. Doe Oct 23 '19 at 19:38
  • Indeed, you would need the MD5 hash of the object in order to use the Cloud Storage Transfer Service. Let me know if my recommendation about using GCE instances works for you! :) – ericcco Oct 24 '19 at 09:36
  • Before I accept this answer, can you confirm that the Cloud Function alternative mentioned in the first comment above will fail for files greater than 10MB? – J. Doe Oct 25 '19 at 23:37
  • You can check [here](https://cloud.google.com/functions/quotas#resource_limits) that the size limit for requests and responses in Cloud Functions is 10MB. – ericcco Oct 28 '19 at 09:05
  • Thanks, @eespinola. I appreciate the explanations. I will go with this solution if it looks as though AWS also has similar limits on serverless. I am not familiar with the AWS ecosystem, but [I have a post here](https://stackoverflow.com/q/58598052/9666113) to see if an AWS Lambda function is a better longer-term solution to my needs. – J. Doe Oct 28 '19 at 21:09

Use the Google Storage Transfer Service (STS). It can be called via the Google Cloud SDK from your existing App Engine application, and it will transfer the files directly from S3 to GCS without hitting any of your App Engine limits. Based on your description, I believe it meets your requirements:

  • No data transfer limit
  • Minimal code changes
  • "Serverless" and simple to configure

STS has some additional benefits:

  • Zero runtime cost. That is, App Engine simply makes the STS API call to start the transfer job, which STS then handles, so you're not billed for the time GAE would normally spend downloading/uploading the files itself.
  • You can go even more serverless by invoking STS from a Cloud Function triggered by Cloud Scheduler. I doubt you'd save much on costs, but it sure would be a neat setup.

The GCP docs have a guide on How to set up a Transfer from Amazon S3 to Cloud Storage.
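
For illustration, kicking off a transfer job from PHP might look roughly like the sketch below. This is untested: the class names assume the generated google/apiclient Composer package (naming can vary between library versions), Application Default Credentials are assumed to be available, and the project ID, bucket names, and AWS keys are placeholders:

<?php
require __DIR__ . '/vendor/autoload.php';

// Authenticate with Application Default Credentials (assumed available).
$client = new Google_Client();
$client->useApplicationDefaultCredentials();
$client->addScope(Google_Service_Storagetransfer::CLOUD_PLATFORM);
$sts = new Google_Service_Storagetransfer($client);

// Source: an S3 bucket (bucket name and AWS keys are placeholders).
$awsKey = new Google_Service_Storagetransfer_AwsAccessKey();
$awsKey->setAccessKeyId('AWS_ACCESS_KEY_ID');
$awsKey->setSecretAccessKey('AWS_SECRET_ACCESS_KEY');
$s3 = new Google_Service_Storagetransfer_AwsS3Data();
$s3->setBucketName('my-s3-bucket');
$s3->setAwsAccessKey($awsKey);

// Sink: the GCS bucket from the question.
$gcs = new Google_Service_Storagetransfer_GcsData();
$gcs->setBucketName('bucket-1234');

$spec = new Google_Service_Storagetransfer_TransferSpec();
$spec->setAwsS3DataSource($s3);
$spec->setGcsDataSink($gcs);

// A schedule whose start and end dates are the same day runs once.
$today = new Google_Service_Storagetransfer_Date();
$today->setYear((int) date('Y'));
$today->setMonth((int) date('n'));
$today->setDay((int) date('j'));
$schedule = new Google_Service_Storagetransfer_Schedule();
$schedule->setScheduleStartDate($today);
$schedule->setScheduleEndDate($today);

$job = new Google_Service_Storagetransfer_TransferJob();
$job->setProjectId('my-project-id'); // placeholder
$job->setStatus('ENABLED');
$job->setTransferSpec($spec);
$job->setSchedule($schedule);

$result = $sts->transferJobs->create($job);
printf('Created transfer job: %s' . PHP_EOL, $result->getName());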



Travis Webb
  • Thanks, @TravisWebb. However, the "file.zip" does not exist in an S3 bucket before my script runs. My app "periodically checks a public web server and downloads a file" to a bucket (e.g., from `https://example.com/file.zip` to a GCS bucket). – J. Doe Oct 29 '19 at 05:12
  • So, you first check if it's there with GAE or GCF or whatever, and if it exists, then use STS to transfer it. – Travis Webb Oct 29 '19 at 13:13
  • Thanks, @TravisWebb. STS looks cool, but I still fail to see how it solves my problem. [See this](https://cloud.google.com/storage-transfer/docs/create-url-list): in order to use STS for my needs (downloading files from a public website to a Cloud Storage bucket), I could use a GAE service to check whether files exist, but to actually transfer with STS I would need both the file size AND the MD5 checksum (the URL-list format is sketched after this thread). Since the MD5 hash does not appear when I run `curl -I 'https://example.com/file.zip'`, the only solution is to download the file to get the MD5 value. Because of the 32MB response limit on GAE, I cannot do that. – J. Doe Oct 29 '19 at 18:33
  • Hi @TravisWebb. I don't know how many times I have to repeat myself, but I am not concerned with transferring data from an Amazon S3 bucket to a Google Cloud Storage bucket. My problem is with transferring a file from a public website to my Cloud Storage bucket (or, if I find GCP cannot do this, I may switch to AWS and just write a Lambda function to transfer the file from a public website to an S3 bucket). Based on the responses I've seen on [my other post](https://stackoverflow.com/q/58598052/9666113), it seems likely that Lambda functions can download individual files of up to 512MB. – J. Doe Oct 29 '19 at 19:52
  • Ok, I see. I was mixing up this with your other related question about S3. – Travis Webb Oct 29 '19 at 20:32
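
For reference, the URL list that the create-url-list documentation (cited in the comments above) describes is a tab-separated file along the following lines. The size and checksum here are made-up placeholder values; the third column is the base64-encoded MD5, which is exactly what a bare `curl -I` does not provide:

TsvHttpData-1.0
https://example.com/file.zip	34567890	9MwXq1bJpTSTVv7xGCTisg==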