
I've got a process that constantly creates files. You can imagine some data stream from which files are created every minute. I've got hand-written code which reads the files and uploads them to a GCP bucket. Recently I came across the GCP Transfer Service for on-premises data, which does the same thing. I can configure a job and run it every 5 minutes to transfer new files. But I'm not sure whether this service has been proven for such scenarios (transferring a lot of files indefinitely and running transfer jobs every 5 minutes).

How reliable is this service?

Alexander Goida

1 Answer


Scheduling interval for transfer jobs:

Transfer service for on-premises data does not support sub-hourly change detection. It is a batch data-movement service that can scan the source at most once an hour. The documentation on scheduling Transfer jobs says:

“repeatInterval in Scheduling specifies the Interval between the start of each scheduled TransferOperation. If unspecified, the default value is 24 hours. This value may not be less than 1 hour.”

There is already a feature request to let users define a finer scheduling granularity for the Transfer service, such as 5 minutes (as in your case), but there is no defined ETA for it: https://issuetracker.google.com/122657858
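For illustration, here is a minimal sketch of creating such a job with gcloud at the minimum supported repeat interval of 1 hour. The POSIX source path, destination bucket, and agent-pool name are placeholders, and the exact flags may differ depending on your gcloud version:

```sh
# Sketch: on-premises transfer job that repeats at the minimum
# supported interval (1 hour). All names and paths are placeholders.
gcloud transfer jobs create \
  posix:///data/stream-output \
  gs://my-destination-bucket \
  --source-agent-pool=my-agent-pool \
  --schedule-repeats-every=1h
```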

Object-size limitation:

Cloud Storage supports a maximum single-object size of 5 terabytes. If you have objects larger than 5 TB, the transfer fails for those objects in Transfer for on-premises. If an object's data is updated during a transfer, Transfer for on-premises attempts the upload again. If the upload fails multiple times, Transfer for on-premises logs a FILE_MODIFIED_FAILURE. For more information, see Troubleshooting Transfer for on-premises.

Number of files that can be transferred/Bandwidth :

If your data set is larger than the limits listed below, we recommend that you split your data across multiple transfer jobs. Large directories are supported as long as every agent has at least 1 GB of memory available for every 1 million files in the largest directory, so the directory contents can be iterated over without exceeding memory. Up to 100 agents are supported for a single transfer project; given typical on-premises environments, it is unlikely that you'll need more agents to achieve better performance. Bandwidth limits are applied at the agent-pool level and are shared by all agents in the pool, so your network up-links are not saturated by Transfer service for on-premises data and your organization's existing application behavior doesn't degrade during the transfer.
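For example, assuming the `--bandwidth-limit` flag on the agent-pools commands, a cap can be set on the whole pool so the transfer does not saturate your up-link (the pool name and the 50 MB/s figure are placeholders):

```sh
# Sketch: limit the whole agent pool to 50 MB/s; all agents in the
# pool share this cap.
gcloud transfer agent-pools update my-agent-pool \
  --bandwidth-limit=50
```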

Transfer service for on-premises data supports individual transfers that are:

  • Hundreds of terabytes in size
  • Up to 1 billion files
  • Several 10s of Gbps in transfer speed

While you are running a transfer, you can add agents to increase performance. Newly started agents join the assigned agent pool and perform work from existing transfers.
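For instance, a minimal sketch of starting additional agents on an on-premises machine (the pool name and count are placeholders):

```sh
# Sketch: start three more agents; they join the named pool and pick up
# work from transfers that are already running.
gcloud transfer agents install \
  --pool=my-agent-pool \
  --count=3
```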

Priyashree Bhadra
  • Very thorough clarification. Thank you for this. I've got one point about jobs without a schedule: I can create a job that runs once and then trigger it every 5 minutes with the command `gcloud transfer jobs run ${JOB_NAME}`; it discovers new files on every run. Would it work as intended in such a case over a long time? For example, I could create a cron job which would call the command. – Alexander Goida Apr 12 '22 at 13:59
  • @AlexanderGoida did the cron job work for you? I have a similar use case – Gabio Mar 09 '23 at 08:01
  • @AlexanderGoida a cron job should work. You can also use Cloud Scheduler: choose target type HTTP, URL https://storagetransfer.googleapis.com/v1/transferJobs/:run?alt=json, and HTTP method POST, since the STS run API requires an HTTP request with the POST method. You will get the job name in the Configuration page of the Data Transfer service (see the sketch after these comments). – Priyashree Bhadra Mar 14 '23 at 16:05
  • @Gabio yes, I'm using Cloud Scheduler to trigger workflows. You may check my approach here: https://medium.com/nerd-for-tech/distributed-computing-on-spark-with-dataproc-1322d8be4bc – Alexander Goida Mar 16 '23 at 07:54
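To illustrate the two triggering approaches discussed in the comments above, here is a rough sketch. The job name, project ID, and service account are placeholders, and the service account must have permission to run the transfer job:

```sh
# Sketch: two ways to trigger an existing transfer job every 5 minutes.

# 1) cron on a machine where gcloud is authenticated (crontab -e):
# */5 * * * * gcloud transfer jobs run transferJobs/my-job

# 2) Cloud Scheduler calling the transferJobs.run REST method with POST:
gcloud scheduler jobs create http run-sts-job \
  --schedule="*/5 * * * *" \
  --uri="https://storagetransfer.googleapis.com/v1/transferJobs/my-job:run" \
  --http-method=POST \
  --message-body='{"projectId": "my-project"}' \
  --oauth-service-account-email=scheduler-sa@my-project.iam.gserviceaccount.com
```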