I was recently faced with an architectural problem. I built a Node.js application that fetches three zip files from Census.gov (13 MB, 1.2 MB, and 6.7 GB), which takes about 15 to 20 minutes. After the download, the application unzips the files and extracts the needed data into an AWS RDS database; the zip files are deleted once processing is done. The catch is that this application only needs to run once a year. What would be the best solution for this kind of task?
- We need to know the environment where the download task needs to run. AWS has scheduled tasks, Windows has the Windows Task Scheduler, and *nix varieties have various cron job capabilities. If you're running this all on AWS, then just search for "AWS scheduled tasks" and find the tool that matches the kind of AWS service you're using. – jfriend00 Aug 21 '22 at 16:51
- @jfriend00 The question is far too open-ended. One thing duct-tape programmers learn in the school of hard knocks is that such an infrequent task can justify a small internal system that: 1. checks the date and time; 2. checks the database to see whether the Node.js job already completed successfully; 3. if the job is due, does the deed; and 4. if everything is already topped off, reports success and does nothing. That way, if the computer/network/system is down on the due date, it hurries to catch up as soon as it comes back up, and in the event of a failure it retries until all health checks pass. – Eric Leschinski Aug 21 '22 at 16:55
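The self-healing check described in that comment can be sketched in a few lines of Node.js. This is only an illustration; the names are hypothetical, and `lastSuccess` would really come from a status row in the database recording when the yearly job last completed:

```javascript
// Decide whether the yearly job is due, given the timestamp of the last
// successful run (or null if it has never run). A wrapper would call this
// on an interval or at boot, so a machine that was down on the due date
// catches up as soon as it comes back online.
function isYearlyJobDue(lastSuccess, now = new Date()) {
  if (lastSuccess === null) return true; // never ran successfully: do it now
  // Due once per calendar year: run whenever the last successful run
  // happened in an earlier year than "now".
  return lastSuccess.getUTCFullYear() < now.getUTCFullYear();
}
```

On success the job would write the new timestamp back to the database, which is what makes step 4 ("report success and do nothing") work on subsequent checks.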
4 Answers
I would use a cron job. You can use this website (https://crontab.guru/every-year) to work out the correct settings for the crontab.
0 0 1 12 *
This setting will run "at 00:00 on day-of-month 1 in December", i.e. once a year. (Note that 0 0 1 12 1 would additionally fire on every Monday in December, because cron treats day-of-month and day-of-week as an OR when both fields are restricted.)
To run the Node.js program you simply put node yourprogram.js afterwards, so the entry looks like the line below. You may need to give the full path to the node binary, and the full path to yourprogram.js as well.
0 0 1 12 * node yourprogram.js
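In practice, cron runs with a minimal environment, so it is safer to use absolute paths and capture the program's output for later inspection. A sketch of a full crontab entry (all paths here are illustrative):

```
# min hour day-of-month month day-of-week  command
0 0 1 12 * /usr/bin/node /home/user/app/yourprogram.js >> /home/user/app/cron.log 2>&1
```

Running `which node` and `realpath yourprogram.js` on your machine gives the real paths to substitute.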

Hi, I'd offer a suggestion, but it depends on which services you use. For example, on Google Cloud you could use Cloud Scheduler; on OpenShift or another Kubernetes platform you can use a CronJob. In the worst case, you write some YAML deployment files and wire the trigger through a publisher/subscriber system:
- Create a subscriber service that is triggered by a Google Pub/Sub topic, performs your task, and publishes back to the broker (Google Pub/Sub) when everything has executed.
- Then create another subscriber that deletes the files once it receives the message that all tasks have executed.
I suggest this because a process like this is best handled asynchronously.
Thanks,
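The CronJob route mentioned above can be sketched roughly like this. This is a minimal, hypothetical manifest; the names, image, and schedule are placeholders:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: census-yearly-load
spec:
  schedule: "0 0 1 12 *"       # 00:00 on 1 December, once a year
  concurrencyPolicy: Forbid    # never run two loads at once
  jobTemplate:
    spec:
      backoffLimit: 3          # retry the Job a few times on failure
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: loader
              image: registry.example.com/census-loader:latest
              command: ["node", "yourprogram.js"]
```

`backoffLimit` gives some of the retry-until-healthy behaviour discussed in the comments for free.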

I would look into the AWS Batch service, which can run a scheduled job on an EC2 instance (virtual machine) or on Fargate (a serverless container runner).
Alternative #2: Use an AWS Lambda serverless function to execute a Node.js script (no need to set up an EC2 instance or Fargate). Lambda functions can be triggered by EventBridge rules using cron expressions. With Lambda you pay for the number of executions and the execution time in 1 ms increments; this use case could even be covered by the AWS Free Tier Lambda pricing. AWS Free Tier
- Note on Lambda limits: Lambda execution time is limited to 15 minutes and ephemeral storage to 10 GB maximum (source: Lambda Quotas). Lambda CPU is allocated in proportion to the memory configuration, so you may need to increase memory to improve execution time. Lambda Memory Configuration
Alternative #3: You can build a state machine using AWS Step Functions to trigger Lambda functions in steps.
- For example, a state machine can trigger three Lambda functions in parallel, where each function downloads its corresponding .zip file from census.gov and stores it in an Amazon S3 bucket. When all three complete, the state machine can progress to the next step and trigger a fourth function to grab the data from S3 for processing and loading into the database. Once the data has been processed and loaded, a final step can delete the .zip files from S3 if you no longer need them. EventBridge can also be used here to execute the state machine on a cron expression. You can also use Amazon SNS to publish notifications (email/SMS/HTTP endpoint) to alert if any step fails or completes.
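That flow could be sketched in Amazon States Language roughly as follows. This is a minimal sketch; all function names and ARNs are placeholders, and error handling (Retry/Catch, SNS notification) is omitted for brevity:

```json
{
  "Comment": "Yearly Census load: download three zips in parallel, process, clean up",
  "StartAt": "DownloadZips",
  "States": {
    "DownloadZips": {
      "Type": "Parallel",
      "Branches": [
        { "StartAt": "DownloadA", "States": { "DownloadA": { "Type": "Task",
          "Resource": "arn:aws:lambda:us-east-1:123456789012:function:download-zip-a", "End": true } } },
        { "StartAt": "DownloadB", "States": { "DownloadB": { "Type": "Task",
          "Resource": "arn:aws:lambda:us-east-1:123456789012:function:download-zip-b", "End": true } } },
        { "StartAt": "DownloadC", "States": { "DownloadC": { "Type": "Task",
          "Resource": "arn:aws:lambda:us-east-1:123456789012:function:download-zip-c", "End": true } } }
      ],
      "Next": "ProcessAndLoad"
    },
    "ProcessAndLoad": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:process-and-load",
      "Next": "CleanUpS3"
    },
    "CleanUpS3": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:delete-zips",
      "End": true
    }
  }
}
```

The `Parallel` state only proceeds to `ProcessAndLoad` after all three branches succeed, which is the coordination the bullet above relies on.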

- 66
- 2
- 4
The simple solution is to schedule an AWS Lambda function using CloudWatch Events (now Amazon EventBridge).
So, you will have an AWS Lambda function that downloads the .zip
files into an S3 bucket, unzips them, and extracts the data to the database. After that, the same function can empty the S3 bucket.
This function will be triggered yearly by a CloudWatch Events rule with a cron expression such as cron(0 0 1 12 ? *).
For more information, check out this tutorial here

- That is a good solution, but we need to take Lambda's timeout into account. The download, at around 7 GB, takes time: it's not that AWS couldn't download fast, but Census.gov has a slow sending speed, and at almost half an hour spent downloading we might hit Lambda's timeout. – Sam Aug 28 '22 at 21:21
- If Lambda's timeout is a constraint, then you can use a server (an EC2 instance) with an endpoint that is triggered yearly by CloudWatch Events. That endpoint would download the .zip files into the S3 buckets and do the required work. – Abdullah Danyal Sep 01 '22 at 11:29