Caching and invalidating AWS Lambda response

Question

I am trying to implement a solution on AWS which is as follows:

I have a crawler that will run once a day to index certain sites. I want to cache this data and expose it in the form of an API, since after crawling the data will not change for an entire day. After the crawler refetches, I want to invalidate and rebuild this cache to serve the updated data. I'm trying to use a serverless architecture to build this.

Possible Solutions

It is clear that the crawler will run on AWS Lambda. What is unclear to me is how to manage the cache that will serve the data. Here are some solutions I thought of:

S3 and CloudFront for caching: After crawling, store the data as .json files in S3, which will be cached by AWS CloudFront. When the crawler fetches new data, it will rebuild these files and ask CloudFront to invalidate the cache.
API Gateway and DynamoDB: After crawling, store the data in DynamoDB, which will then be served by API Gateway with caching enabled. The only problem here is: how can I ask for this cache to be invalidated at the end of the day when the crawler re-crawls? Since the data will be static for a day, how can I not pay for the extra time that DynamoDB will be running? (If I implement caching on API Gateway, there will be only one call to DynamoDB to fill the cache; after that it will be sitting idle for a day.)
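
For the first option, the crawler's last step could look roughly like the sketch below. It assumes `s3` and `cloudfront` behave like `boto3.client("s3")` and `boto3.client("cloudfront")`; the function name, bucket, and distribution ID are hypothetical, and the clients are passed in so the sketch can be tried without AWS credentials.

```python
import json
import time


def publish_and_invalidate(s3, cloudfront, bucket, distribution_id, documents):
    """Upload each crawled document as JSON, then invalidate the cached copies.

    `s3` and `cloudfront` are assumed to expose put_object and
    create_invalidation the way the boto3 clients do.
    """
    paths = []
    for key, data in documents.items():
        s3.put_object(
            Bucket=bucket,
            Key=key,
            Body=json.dumps(data).encode("utf-8"),
            ContentType="application/json",
        )
        paths.append("/" + key)  # CloudFront invalidation paths start with "/"

    # CallerReference must be unique for each invalidation request.
    cloudfront.create_invalidation(
        DistributionId=distribution_id,
        InvalidationBatch={
            "Paths": {"Quantity": len(paths), "Items": paths},
            "CallerReference": str(time.time_ns()),
        },
    )
    return paths
```

With this approach the client URLs stay the same from day to day; only the cached copies at the edge are dropped after each crawl.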

Is there any other way that I am missing?

Thanks!

You don't pay for DynamoDB 'sitting idle' for a day. – Noel Llevares Jul 26 '17 at 13:51

score 1 · Answer 1 · answered Jul 25 '17 at 10:34

You can store new data under a different path in S3 that includes the date of creation, for example:

index_2017_08_11.json

Then there is no need to invalidate caches on the CloudFront side. The new objects are reached through new URLs, so the old CloudFront cache won't be an issue. You can remove the previous day's files with an S3 lifecycle expiration rule.
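
A minimal sketch of the naming scheme and a matching cleanup rule follows. The rule dictionary uses the shape that boto3's `put_bucket_lifecycle_configuration` accepts inside `LifecycleConfiguration["Rules"]`; the helper names, the rule ID, and the two-day retention are assumptions made for illustration.

```python
from datetime import date


def dated_key(day, prefix="index"):
    """Build a key such as index_2017_08_11.json for the given date."""
    return f"{prefix}_{day:%Y_%m_%d}.json"


def expiration_rule(prefix="index_", days=2):
    """Lifecycle rule that expires objects under `prefix` after `days` days.

    The ID is a hypothetical name; the retention period is an assumption.
    """
    return {
        "ID": "expire-old-crawl-results",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Expiration": {"Days": days},
    }


print(dated_key(date(2017, 8, 11)))  # index_2017_08_11.json
```

The client then has to compute the current day's key itself, which is the price of skipping invalidation.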

Another option is to use the Expires HTTP caching header to specify when the cached data should expire:

The Expires header field lets you specify an expiration date and time using the format specified in RFC 2616, Hypertext Transfer Protocol -- HTTP/1.1 Section 3.3.1, Full Date, for example: Sat, 27 Jun 2015 23:59:59 GMT

You can set this header in API Gateway to specify when an object should be invalidated.
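
As a rough sketch, a Lambda function behind API Gateway (proxy integration) could compute the header like this. It assumes the "day" ends at 23:59:59 UTC, which may not match the crawler's actual schedule; `end_of_day_expires` is a hypothetical helper and the response body is a placeholder for the crawled data.

```python
import json
from datetime import datetime, time, timezone
from email.utils import format_datetime


def end_of_day_expires(now):
    """Return an HTTP-date for 23:59:59 UTC on the day of `now` (assumed UTC)."""
    end = datetime.combine(now.date(), time(23, 59, 59), tzinfo=timezone.utc)
    return format_datetime(end, usegmt=True)


def handler(event, context):
    """Lambda proxy-style response that carries the Expires header."""
    body = {"status": "placeholder"}  # stand-in for the crawled data
    return {
        "statusCode": 200,
        "headers": {
            "Content-Type": "application/json",
            "Expires": end_of_day_expires(datetime.now(timezone.utc)),
        },
        "body": json.dumps(body),
    }


print(end_of_day_expires(datetime(2015, 6, 27, 10, 0, tzinfo=timezone.utc)))
# Sat, 27 Jun 2015 23:59:59 GMT
```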

Since the data will be static for a day, how can I not pay for the extra time that DynamoDB will be running

If the data is static, can you store it in S3 and use API Gateway to serve it from S3 instead of DynamoDB?

Ivan Mushketyk

Since the URLs on the client side will be statically coded, it would be better to invalidate the cache and ask CloudFront to revalidate it so that the client URLs don't need to be changed. The data is static for one day but will change every day, and storing JSON files in S3 seems more like a hack than a real solution. Is there anything else I can use in AWS for my use case? I want to use Lambda -> Cache somewhere for 1 day -> Serve cache through an API -> Invalidate after a day – Salman Hasrat Khan Jul 26 '17 at 06:12
What about this: Lambda crawler -> writes to S3; Client -> CloudFront (caches data at the edge) -> API Gateway -> Lambda (sets Expires header) -> S3. – Ivan Mushketyk Jul 26 '17 at 09:18
In this case you will have one request to Lambda per day: if you set the "Expires" header, CloudFront will keep the cached response until the end of the day and will make another request only once the cache expires. – Ivan Mushketyk Jul 26 '17 at 09:19
