7

We're using AWS Lambda (Java) and the elasticsearch client to connect to a hosted elasticsearch instance on AWS. I encounter a long wait on the first request of about 2.5 seconds (on top of a cold start). After that it is very quick. I can't really figure out where this delay is coming from and I'm trying to optimize it.

private void testPerformanceElasticSearch() throws Exception {
        log.info("1. Before testing elasticsearch client");
        AWS4Signer signer = new AWS4Signer();
        signer.setServiceName("es");
        signer.setRegionName("eu-west-1");
        HttpRequestInterceptor interceptor = new AWSRequestSigningApacheInterceptor("es", signer, new DefaultAWSCredentialsProviderChain());
        String endpoint = "https://" + Utils.getEnvironmentVariable("ELASTIC_SEARCH_ENDPOINT");
        RestHighLevelClient restHighLevelClient = new RestHighLevelClient(RestClient.builder(HttpHost.create(endpoint)).setHttpClientConfigCallback(hacb -> hacb.addInterceptorLast(interceptor)));
        log.info("2. After getting elasticsearch client");
        log.info("3. Before doing a elasticsearch query");
        log.info("4");
        BoolQueryBuilder boolQueryBuilder = QueryBuilders.boolQuery();
        log.info("5");
        TermsQueryBuilder termsQueryBuilder = QueryBuilders.termsQuery("userId", "abc");
        log.info("6");
        boolQueryBuilder.must(termsQueryBuilder);
        log.info("7");
        SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
        log.info("8");
        searchSourceBuilder.query(boolQueryBuilder);
        log.info("9");
        SearchRequest searchRequest = new SearchRequest("users");
        log.info("10");
        searchRequest.source(searchSourceBuilder);
        log.info("11");
        restHighLevelClient.search(searchRequest);
        log.info("12");
        log.info("13. After testing elasticsearch");
}

And then I get logging like this; you can see between '5' and '6' there is more than a 2 second delay which I can't really place:

17:16:06.871INFO[PlacesPerformance] 1. Before testing elasticsearch client
17:16:06.932INFO[PlacesPerformance] 2. After getting elasticsearch client
17:16:06.933INFO[PlacesPerformance] 3. Before doing a elasticsearch query
17:16:06.935INFO[PlacesPerformance] 4
17:16:06.942INFO[PlacesPerformance] 5
17:16:09.179INFO[PlacesPerformance] 6
17:16:09.179INFO[PlacesPerformance] 7
17:16:09.181INFO[PlacesPerformance] 8
17:16:09.181INFO[PlacesPerformance] 9
17:16:09.183INFO[PlacesPerformance] 10
17:16:09.183INFO[PlacesPerformance] 11
17:16:09.362INFO[PlacesPerformance] 12
17:16:09.362INFO[PlacesPerformance] 13. After testing elasticsearch

Any suggestions on how to improve this?

UPDATE:

Strange. Whenever I run the code in a lambda, I experience the 2.5 second delay when constructing the request (not even executing it). Locally, it works fine though. I tried the following:

1.  Local against local elasticsearch. No delay.
2.  Local against AWS elasticsearch. No delay.
3.  Lambda with signing request. DELAY.
4.  Lambda without signing request. DELAY.
5.  Lambda with a 'match all' query. DELAY
6.  Lambda with a http address. DELAY.
7.  Lambda with a custom runtime. DELAY.
8.  Lambda with a custom runtime. DELAY.
9.  Lambda with standard Java 8 runtime. DELAY.
evdh
  • 229
  • 2
  • 7
  • That's why it's called cold start, you would need to look for another kind of solution like heart beat pattern. – Hosar Apr 29 '19 at 01:45

2 Answers2

6

The problem is that at the first request (real request, not warmup request as warmup requests don't go through your application code, it doesn't trigger loading classes which are used in actual request path) JVM loads (read, parse, verify, etc ...) related classes, initializes security components (ciphers, etc ...) and TLS handshake is done (requires multiple RTT, with Java 9 and TLS 1.3 this should be reduced).

The similar long duration behaviour is also seen for first AWS service calls (DynamoDB, SQS, etc ...)

As I am the author Thundra warmup plugin, I am thinking of introducing hook points for warmup messages as custom action will be able to executed like initializing security components, loading classes, etc ...

serkan_ozal
  • 194
  • 6
  • Yeah I'm already doing exactly that. I've defined a 'warmup' interface that all my handlers implement. Then, whenever the lambda is called by thundra, this will be called and all clients are initialized. So whenever a 'real' request comes in, everything is ready to go and requests will be quick. So in this case, even my elasticsearch problem isn't really a problem, but I also wanted to make sure that cold starts are quicker as there is always a change that users scale up quick and a cold start is needed. Regarding some more elasticsearch client performance investigation, see above. – evdh Mar 05 '19 at 14:06
0

Lambda functions inside VPCs have a great impact on the startup time. You said your ES is a hosted instance, so I assume it's backed by a VPC.

Even if it's not in a VPC, Java cold starts are usually, by nature, longer than runtimes like Node or Python, because the JVM needs to be started up first. This is mainly where your 2.5 seconds come from.

OK. How to fix the issue?

It depends on how many concurrent connections you need to ElasticSearch. If one function is able to handle all the incoming requests, you can then limit the concurrent execution of your Lambda function to 1, so you make sure you are always hitting the same container (as long as these requests are made in a ±5 min time frame).

Now, if you don't know upfront how many concurrent Lambda functions will execute, you kind of have no way out. You could try warming up your Lambda functions beforehand, but then you'd need to fire like 100 requests at the same time to warm up 100 different containers.

Please check this answer as I go through the concurrent model of Lambda functions and how the cold/warm starts work.

I am happy to edit my answer if you have more info to share or if I wasn't clear enough.

Thales Minussi
  • 6,965
  • 1
  • 30
  • 48
  • 1
    Thanks! The lambdas are not running in a VPC and I use the managed Elasticsearch by AWS (https://aws.amazon.com/elasticsearch-service). The problem isn't cold starts; I'm fully aware of those and warm them up (https://docs.thundra.io/docs/how-to-warmup). As you can see above, the JVM is started, all code loaded and the constructor of the lambda called. The issue is while trying to make the first call, it suddenly waits for 2 seconds before continuing. It also does it in a strange place; I would have expected it more when instantiating the client or high level rest client but it waits after. – evdh Mar 05 '19 at 06:00
  • Sorry for not fully understanding the problem at a first glance. So, after your first call, if you execute it again, it then just runs immediately? – Thales Minussi Mar 05 '19 at 06:12
  • Exactly. So it must be something in the elasticsearch client, but I can't figure out where the delays occurs so can't try to optimize it either. – evdh Mar 05 '19 at 07:41
  • By checking your logs, this line is the offending one: TermsQueryBuilder termsQueryBuilder = QueryBuilders.termsQuery("userId", "abc"); Can't you try querying your whole index without specifying anything just to try to understand where the bottleneck is? – Thales Minussi Mar 05 '19 at 07:49
  • Done that, see my updated question above. Quite strange why it happens only when deployed to a lambda and not locally. – evdh Mar 05 '19 at 14:49