
I am building an application backed by a Neptune database. Because I want the application to be scalable, I am using AWS Lambda + API Gateway to build a REST API that interacts with the database. This seems like a reasonable approach, given that this use case is documented in the Neptune docs.

The Neptune docs recommend reusing the websocket connection to the database across the entire execution context of the function, which is what I am doing at the moment. The docs also recommend resetting the connection and retrying upon errors (see here), which I am also doing. However, I am still seeing exceptions every now and then (perhaps one in every 20 requests on average). One of the exceptions I get is

ConnectionResetError: Cannot write to closing transport

which seems to be the same as this issue.

The other one is:

Traceback (most recent call last):
  File "/var/task/chalice/app.py", line 1685, in _get_view_function_response
    response = view_function(**function_args)
  File "/var/task/app.py", line 57, in resource
    return Resource(app.current_request, g).process()
  File "/var/task/backoff/_sync.py", line 94, in retry
    ret = target(*args, **kwargs)
  File "/var/task/chalicelib/handlers/resource.py", line 106, in get
    values = resources.valueMap().with_(WithOptions.tokens).toList()
  File "/var/task/gremlin_python/process/traversal.py", line 57, in toList
    return list(iter(self))
  File "/var/task/gremlin_python/process/traversal.py", line 47, in __next__
    self.traversal_strategies.apply_strategies(self)
  File "/var/task/gremlin_python/process/traversal.py", line 548, in apply_strategies
    traversal_strategy.apply(traversal)
  File "/var/task/gremlin_python/driver/remote_connection.py", line 63, in apply
    remote_traversal = self.remote_connection.submit(traversal.bytecode)
  File "/var/task/gremlin_python/driver/driver_remote_connection.py", line 60, in submit
    results = result_set.all().result()
  File "/var/lang/lib/python3.7/concurrent/futures/_base.py", line 435, in result
    return self.__get_result()
  File "/var/lang/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/var/task/gremlin_python/driver/resultset.py", line 90, in cb
    f.result()
  File "/var/lang/lib/python3.7/concurrent/futures/_base.py", line 428, in result
    return self.__get_result()
  File "/var/lang/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/var/lang/lib/python3.7/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/var/task/gremlin_python/driver/connection.py", line 82, in _receive
    data = self._transport.read()
  File "/var/task/gremlin_python/driver/aiohttp/transport.py", line 104, in read
    raise RuntimeError("Connection was already closed.")
RuntimeError: Connection was already closed.

In case it is relevant, I am using gremlinpython==3.5.1

It seems to me that these issues are all ultimately a consequence of using AWS Lambda, namely the mismatch between the longevity of websocket connections and the ephemeral nature of Lambda execution contexts. The question then is: am I doing the wrong thing by trying to use AWS Lambda for my API? Would it be more appropriate to set up an EC2 instance and deal with scalability in some other way?

P.S. Previously I created and closed a connection in every function execution (as the Neptune docs used to recommend), which worked fine but was naturally slow.
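
For reference, this is roughly the connection-reuse pattern I am using at the moment (a simplified sketch: the endpoint is a placeholder and the retry/error handling is trimmed):

from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

# Placeholder; the real value points at my Neptune cluster endpoint.
NEPTUNE_ENDPOINT = 'wss://my-cluster.cluster-xxxx.us-east-1.neptune.amazonaws.com:8182/gremlin'

# Module level, so the connection is reused across invocations that land
# on the same Lambda execution context.
conn = None
g = None

def get_traversal():
    # Create the connection lazily and reuse it on subsequent invocations.
    global conn, g
    if conn is None:
        conn = DriverRemoteConnection(NEPTUNE_ENDPOINT, 'g')
        g = traversal().withRemote(conn)
    return g

def reset_connection():
    # Called when a request fails: close the old connection and force the
    # next call to open a fresh one, as the docs recommend.
    global conn, g
    if conn is not None:
        try:
            conn.close()
        except Exception:
            pass
    conn = None
    g = None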

nrg

2 Answers


The latest version of Neptune only supports Gremlin 3.4.11 (https://docs.aws.amazon.com/neptune/latest/userguide/engine-releases-1.0.5.1.html), so I would start by using gremlin-python 3.4.11 and see if that resolves your issue. Gremlin-python 3.5 replaced Tornado with aiohttp (ref) for websocket connections, and I suspect that change may be causing a slight change in behavior that a future release supporting Gremlin 3.5 will address.
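
For example, assuming you package the function with pip, pinning the client would just be a line in requirements.txt:

gremlinpython==3.4.11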


I wonder whether the 'Connection was already closed' error message is not being treated as a retriable error by your retry logic.

What happens if you add this error message to the list of retriable_error_msgs in the Python example in the docs?
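
Something along these lines, as a rough sketch rather than the exact code from the docs (the names reset_connection_on_backoff and query are illustrative, and the list simply contains the messages from your two tracebacks):

import backoff

# Error messages that should trigger a reset of the connection and a retry.
retriable_error_msgs = ['Cannot write to closing transport',
                        'Connection was already closed']

def is_non_retriable(e):
    # Give up only if the error message matches none of the retriable ones.
    msg = str(e)
    return not any(m in msg for m in retriable_error_msgs)

def reset_connection_on_backoff(details):
    # Placeholder: close and drop the cached websocket connection here,
    # so the retried attempt opens a fresh one.
    pass

@backoff.on_exception(backoff.constant,
                      Exception,
                      interval=1,
                      max_tries=5,
                      giveup=is_non_retriable,
                      on_backoff=reset_connection_on_backoff)
def query(g):
    return g.V().limit(1).toList()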

Ian Robinson
  • Thanks for the suggestion. I did indeed try that before, but the issue persisted. – nrg Nov 03 '21 at 16:33