
While learning how to use graphs with Cosmos DB, I found two Microsoft tutorials: one using Gremlin.Net and one using Microsoft.Azure.Graphs.

Although I use the same query with both, it executes differently.

With Gremlin.Net, the query executes in a single call. Very often (I'd say 70% of the time) I get a RequestRateTooLargeException. If I understand correctly, this means I keep hitting the 400 RU/s limit that I chose to start with. However, when the query goes through, it is twice as fast as the solution with Microsoft.Azure.Graphs.

Indeed, with Microsoft.Azure.Graphs, I have to call ExecuteNextAsync in a loop, which returns one result at a time.

So the questions are:
1°) Which method should I use for better performance?
2°) How can I know the RU of my query so I can fine tune it?
3°) Is it possible to increase the throughput of an existing collection?

Update

Re question 3, I found that in the "Data Explorer" blade of my database, there is a "Scale & Settings" for my graph where I can update the throughput.

Update2

Re question 2, we can't get the RU charge when using the first method (Gremlin.Net), but with Microsoft.Azure.Graphs, ExecuteNextAsync returns a FeedResponse that has a RequestCharge property.

François

2 Answers


The reason you are hitting RequestRateTooLarge exceptions (429 status code) via Gremlin.NET vs Microsoft.Azure.Graphs is likely due to the difference between the retry policy on CosmosDB Gremlin server vs the default retry policy for DocumentClient.

The default retry behavior for DocumentClient with regards to these errors is described here:

By default, the DocumentClientException with status code 429 is returned after a cumulative wait time of 30 seconds if the request continues to operate above the request rate.

Therefore, Microsoft.Azure.Graphs may be internally handling and retrying these errors from the server and eventually succeeding. This has the benefit of improving request reliability but obfuscates the request rate errors, and will impact execution duration.

On CosmosDB Gremlin server, this retry policy allowance is reduced significantly, so RequestRateTooLargeException errors will be surfaced sooner.

To answer your questions:

1. Which method should I use for better performance?

Using CosmosDB Gremlin server via Gremlin.NET is expected to give better performance. Microsoft.Azure.Graphs uses a different request processing approach that involves more round-trips to the server, so it has more overhead, and it is also a number of releases behind what is deployed to the server.

2. How can I know the RU of my query so I can fine tune it?

RU charges will be included in the Gremlin server responses as attributes. Currently, Gremlin.NET has no way of exposing attributes on the response; however, changes to the client driver are being discussed here.

In the interim, you can observe how frequently your requests hit 429 errors through the Metrics blade on your Azure CosmosDB Account portal. This presents aggregated views of the number of requests, requests that exceeded capacity, storage quota, etc. for a given collection.

3. Is it possible to increase the throughput of an existing collection?

As you found, you can increase throughput for an existing collection via the portal. Alternatively, this can be done programmatically via the Microsoft.Azure.Documents SDK.


In closing, my recommendation would be to add a retry policy around Gremlin.NET requests to handle these exceptions and match on RequestRateTooLargeException message.
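As a rough, language-neutral illustration of such a retry policy, here is a minimal Python sketch. It is not Gremlin.NET code: `submit_with_retry`, its parameters, and the `submit` callable are hypothetical names, and the exponential backoff schedule is an assumption rather than anything prescribed by Cosmos DB. The only part taken from the advice above is matching on the RequestRateTooLargeException message.

```python
import time


def submit_with_retry(submit, query, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call submit(query), retrying only when the error looks like throttling.

    `submit` is a hypothetical stand-in for whatever function sends the
    Gremlin query; it is assumed to raise an exception whose message
    contains "RequestRateTooLarge" when the server returns a 429.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return submit(query)
        except Exception as exc:
            throttled = "RequestRateTooLarge" in str(exc)
            # Re-raise anything that is not throttling, or the last failed attempt.
            if not throttled or attempt == max_attempts:
                raise
            # Assumed policy: exponential backoff (0.5 s, 1 s, 2 s, ...).
            sleep(base_delay * 2 ** (attempt - 1))
```

The same shape carries over to C#: wrap the Gremlin.NET call in a loop, catch the exception, check its message, and delay before the next attempt.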

When response status attributes are exposed on Gremlin.NET, they will include:

  • Request charge,
  • CosmosDB specific status code (eg. 429), and
  • Retry-after value, which indicates the time to wait in order to avoid hitting 429 errors.
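
To illustrate how these three attributes could drive client logic once they are available, here is a small Python sketch. The dictionary keys (`request_charge`, `status_code`, `retry_after_ms`) are placeholders chosen for the example, not the actual attribute names returned by the server, and the retry-after value is assumed to be expressed in milliseconds.

```python
def summarize_attributes(attrs):
    """Return (request_charge, throttled, wait_seconds) from a status-attribute dict.

    The key names are hypothetical placeholders; substitute whatever the
    driver actually exposes.
    """
    charge = float(attrs.get("request_charge", 0.0))
    status = int(attrs.get("status_code", 200))
    throttled = status == 429
    # Only wait when throttled; the value is assumed to be in milliseconds.
    wait_seconds = float(attrs.get("retry_after_ms", 0)) / 1000 if throttled else 0.0
    return charge, throttled, wait_seconds
```

Summing the returned charge across requests gives the RU cost of a workload, and the wait value can replace a fixed backoff delay in a retry loop.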
Oliver Towers
  • The response status attributes have finally made it into Gremlin.NET v3.4.0-rc2, released less than two weeks ago. You get exactly what's described above: *request charge* (RUs), *failure status code* (eg 429) and *retry-after*. – Cristian Diaconescu Oct 08 '18 at 16:42
  • Attribution: [mailing list](https://lists.apache.org/thread.html/5c31225ad90b1ae4637464bc5d89bcc47e113dc5694f60fa1a9c2cd0@%3Cdev.tinkerpop.apache.org%3E) and [GitHub issue](https://github.com/apache/tinkerpop/pull/933). Took me some time spelunking the discussion threads to find this. – Cristian Diaconescu Oct 08 '18 at 16:49

My understanding is that the Gremlin.Net driver is faster; however, since Cosmos DB doesn't support bytecode, you're stuck with the document client for interacting with the database.

https://github.com/Azure/azure-documentdb-dotnet/issues/439

https://groups.google.com/forum/#!topic/gremlin-users/Ve-cEZed94o

  • Please write here the solution instead of including a link that could be broken in the future. Thanks! – Ignacio Ara Jun 01 '18 at 15:33
  • Neither are solutions, they're references to it being a known issue, otherwise I'd copy it over. I guess I could delete the links, but at least that first one is the actual github issue in case anyone else wants to follow it. – gabrielthursday Jun 01 '18 at 16:04