1

I have problem with cosmos emulator while read or write documents in parallel. It throw 408 RequestTimeOutException. But it works when documents created or read one by one. I also already check and found this issue appears in other users with emulator, but I didn't find the solution. Any suggestion how to fix?

{
  "ClassName": "CosmosException",
  "userAgent": "azsdk-java-cosmos/4.30.1 Linux/5.13.0-52-generic JRE/1.8.0_312",
  "statusCode": 408,
  "resourceAddress": "https://172.19.0.1:8081/dbs/Test/colls/Entity/docs",
  "error": "{\"code\":\"RequestTimeout\",\"message\":\"Message: Request timed out. More info: https://aka.ms/cosmosdb-tsg-request-timeout\\r\\nActivityId: 4b389ccb-eb28-4359-a266-43542380743d, Request URI: /apps/DocDbApp/services/DocDbServer3/partitions/a4cb494f-38c8-11e6-8106-8cdcd42c33be/replicas/1p/, RequestStats: \\r\\nRequestStartTime: 2022-07-14T19:16:37.2339238Z, RequestEndTime: No response recorded; Current Time: 2022-07-14T19:16:48.0040567Z,  Number of regions attempted:1\\r\\n{\\\"systemHistory\\\":[{\\\"dateUtc\\\":\\\"2022-07-14T19:15:42.2854148Z\\\",\\\"cpu\\\":100.000,\\\"memory\\\":278016.000,\\\"threadInfo\\\":{\\\"isThreadStarving\\\":\\\"False\\\",\\\"threadWaitIntervalInMs\\\":0.0728,\\\"availableThreads\\\":32766,\\\"minThreads\\\":8,\\\"maxThreads\\\":32767}},{\\\"dateUtc\\\":\\\"2022-07-14T19:15:52.2978001Z\\\",\\\"cpu\\\":100.000,\\\"memory\\\":278792.000,\\\"threadInfo\\\":{\\\"isThreadStarving\\\":\\\"False\\\",\\\"threadWaitIntervalInMs\\\":0.0186,\\\"availableThreads\\\":32766,\\\"minThreads\\\":8,\\\"maxThreads\\\":32767}},{\\\"dateUtc\\\":\\\"2022-07-14T19:16:02.2997627Z\\\",\\\"cpu\\\":100.000,\\\"memory\\\":277472.000,\\\"threadInfo\\\":{\\\"isThreadStarving\\\":\\\"False\\\",\\\"threadWaitIntervalInMs\\\":0.0195,\\\"availableThreads\\\":32766,\\\"minThreads\\\":8,\\\"maxThreads\\\":32767}},{\\\"dateUtc\\\":\\\"2022-07-14T19:16:12.3148157Z\\\",\\\"cpu\\\":100.000,\\\"memory\\\":213544.000,\\\"threadInfo\\\":{\\\"isThreadStarving\\\":\\\"False\\\",\\\"threadWaitIntervalInMs\\\":0.0142,\\\"availableThreads\\\":32766,\\\"minThreads\\\":8,\\\"maxThreads\\\":32767}},{\\\"dateUtc\\\":\\\"2022-07-14T19:16:22.3187880Z\\\",\\\"cpu\\\":100.000,\\\"memory\\\":141324.000,\\\"threadInfo\\\":{\\\"isThreadStarving\\\":\\\"False\\\",\\\"threadWaitIntervalInMs\\\":0.0391,\\\"availableThreads\\\":32755,\\\"minThreads\\\":8,\\\"maxThreads\\\":32767}},{\\\"dateUtc\\\":\\\"2022-07-14T19:16:32.3371596Z\\\",\\\"cpu\\\":100.000,\\\"memory\\\":165764.000,\\\"threadInfo\\\":{\\\"isThreadStarving\\\":\\\"False\\\",\\\"threadWaitIntervalInMs\\\":0.0084,\\\"availableThreads\\\":32766,\\\"minThreads\\\":8,\\\"maxThreads\\\":32767}}]}\\r\\n, SDK: Microsoft.Azure.Documents.Common/2.14.0, StatusCode: RequestTimeout\",\"additionalErrorInfo\":null}",
  "innerErrorMessage": "Message: Request timed out. More info: https://aka.ms/cosmosdb-tsg-request-timeout\r\nActivityId: 4b389ccb-eb28-4359-a266-43542380743d, Request URI: /apps/DocDbApp/services/DocDbServer3/partitions/a4cb494f-38c8-11e6-8106-8cdcd42c33be/replicas/1p/, RequestStats: \r\nRequestStartTime: 2022-07-14T19:16:37.2339238Z, RequestEndTime: No response recorded; Current Time: 2022-07-14T19:16:48.0040567Z,  Number of regions attempted:1\r\n{\"systemHistory\":[{\"dateUtc\":\"2022-07-14T19:15:42.2854148Z\",\"cpu\":100.000,\"memory\":278016.000,\"threadInfo\":{\"isThreadStarving\":\"False\",\"threadWaitIntervalInMs\":0.0728,\"availableThreads\":32766,\"minThreads\":8,\"maxThreads\":32767}},{\"dateUtc\":\"2022-07-14T19:15:52.2978001Z\",\"cpu\":100.000,\"memory\":278792.000,\"threadInfo\":{\"isThreadStarving\":\"False\",\"threadWaitIntervalInMs\":0.0186,\"availableThreads\":32766,\"minThreads\":8,\"maxThreads\":32767}},{\"dateUtc\":\"2022-07-14T19:16:02.2997627Z\",\"cpu\":100.000,\"memory\":277472.000,\"threadInfo\":{\"isThreadStarving\":\"False\",\"threadWaitIntervalInMs\":0.0195,\"availableThreads\":32766,\"minThreads\":8,\"maxThreads\":32767}},{\"dateUtc\":\"2022-07-14T19:16:12.3148157Z\",\"cpu\":100.000,\"memory\":213544.000,\"threadInfo\":{\"isThreadStarving\":\"False\",\"threadWaitIntervalInMs\":0.0142,\"availableThreads\":32766,\"minThreads\":8,\"maxThreads\":32767}},{\"dateUtc\":\"2022-07-14T19:16:22.3187880Z\",\"cpu\":100.000,\"memory\":141324.000,\"threadInfo\":{\"isThreadStarving\":\"False\",\"threadWaitIntervalInMs\":0.0391,\"availableThreads\":32755,\"minThreads\":8,\"maxThreads\":32767}},{\"dateUtc\":\"2022-07-14T19:16:32.3371596Z\",\"cpu\":100.000,\"memory\":165764.000,\"threadInfo\":{\"isThreadStarving\":\"False\",\"threadWaitIntervalInMs\":0.0084,\"availableThreads\":32766,\"minThreads\":8,\"maxThreads\":32767}}]}\r\n, SDK: Microsoft.Azure.Documents.Common/2.14.0, StatusCode: RequestTimeout",
  "causeInfo": null,
  "responseHeaders": "{Transfer-Encoding=chunked, Server=Microsoft-HTTPAPI/2.0, Access-Control-Allow-Origin=, Access-Control-Allow-Credentials=true, x-ms-gatewayversion=version=2.14.0, x-ms-request-validation-failure=1, Date=Thu, 14 Jul 2022 19:16:48 GMT, x-ms-activity-id=4b389ccb-eb28-4359-a266-43542380743d, Content-Type=application/json}",
  "requestHeaders": "[x-ms-cosmos-intended-collection-rid=dOlHAJx8QaM=, Accept=application/json, x-ms-date=Thu, 14 Jul 2022 19:16:37 GMT, x-ms-documentdb-partitionkey=[\"001G000000voY7WIAU\"], x-ms-documentdb-is-upsert=true, x-ms-consistency-level=Session, Content-Type=application/json]",
  "cosmosDiagnostics": {
    "userAgent": "azsdk-java-cosmos/4.30.1 Linux/5.13.0-52-generic JRE/1.8.0_312",
    "activityId": "4b389ccb-eb28-4359-a266-43542380743d",
    "requestLatencyInMs": 10782,
    "requestStartTimeUTC": "2022-07-14T19:16:37.232Z",
    "requestEndTimeUTC": "2022-07-14T19:16:48.014Z",
    "responseStatisticsList": [],
    "supplementalResponseStatisticsList": [],
    "addressResolutionStatistics": {},
    "regionsContacted": [
      "south central us"
    ],
    "retryContext": {
      "statusAndSubStatusCodes": null,
      "retryLatency": 0,
      "retryCount": 0
    },
    "metadataDiagnosticsContext": {
      "metadataDiagnosticList": null
    },
    "serializationDiagnosticsContext": {
      "serializationDiagnosticsList": [
        {
          "serializationType": "ITEM_SERIALIZATION",
          "startTimeUTC": "2022-07-14T19:16:37.232Z",
          "endTimeUTC": "222-07-14T19:16:37.232Z",
          "durationInMicroSec": 0
        }
      ]
    },
    "gatewayStatistics": {
      "sessionToken": null,
      "operationType": "Upser0t",
      "resourceType": "Document",
      "statusCode": 408,
      "subStatusCode": 0,
      "requestCharge": 0.0,
      "requestTimeline": [
        {
          "eventName": "connectionAcquired",
          "startTimeUTC": "2022-07-14T19:16:37.233Z",
          "durationInMicroSec": 0
        },
        {
          "eventName": "connectionConfigured",
          "startTimeUTC": "2022-07-14T19:16:37.233Z",
          "durationInMicroSec": 0
        },
        {
          "eventName": "requestSent",
          "startTimeUTC": "2022-07-14T19:16:37.233Z",
          "durationInMicroSec": 0
        },
        {
          "eventName": "transitTime",
          "startTimeUTC": "2022-07-14T19:16:37.233Z",
          "durationInMicroSec": 10776000
        },
        {
          "eventName": "received",
          "startTimeUTC": "2022-07-14T19:16:48.009Z",
          "durationInMicroSec": 3000
        }
      ],
      "partitionKeyRangeId": null,
      "exceptionMessage": "Message: Request timed out. More info: https://aka.ms/cosmosdb-tsg-request-timeout\r\nActivityId: 4b389ccb-eb28-4359-a266-43542380743d, Request URI: /apps/DocDbApp/services/DocDbServer3/partitions/a4cb494f-38c8-11e6-8106-8cdcd42c33be/replicas/1p/, RequestStats: \r\nRequestStartTime: 2022-07-14T19:16:37.2339238Z, RequestEndTime: No response recorded; Current Time: 2022-07-14T19:16:48.0040567Z,  Number of regions attempted:1\r\n{\"systemHistory\":[{\"dateUtc\":\"2022-07-14T19:15:42.2854148Z\",\"cpu\":100.000,\"memory\":278016.000,\"threadInfo\":{\"isThreadStarving\":\"False\",\"threadWaitIntervalInMs\":0.0728,\"availableThreads\":32766,\"minThreads\":8,\"maxThreads\":32767}},{\"dateUtc\":\"2022-07-14T19:15:52.2978001Z\",\"cpu\":100.000,\"memory\":278792.000,\"threadInfo\":{\"isThreadStarving\":\"False\",\"threadWaitIntervalInMs\":0.0186,\"availableThreads\":32766,\"minThreads\":8,\"maxThreads\":32767}},{\"dateUtc\":\"2022-07-14T19:16:02.2997627Z\",\"cpu\":100.000,\"memory\":277472.000,\"threadInfo\":{\"isThreadStarving\":\"False\",\"threadWaitIntervalInMs\":0.0195,\"availableThreads\":32766,\"minThreads\":8,\"maxThreads\":32767}},{\"dateUtc\":\"2022-07-14T19:16:12.3148157Z\",\"cpu\":100.000,\"memory\":213544.000,\"threadInfo\":{\"isThreadStarving\":\"False\",\"threadWaitIntervalInMs\":0.0142,\"availableThreads\":32766,\"minThreads\":8,\"maxThreads\":32767}},{\"dateUtc\":\"2022-07-14T19:16:22.3187880Z\",\"cpu\":100.000,\"memory\":141324.000,\"threadInfo\":{\"isThreadStarving\":\"False\",\"threadWaitIntervalInMs\":0.0391,\"availableThreads\":32755,\"minThreads\":8,\"maxThreads\":32767}},{\"dateUtc\":\"2022-07-14T19:16:32.3371596Z\",\"cpu\":100.000,\"memory\":165764.000,\"threadInfo\":{\"isThreadStarving\":\"False\",\"threadWaitIntervalInMs\":0.0084,\"availableThreads\":32766,\"minThreads\":8,\"maxThreads\":32767}}]}\r\n, SDK: Microsoft.Azure.Documents.Common/2.14.0, StatusCode: RequestTimeout",
      "exceptionResponseHeaders": "{Transfer-Encoding=chunked, Server=Microsoft-HTTPAPI/2.0, Access-Control-Allow-Origin=, Access-Control-Allow-Credentials=true, x-ms-gatewayversion=version=2.14.0, x-ms-request-validation-failure=1, Date=Thu, 14 Jul 2022 19:16:48 GMT, x-ms-activity-id=4b389ccb-eb28-4359-a266-43542380743d, Content-Type=application/json}"
    },
    "systemInformation": {
      "usedMemory": "137049 KB",
      "availableMemory": "7092391 KB",
      "systemCpuLoad": "(2022-07-14T19:16:40.317Z 17.4%), (2022-07-14T19:16:45.317Z 5.2%)",
      "availableProcessors": 8
    },
    "clientCfgs": {
      "id": 1,
      "machineId": "uuid:4d84cf5e-3796-4b2e-b86d-e050e75cbb01",
      "connectionMode": "GATEWAY",
      "numberOfClients": 1,
      "connCfg": {
        "rntbd": null,
        "gw": "(cps:1000, nrto:PT1M, icto:PT1M, p:false)",
        "other": "(ed: true, cs: true)"
      },
      "consistencyCfg": "(consistency: Session, mm: true, prgns: [])"
    }
  }
} 
Aquaelia
  • 73
  • 1
  • 6
  • Can you share examples of the errors you are experiencing? Have you already went through the official timeout troubleshooting: https://learn.microsoft.com/en-us/azure/cosmos-db/sql/troubleshoot-request-timeout-java-sdk-v4-sql? – Matias Quaranta Jul 14 '22 at 17:31
  • Edited topic, add cosmos diagnostics. This exception is always throwed when I try to upsert 2000 documents to container. Probably reason could be in some emulator restrictions that not specified in documentation. Similar topic: https://github.com/Azure/azure-cosmos-dotnet-v2/issues/669 – Aquaelia Jul 14 '22 at 19:29
  • I think the issue is high CPU utilization. Looking at the error text your CPU appears to be at 100%. More details here, https://learn.microsoft.com/en-us/azure/cosmos-db/sql/troubleshoot-dot-net-sdk-request-timeout?tabs=cpu-new#high-cpu-utilization – Mark Brown Jul 14 '22 at 22:00
  • Where are you running this? I've found that I don't get this problem when I run locally, but concurrent requests on the Azure hosted pipeline agents regularly return 408s and sometimes 503s – Aaron Zhong Nov 18 '22 at 05:00

1 Answers1

2

The timeout you shared is a failure on the Emulator. The Emulator consumes resources from the machine that runs it, it is not meant to do performance benchmarking (any result you obtain would be unrealistic because you will never connect to the Emulator on a production scenario).

The error is showing that the machine running the Emulator is constrained in resources (CPU) and that is the source of the timeouts.

The error property contains the response from the backend (the Emulator in this case) which contains the CPU information from the machine running the Emulator:

{"systemHistory":
    [
        {"dateUtc":"2022-07-14T19:15:42.2854148Z","cpu":100.000,"memory":278016.000,<..other data...>
        {"dateUtc":"2022-07-14T19:15:52.2978001Z","cpu":100.000,"memory":278792.000,<..other data...>
        {"dateUtc":"2022-07-14T19:16:02.2997627Z","cpu":100.000,"memory":277472.000,<..other data...>
        {"dateUtc":"2022-07-14T19:16:12.3148157Z","cpu":100.000,"memory":213544.000,<..other data...>
        {"dateUtc":"2022-07-14T19:16:22.3187880Z","cpu":100.000,"memory":141324.000,<..other data...>
        {"dateUtc":"2022-07-14T19:16:32.3371596Z","cpu":100.000,"memory":165764.000,<..other data...>
    ]
}

CPU on that machine is at 100%.

Matias Quaranta
  • 13,907
  • 1
  • 22
  • 47
  • Ok, understood. I try to allocate more resources for emulator but it does not help. The code where issue reproduced use parallel operation and blockLast in the end. Probably it requires a lot of resources. So currently we use azure instance for test pipelines and we try switch to the emulator, but just for now it is not working stable. – Aquaelia Jul 18 '22 at 08:45
  • Returning to this question because it still not resolved. Could you please tell how many request can emulator proceed in one minute? Maybe I am able to do some delays between db calls to prevent emulator overload. – Aquaelia Oct 18 '22 at 16:38
  • It depends on your machine. It's your machine running the Emulator, it's your machine's resources being consumed. – Matias Quaranta Oct 18 '22 at 16:55
  • So I using docker linux emulator with default config --memory 4g --cpus=2.0. I tried to allocate more resources(6,8 cpu cores) but seems nothing changed. It often throw 503 service unavailable error because of overload. Could you please suggest how to handle this error? I guess the windows emulator version works better but we don't have possibility to use it now. – Aquaelia Oct 19 '22 at 16:51
  • I honestly cannot give any suggestion, I don't work on the Emulator nor I know if there are any differences about linux or windows flavors. Remember the emulator runs in your hardware, how much it can process depends on the available resources. If CPU is shooting up, you need to figure out what is driving that, maybe your machine cannot simply process that amount of operations with the emulator. – Matias Quaranta Oct 19 '22 at 17:28
  • @Aquaelia I think the machine resources do play a part in it, but I've also noticed that tweaking the emulator configuration will reduce the frequency of the errors occurring, specifically increasing the partition count and container throughput. See: https://github.com/Azure/azure-cosmos-dotnet-v2/issues/817#issuecomment-927133201 – Aaron Zhong Nov 18 '22 at 05:16