
We have several APIs running as Azure App Services and have been experiencing intermittent service outages where 10-25% of the calls to those App Services return a 503.64 while the other calls (often to the same endpoint) return normal responses. None of the 503 failing requests appear in App Insights or in the App Service's web server logs, and we can only see them through tracing and logging from the calling services (App Gateway logging or other App Services making internal REST calls).
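
Since these failures never reach the App Service's own logs, the only record we have comes from the callers. As an illustration, a minimal external probe along the following lines can capture them independently (the endpoint URL and polling interval are placeholders, not our actual configuration):

    # Poll an endpoint and record any non-200 responses with a timestamp,
    # since the failing 503s never appear in App Insights or the web server logs.
    # The URL and interval are placeholders.
    import time
    from datetime import datetime, timezone

    import requests

    ENDPOINT = "https://example-service.azurewebsites.net/Mapping/fw/1.0/Status"  # placeholder
    INTERVAL_SECONDS = 10

    while True:
        timestamp = datetime.now(timezone.utc).isoformat()
        try:
            response = requests.get(ENDPOINT, timeout=30)
            if response.status_code != 200:
                print(f"{timestamp} status={response.status_code}")
        except requests.RequestException as exc:
            print(f"{timestamp} error={exc}")
        time.sleep(INTERVAL_SECONDS)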

We haven't been able to determine a pattern to when this problem occurs. Between disruptions, we have seen as little as an hour and as much as 2-3 days. The issue has lasted anywhere from 2-30 minutes at a time. There is no correlation with traffic or system load, and it happens at any time of day or night. We have seen this on both I1 and I2 tier plans. We have tried scaling out our services well beyond necessary capacity, but the problem still occurred.

The 64 sub-status code suggests a rewrite issue:

    <!-- Antares Rewrite provider codes-->
    <error id ="64" description="Exception in rewrite provider (probably SQL)" />

However, I'm confused by the fact that some of the rewritten calls succeed during the same period that others fail.

For completeness, here is the rewrite for one of the failing services. A public API endpoint is being rewritten to an internal request address, and all other APIs are to remain as-is and pass through to the Controller:

    <rewrite>
        <rules>
            <rule name="LegacyGetStatus">
                <match url="Mapping/fw/1.0/Status(.*)" />
                <action type="Rewrite" url="api/Status{R:1}" />
            </rule>
        </rules>
    </rewrite>
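
To make the intended mapping concrete, here is the same regex applied in a small Python sketch (the sample paths are made up):

    import re

    # Same pattern and substitution as the IIS rule above;
    # {R:1} corresponds to the first capture group.
    def rewrite(path: str) -> str:
        return re.sub(r"Mapping/fw/1.0/Status(.*)", r"api/Status\1", path)

    print(rewrite("Mapping/fw/1.0/Status/12345"))  # -> api/Status/12345
    print(rewrite("Mapping/Other/Endpoint"))       # unmatched paths pass through unchanged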

We submitted a support ticket a week ago, and in the meantime we are trying to figure out what could be causing this and what we can do to mitigate the issue.

  • We have removed our custom rewrite rules completely from one of our services, and the problem still occurred. I have also received some information from Azure support that a 503.64 indicates "SiteNotAvailable. Returned when we couldn't find an entry in main/backup hostname lookup cache for some reason." We are curious what some of those reasons might be. – Origameg Aug 25 '22 at 17:30

1 Answer


The 503.64 response seems to have been a red herring. It appears that required outbound network traffic was being blocked by our network firewall. The advice from Azure was to double-check the settings and allowed traffic against the Azure documentation, or to migrate from App Service Environment v2 to v3, where this traffic appears to be allowed automatically (or is no longer needed).
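
For anyone hitting the same issue, a quick way to sanity-check whether a required outbound dependency is reachable from inside the environment is a plain TCP connect test. The host and port below are placeholders; the actual endpoints and ports to verify are the ones listed in the App Service Environment networking documentation:

    import socket

    # Placeholder dependencies; substitute the outbound endpoints and ports
    # required by your App Service Environment (see the ASE networking docs).
    DEPENDENCIES = [
        ("management.azure.com", 443),  # example only
    ]

    for host, port in DEPENDENCIES:
        try:
            with socket.create_connection((host, port), timeout=5):
                print(f"OK   {host}:{port}")
        except OSError as exc:
            print(f"FAIL {host}:{port} ({exc})")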
