SNAT Port Exhaustion in Sitecore Azure PaaS Instances
When using Deployment Slots as laid out in the Multisite Application Initialization for Zero Downtime Azure Deployments post, we kept seeing unhealthy instances and 500 errors on swap. Digging into this, we found that the swap was not bringing in bad or un-warmed instances; rather, the app was hitting SNAT port exhaustion after the swap.
What is SNAT Port Exhaustion
Per Microsoft:
Source network address translation (SNAT) is used to translate a virtual machine’s private IP into the load balancer’s public IP address. SNAT maps the IP address of the backend to the public IP address of your load balancer. SNAT prevents outside sources from having a direct address to the backend instances.
Source Network Address Translation (SNAT) for outbound connections – Azure Load Balancer | Microsoft Learn
Note that Microsoft treats App Service instances like virtual machines when it comes to SNAT.
By default, each Azure App Service instance is pre-allocated 128 SNAT ports, so an intensive operation such as starting up Sitecore or swapping in multiple instances can lead to SNAT port exhaustion.
Fortunately, there are a few recommendations for handling this scenario.
How to Handle SNAT Port Exhaustion in a Sitecore Azure PaaS Application
Microsoft Azure offers three distinct options for managing SNAT port exhaustion and the 128 SNAT port limit. Which one you choose depends on your use case. In mine, there are several third-party applications that reside outside of Azure, so a NAT Gateway with 64,000 SNAT ports was the best option. Weigh the pros and cons below to decide which solution is right for you.
IMPORTANT: The following is taken from Azure’s Diagnose and Solve Problems for an App Service experiencing SNAT Port Exhaustion
Option 1
Regional VNET integration with service/private endpoints.
If your destination is an Azure service that supports service endpoints, you can avoid SNAT port exhaustion issues by using Regional VNet Integration and service endpoints or private endpoints.
When you use Regional VNet Integration and place service endpoints on the integration subnet, your app outbound traffic to those services will not have outbound SNAT port restrictions. Likewise, if you use Regional VNet Integration and private endpoints, you will not have any outbound SNAT port issues to that destination.
If your site is already in a regional VNET, you do not need to change its configuration. Continue with the next step and configure service/private endpoints for the destination endpoint.
- Regional VNet Integration
- Troubleshooting intermittent outbound connection errors in Azure App Service
Pros
- The application will no longer be restricted by SNAT limitations.
- No code changes or application redeployment required.
- Low cost, low maintenance configuration that can provide quick mitigation while you evaluate code changes.
Cons
- All the endpoints the app connects to must be hosted on Azure.
- Additional configuration is required on dependent services to enable service endpoints.
Option 2
VNET integration with NAT Gateway.
If your destination is hosted outside Azure, you can avoid SNAT port exhaustion issues by using VNet Integration and routing the traffic through a NAT gateway.
As the traffic is routed through the VNET, it is not subject to SNAT limits. Outbound calls to the internet endpoint are made from the NAT gateway; at this point, the connection will use one of the 64K SNAT ports pre-allocated to the NAT gateway.
All the applications that are a part of this VNET and are routing traffic through NAT gateway will share these 64K ports. This gives you a lot more than the default 128 pre-allocated SNAT ports when running on app services without this setup.
If your site is already in a VNET, you do not need to change its VNET configuration. Continue with the next step to set up and configure a NAT gateway and route traffic through it.
- Regional VNet Integration
- Tutorial: Create a NAT gateway using the Azure portal
- NAT Gateway and app integration
- Troubleshooting intermittent outbound connection errors in Azure App Service
Note: If the destination endpoint resolves to a public IP address, you may need to configure the app setting WEBSITE_VNET_ROUTE_ALL=1 on your site to force all traffic through the VNET, and hence, through the NAT gateway.
Pros
- Available SNAT ports are much more than the default 128.
- No code changes or application redeployment required.
- This approach works even if the application is connecting to endpoints hosted outside Azure.
Cons
- Additional cost for VNET and NAT gateway. Consider this an interim mitigation while you evaluate code changes for a more long-term fix.
- Additional configuration is required outside of Azure App Services.
Option 3
Modify application code to avoid SNAT issues.
SNAT port exhaustion is usually the result of an application not reusing existing connections. Creating an outbound connection per request is bad practice and can lead to SNAT port exhaustion, especially if the backend is slow to respond. Check whether the application code follows best practices and uses connection pooling.
Most libraries support connection pooling, so there should rarely be a need to create a new outbound connection per request.
For C# applications, ensure that the HttpClient object is created once and reused. For ADO.NET connections, make sure database connection pooling is used. You may restart the web app in order to reset this limit.
It is recommended to pursue a long-term and stable fix by following connection pooling best practices.
- Improper Instantiation antipattern (Code sample included)
- Call a Web API From a .NET Client (C#)
- SQL Server Connection Pooling (ADO.NET)
- Modify the application to use connection pooling
- Set Limit on Outgoing Connections from HttpClient (.NET Core or .NET Framework)
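The "create HttpClient once and reuse it" advice above can be sketched roughly as follows. This is a minimal illustration, not code from the Azure guidance: the class names SharedHttp and StatusChecker, the method GetStatusCodeAsync, and the 30-second timeout are all assumptions chosen for the example.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

// Hypothetical helper: a single HttpClient shared for the lifetime of the process.
public static class SharedHttp
{
    // Created once. HttpClient request methods are safe to call concurrently,
    // and reusing the instance lets connections be pooled instead of opening
    // a new outbound socket (and SNAT port) for every call.
    public static readonly HttpClient Client = new HttpClient
    {
        Timeout = TimeSpan.FromSeconds(30) // assumed value; tune for your workload
    };
}

// Hypothetical caller showing how the shared client is consumed.
public class StatusChecker
{
    // Anti-pattern, shown only for contrast (do not do this per request):
    //   using (var client = new HttpClient()) { ... }

    public async Task<int> GetStatusCodeAsync(Uri endpoint)
    {
        // Dispose the response, not the client, so the connection can be reused.
        using (HttpResponseMessage response = await SharedHttp.Client.GetAsync(endpoint))
        {
            return (int)response.StatusCode;
        }
    }
}
```

The key design choice is that only the response is disposed after each call; the client itself lives as long as the application does.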
Pros
- Refactoring the code can lead to overall improved application stability, reliability and efficiency.
- The application may be able to support higher load post code refactoring.
- No additional associated cost.
Cons
- Refactoring the code takes time and can be a significant investment.
Great writing. In .NET Framework (non-Core), it is often easier to use the config file and set maxconnection there (it applies to outbound connections only; raise it above 2, maybe even to 200 if CPU allows). ServicePointManager.DefaultConnectionLimit is not bad either, but that means touching the source code and/or recompiling, which is not always an option.
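For readers who can change code, the ServicePointManager route mentioned in the comment above might look like the sketch below. It assumes a .NET Framework application; the class name OutboundConnectionConfig and the method Apply are hypothetical, and the value 200 is simply the figure suggested in the comment, not a tested recommendation.

```csharp
using System;
using System.Net;

// Hypothetical startup helper for a .NET Framework application.
public static class OutboundConnectionConfig
{
    // Sets the default cap on concurrent outbound connections per endpoint.
    // Call this once at startup (for example from Application_Start),
    // before the first outbound request is made.
    public static void Apply(int limit)
    {
        if (limit < 1)
        {
            throw new ArgumentOutOfRangeException(nameof(limit), "Limit must be at least 1.");
        }

        ServicePointManager.DefaultConnectionLimit = limit;
    }
}
```

A call such as OutboundConnectionConfig.Apply(200) would then run once during application startup.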