March 5, 2021 / Sitecore

11 Tips for a More Stable Sitecore PaaS Environment

For those running Sitecore PaaS on Azure App Services, there are some layers of abstraction that are critical for the stability of your environment. While PaaS seems to be getting replaced by Containers, it is still here, and it may be years before you migrate to a new infrastructure platform. Here are 11 tips from the field for a more stable Sitecore PaaS environment:

Contents

1 Disable ARR Affinity (Sticky Sessions)
2 Disable Proactive Auto Heal
3 Enable Health Check
4 Use Application Initialization and Carefully Evaluate Autoscaling
5 Ensure Redis is Properly Scaled (No C1 in Production)
6 Use Elastic Pools
7 Use Local Cache
8 Load Test Prior to Go Live and Prior to Releases
9 Tune Your Sitecore Caches
10 Regularly review App Service Plans, Azure SQL, and Redis Usage
11 Get a WAF and a CDN

Disable ARR Affinity (Sticky Sessions)

If you have multiple instances in an App Service, ARR Affinity will keep each user pinned to a single instance. This is a problem because load is no longer evenly distributed, and overloaded instances could go down as a result (more on instance issues later). In 99% of cases, unless you have some sort of weird cookie-based flow (and if you do, reconsider how you capture data), disable ARR Affinity.
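To make the uneven-load point concrete, here is a minimal sketch in Python. It is a toy model and not how ARR is actually implemented: the session names and request counts are hypothetical, and sticky routing is approximated by pinning each session to one instance by its position in the list.

```python
from itertools import cycle


def sticky_load(sessions, instance_count):
    """Pin every request of a session to one instance (toy stand-in for ARR Affinity)."""
    load = [0] * instance_count
    for index, (_, request_count) in enumerate(sessions):
        load[index % instance_count] += request_count
    return load


def round_robin_load(sessions, instance_count):
    """Hand each individual request to the next instance in turn (affinity disabled)."""
    load = [0] * instance_count
    targets = cycle(range(instance_count))
    for _, request_count in sessions:
        for _ in range(request_count):
            load[next(targets)] += 1
    return load


# Hypothetical traffic: one aggressive crawler plus three ordinary visitors.
sessions = [("crawler", 900), ("visitor-a", 40), ("visitor-b", 30), ("visitor-c", 30)]

print(sticky_load(sessions, 2))       # [930, 70]
print(round_robin_load(sessions, 2))  # [500, 500]
```

In this made-up scenario the sticky variant leaves one instance handling 93% of the requests, while per-request distribution splits them evenly.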

For CDs, determine the correct Session State configuration as well: https://doc.sitecore.com/developers/90/platform-administration-and-architecture/en/scaling-and-configuring-session-state.html

Disable Proactive Auto Heal

The proactive auto heal feature will restart an instance if an App Service stays above 90% of its memory for 30 seconds. For Sitecore apps that may have a long-running process or sit close to that threshold, this is not desirable and may lead to intermittent crashing. Strongly consider disabling proactive auto heal.
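The rule described above can be sketched in a few lines of Python. This is only an illustration of the 90%-for-30-seconds rule as stated here, not Azure's actual implementation; the function name, sample format, and sample values are all hypothetical.

```python
def would_trigger_restart(samples, threshold_percent=90.0, sustained_seconds=30):
    """Return True if memory stays above the threshold for the full sustained window.

    samples: list of (timestamp_seconds, memory_percent) tuples in time order.
    """
    run_start = None
    for timestamp, memory_percent in samples:
        if memory_percent > threshold_percent:
            if run_start is None:
                run_start = timestamp  # first sample of a new above-threshold run
            if timestamp - run_start >= sustained_seconds:
                return True
        else:
            run_start = None  # dipping below the threshold resets the clock
    return False


# Hypothetical samples taken every 10 seconds.
long_job = [(0, 85), (10, 92), (20, 93), (30, 94), (40, 95), (50, 88)]
brief_spike = [(0, 85), (10, 92), (20, 93), (30, 89), (40, 91), (50, 88)]

print(would_trigger_restart(long_job))     # True
print(would_trigger_restart(brief_spike))  # False
```

A long-running Sitecore job that legitimately holds memory high looks exactly like the first series, which is why the restart can arrive mid-process.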

Enable Health Check

Health Check is a nice feature within the App Service that allows you to configure a page to ping (I recommend the keepalive.aspx page). If the page shows unhealthy for a set amount of time (up to 10 minutes), the App Service will replace the instance with a new one.

Importantly, if your issues are code related, this will not fix your site. Lastly, and MOST IMPORTANT, do not enable Health Check unless you have Application Initialization in place… more on that in a moment.

Learn more about how to configure health check in an App Service.

Use Application Initialization and Carefully Evaluate Autoscaling

Application Initialization is a very helpful feature that can speed the initial load of your Sitecore application and offers a remap function. This allows you to place a temporary .htm page for user interaction while Sitecore loads in the background. This is critical as Azure currently places instances into rotation without a mechanism to wait until Sitecore is loaded.

Without application initialization, your users will have a horrible experience of 500 errors until Sitecore is loaded. Using the remap page, while not a great experience, is still better than error pages while new instances are waiting for Sitecore to load.

More about this can be found in my post entitled Autofail: A Big Azure Autoscale Limitation and What To Do

Ensure Redis is Properly Scaled (No C1 in Production)

Redis is a critical component of your session state flow and management. While you may look at a C1 Redis and believe that, because it has plenty of memory or cache space, you are performant… there is an item often missed. C1 Redis only offers low network bandwidth!

This oft-missed aspect is a major bottleneck, so please consider at least a C2 with moderate network bandwidth or greater for your Production environments.

Details on the Redis Cache tiers can be found here.
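A rough back-of-the-envelope estimate shows how bandwidth, rather than memory, can become the limit. The sketch below is Python with entirely hypothetical inputs: it assumes each request reads and then writes the whole session payload, and it does not encode any tier's real bandwidth figure, so compare the result against the numbers Azure publishes for the tier you are considering.

```python
def session_traffic_mbps(requests_per_second, session_size_kb, round_trips_per_request=2):
    """Estimate Redis session traffic in megabits per second.

    Assumes every request moves the full session payload once per round trip
    (one read plus one write by default). A simplification for illustration only.
    """
    kilobits_per_second = requests_per_second * session_size_kb * 8 * round_trips_per_request
    return kilobits_per_second / 1000


# Hypothetical workload: 200 requests per second with a 150 KB session payload.
print(session_traffic_mbps(200, 150))  # 480.0
```

Note that 150 KB per session is tiny next to the cache's memory size, yet the traffic it generates adds up quickly under load.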

Use Elastic Pools

Elastic Pools allow you greater flexibility and a better balance of cost to power for your Azure SQL databases. Please note that certain databases, such as shards with their high I/O, may not be as cost effective to have in an elastic pool, as you would need to make the entire pool Premium tier. The shared commerce database also tends to respond better as a standalone Gen5 type database vs. DTUs.

As for grouping databases, you may want to group XM databases in one pool and XP in another. You can further split this out if you use a subset of XP features like Cortex. Confused about which databases are XM, XP, XC, etc.? Check out this post on the Sitecore 10 Application Roles, Storage Roles, and Indexes.

This article offers a great overview on elastic pools.

Use Local Cache

Microsoft can perform platform updates/upgrades of the Azure platform at any time… without notification. To help protect against this, you can enable Local Cache, which caches your site. It won’t guarantee your site won’t go down, but it will give you a fighting chance.

Learn more about setting up Local Cache here.

Load Test Prior to Go Live and Prior to Releases

Those who load test are less likely to have unplanned outages and long response times. Load testing prior to go live in your Production environment, and using the results to refine your code/configs or upscale infrastructure, is critical to your stability.

For subsequent releases, you should not load test a Production environment that is receiving traffic. Instead, you could upscale UAT to match Production, so long as the code/configs are the same, to identify challenges before they become issues. Post load test, you can scale UAT back down until the next test.

Tune Your Sitecore Caches

Your Sitecore caches are a critical component for your site performance. Routinely check the logs on the CD App Service for Cache Evictions and tune your Sitecore caches accordingly.

Regularly review App Service Plans, Azure SQL, and Redis Usage

Your custom solution residing on top of Sitecore is part of what drives the load on your infrastructure. The other part is the traffic that flows through it. Anytime you have a change in either code/config via a release… or traffic pattern changes, you should evaluate the App Service Plans, Azure SQL, and Redis for any load differences.

If they are running too high (look at both averages and peaks), you may need to evaluate what changes your code introduced and, if the code passes QC, scale the appropriate infrastructure. From a pattern analysis perspective, your 24 hour view of a CD App Service CPU should look like a “sleeping dragon” with a long tail (low traffic off hours), a hump (traffic peak), and a long neck (traffic coming off peak).

If the pattern is pegged high, or your traffic pattern is deviating from the norm, dig into the sources (code/traffic) and either remediate code/configs or scale. Of note, depending on your industry and traffic pattern you may not have a “sleeping dragon” so compare against a month to determine the pattern your CD App Service follows.
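As a minimal sketch of that review, the Python below summarizes one day of hourly CPU readings by average, peak, and time spent high. The readings are made up, and the 80% threshold and the "half the day" definition of pegged are assumptions chosen for illustration, not Azure or Sitecore guidance.

```python
from statistics import mean


def summarize_cpu(hourly_cpu_percent, high_threshold=80.0):
    """Summarize one day of hourly CPU readings for a CD App Service."""
    hours_high = sum(1 for value in hourly_cpu_percent if value >= high_threshold)
    return {
        "average": round(mean(hourly_cpu_percent), 1),
        "peak": max(hourly_cpu_percent),
        "hours_high": hours_high,
        # Assumption: "pegged" means at least half the readings sit at or above the threshold.
        "pegged_high": hours_high >= len(hourly_cpu_percent) / 2,
    }


# Hypothetical "sleeping dragon": quiet overnight tail, daytime hump, gradual taper.
dragon = [10, 10, 10, 10, 10, 10, 20, 30, 45, 60, 70, 75,
          75, 70, 65, 60, 55, 50, 45, 40, 35, 30, 20, 15]
# Hypothetical pegged day: CPU sits near the ceiling regardless of the hour.
pegged = [85] * 20 + [90] * 4

print(summarize_cpu(dragon))
print(summarize_cpu(pegged))
```

Running the same summary over each day of a month gives a baseline to compare against when your traffic does not follow the dragon shape.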

Get a WAF and a CDN

The importance of a Web Application Firewall cannot be overstated. I cannot count the number of sites taken down by bots or unwanted traffic. A bot may not necessarily be bad either, but could just be too aggressive. In any case, any Production environment must be protected by a WAF that filters out attacks and unwanted traffic and only allows valid users.

Of note, many WAFs are combo products that include a Content Delivery Network to offload assets. If your WAF is not a combo product, there are a bevy of great CDNs out there for you to configure Sitecore to work with.