Sitecore Troubleshooting 101: How to Get Out of Trouble
Ever struggle with issues while working in the Sitecore Platform and not know where to start? Who hasn’t! But you may be surprised to learn that most issues are not caused by Sitecore itself. In fact, during my tenure I have found that 60-80% of the issues in a Sitecore environment are code, content, or config based. The rest tend to be related to the environment setup and networking, and only after those are ruled out is there perhaps the need for a Sitecore Support ticket.
In this post we will review some of the common troubleshooting scenarios, laugh/cry/hide from the weird things I have seen in the field, evaluate using a 12 Step Runbook for troubleshooting, consider how to prevent issues where possible, and look at some helpful tools/methods for keeping a Sitecore environment running strong.
Common Sitecore Troubleshooting Scenarios
Here are some common troubleshooting scenarios and what caused them. This is not a complete list of scenarios and causes, but may give you a clue of where to start looking.
Slow loading on startup: Improve Sitecore cache tuning and use a warmup script
Slow page load: Check for heavy or bad code, bots, low resources, Sitecore patches that need to be applied, cache evictions, session state settings/backend, marketing campaigns running
Page won’t load: Check the network, servers, and logs in the environment (infrastructure, networking, and Sitecore logs)
Assets on page won’t load when using CDN: Check if the cache is clearing and review the CDN configuration
Custom Solr cores not rebuilding: Check whether your Sitecore configuration is incorrect and whether you are using rebuild cores, then check the health of SolrCloud and the ZooKeeper nodes
High AcquireRequestState: Check your Session State provider and tune as appropriate
NULL errors: Review your templates to ensure they are published, and check for missing items or a bad code release
Intermittent Solr results: Are you using SwitchOnRebuild? Are you using Primary/Secondary nodes instead of SolrCloud?
Intermittent errors depending on CD: Do you have sticky sessions enabled and is the same code base on all your CDs?
CMS publishing slow: Are you publishing multiple languages? Are you using the Publishing Service? Do you have workflows and publishing governance to manage the amount that gets published at a time?
xConnect errors and no analytics: Check for any networking/connectivity issues and verify the SSL certificates
Composable components failing: Check for networking issues, review the logs on the composable component, and look for issues with the connector(s) or licensing
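The warmup script mentioned for slow startup can be as simple as requesting a list of key pages after a deployment or app pool recycle, so the first real visitor does not pay the startup cost. Here is a minimal sketch in Python. The function names, the timeout, and the URLs you would pass in are my own placeholders, not anything Sitecore ships, so adapt them to your own site.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, Iterable, List, Optional, Tuple


def fetch_status(url: str, timeout: float = 120.0) -> int:
    """Request a URL and return its HTTP status code.

    The long timeout is an assumption: a cold start can take a while.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # 4xx/5xx responses still tell us the site answered.
        return error.code


def warm_up(
    urls: Iterable[str],
    fetch: Callable[[str], int] = fetch_status,
) -> List[Tuple[str, Optional[int], float]]:
    """Hit each URL once and record (url, status, seconds taken).

    A status of None means the request failed entirely (DNS, timeout, etc.).
    """
    results = []
    for url in urls:
        started = time.perf_counter()
        try:
            status: Optional[int] = fetch(url)
        except Exception:
            status = None
        elapsed = time.perf_counter() - started
        results.append((url, status, elapsed))
    return results
```

You would call warm_up with the home page and a handful of your heaviest pages on each CD, then review any entry with a missing or non-200 status before putting the server back into the load balancer.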
Weird Things From The Field
Beyond the common comes the uncommon! Here are some of the most interesting troubleshooting issues I have come across, along with their root causes.
Intermittent failures for some users in PaaS Web Apps: Redeploy as Azure did not properly apply code across instances and sticky sessions were in use
Bot-like traffic: Client was DDoSing themselves by having hacking reward competitions
Site showing no traffic but visiting the site gave a horrible response time: Site was redirecting to an old site
Site “magically showing errors” without “any changes”: Client had a teammate deploying code, but told no one that they were
CD not loading despite being load tested and fine in QA: The CMS role was deployed over the CD
Unexplained load every Monday on the CD: Someone enabled the Sitecore Admin login on the CD Server instead of using the CMS, so editors were using the CD
Site stopping every night but then coming back in the morning: Someone scheduled an Azure Automation job to shut down Dev each night, but ran it against Prod instead
Solr rebuilt but old data on site depending on user: CDs hit a load balancer in front of a Primary/Secondary Solr environment, and when hitting the secondary it was not looking at the rebuilt index due to a Solr limitation. SolrCloud resolved this
Site showing errors using Sitecore Commerce: Client deployed wrong version of Commerce engine and used custom DLLs
The Sitecore 12 Step Program to Recovery (Runbook)
When issues like those in the two sections above strike, it’s imperative to follow a runbook to quickly identify and resolve them. Here are 12 steps to consider when troubleshooting a Sitecore environment.
1. What changed recently?
2. Is there a clear error code on the site (whether infrastructure or Sitecore)?
3. Was there a recent deployment, networking change, large publishing, or spike in traffic?
4. What is in the APM tool and Sitecore logs?
5. How is the performance of the front-end servers or Web Apps?
6. How is the performance of the Application servers, Web Apps, or Composable components (ex. Sitecore Search, Sitecore Personalize, xConnect/Reporting tied to the front end)?
7. How is the performance of the sessionState provider and is it properly configured?
8. How is the performance of the database backend?
9. How is the performance of Solr or the search provider and is it properly configured?
10. How is the performance of the networking/edge devices (Outages or certificates expired/matching thumbprints)?
11. Is there any unexpected or unwanted traffic?
12. Are you sure there were no recent deployments or large publishing? If so, does this require a Sitecore Inc. Support review for potential hotfixes?
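For step 4, a quick first pass over a Sitecore log can tell you whether errors are piling up and which message dominates. Here is a rough sketch in Python. It assumes each log entry starts with a thread name or id, then a HH:MM:SS time, then a level such as INFO, WARN, or ERROR; if your logging pattern has been customized, the regular expression will need to change. The function name and the returned structure are my own choices for illustration.

```python
import re
from collections import Counter
from typing import Dict, Iterable

# Assumed entry layout: "<thread> HH:MM:SS LEVEL message".
# Continuation lines (stack traces etc.) do not match and are skipped.
LOG_LINE = re.compile(
    r"^(?P<thread>.+?)\s+(?P<time>\d{2}:\d{2}:\d{2})\s+"
    r"(?P<level>DEBUG|INFO|WARN|ERROR|FATAL)\s+(?P<message>.*)$"
)


def summarize_log(lines: Iterable[str], top: int = 5) -> Dict[str, object]:
    """Count entries per level and list the most frequent ERROR/FATAL messages."""
    levels: Counter = Counter()
    errors: Counter = Counter()
    for line in lines:
        match = LOG_LINE.match(line.rstrip("\n"))
        if match is None:
            continue
        level = match.group("level")
        levels[level] += 1
        if level in ("ERROR", "FATAL"):
            errors[match.group("message").strip()] += 1
    return {"levels": dict(levels), "top_errors": errors.most_common(top)}
```

Passing in the lines of the current log file gives you a count per level and the handful of error messages worth chasing first. It is not a replacement for an APM tool, just a fast way to decide where to look next.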
Preventing Problems Before They Happen
Here are some helpful tips to prevent problems before they happen. This is not a comprehensive list, but should get you thinking about point to point governance and DevOps to identify and solve problems before they end up in Production.
Use source control per Sitecore role (alternatively, use include files that target a role with role:require, which may actually be preferred)
Establish Clear DevOps and Governance (also prepares you for AKS/PaaS/SaaS)
Load test, load test, load test in a lower environment of like size and features
Use a WAF (Web Application Firewall) and CDN
Implement publishing governance and workflows
Got a bad deployment? Roll back and don’t try to fix it in Production (this requires a rollback plan)
Properly scale your infrastructure and edge devices
Check your certificates and any third-party integrations
Don’t overwrite databases; use content syncing instead
Soft launch and load test Production before going live for the first time
Adhere to Sitecore best practices with proper configuration instead of getting “creative”
Use include files and do not overwrite default DLLs
Use linear Stage Gates in your DevOps to check before promoting
Use automated and manual QA and a clearly defined UAT scope
Get a DR environment/plan and test failovers on a regular basis
Don’t deploy on Friday!
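The "check your certificates" tip above is easy to automate, since expired certificates are one of the more avoidable outages. Here is a small sketch in Python using only the standard library. The function names are mine, and the host names you would check (your site, xConnect, Solr, and so on) are up to you.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> int:
    """Whole days between `now` and a certificate's notAfter value.

    `not_after` uses the format returned by getpeercert(),
    for example "Jan 15 00:00:00 2030 GMT". Negative means already expired.
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).days


def fetch_not_after(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Connect to host:port over TLS and return the certificate's notAfter.

    With the default context, an expired or untrusted certificate makes the
    handshake fail with an ssl.SSLError, which is itself a useful signal.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

Run on a schedule, something like days_until_expiry(fetch_not_after(host), datetime.now(timezone.utc)) compared against a threshold of your choosing (30 days is a common one) gives you a warning well before the site goes down.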
Favorite Tools and Methods
Over the years, I have found some helpful tools/methods and hope these may help you in your Sitecore management and troubleshooting needs.
Follow the traffic path method when troubleshooting (start evaluating an issue from the front/edge devices and then proceed to backend applications and database layers)
Built-in Sitecore Admin Tools available via the CMS
Datadog, New Relic, Azure Application Insights APM (not the logging), or comparable APM
Load testing tools with intelligence (ex. BlazeMeter, Loadster, SmartBear, etc.)
GTMetrix, PageSpeed Insights, and Google Lighthouse
Azure DevOps, Datadog, and Custom Azure Dashboards
Content Sync/DB Compare (Sidekick, Razl, Unicorn Sync, OpenDBDiff, etc.)
Sitecore community/experts
Sitecore KB (Search for Known Issues by Sitecore Version: https://support.sitecore.com/kb)
All else failing, Sitecore Support ticket
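The traffic path method from the top of this list can also be written down as an ordered set of checks, which keeps a stressed team from jumping straight to the database. Here is a minimal sketch in Python. The layer names in the usage note and the idea of one yes/no check per layer are simplifications of mine; in practice each check would be a real probe against your WAF, CDN, CDs, session store, Solr, or SQL.

```python
from typing import Callable, List, Optional, Tuple

# A check returns True when the layer looks healthy.
Check = Callable[[], bool]


def first_failing_layer(layers: List[Tuple[str, Check]]) -> Optional[str]:
    """Walk the layers in traffic order (edge first, database last).

    Returns the name of the first layer whose check fails or raises,
    or None when every layer passes.
    """
    for name, check in layers:
        try:
            healthy = check()
        except Exception:
            # A probe that cannot even run counts as a failure for that layer.
            healthy = False
        if not healthy:
            return name
    return None
```

For example, a list such as [("CDN/WAF", check_edge), ("CD", check_cd), ("Session state", check_session), ("Solr", check_solr), ("SQL", check_sql)], where each check_ function is something you write yourself, would report the outermost layer that is unhealthy and tell you where to start digging.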