Postmortem report of BH site outage on 3/2

Summary: Site outage of the Behavioral Health (BH) website due to a faulty server node allocated to the Rethink BH app plan

Status: The issue is currently resolved

Description: Detailed timeline of the incident and its context below

Root causes: An abnormally high number of Redis cache hits (over 2M, up from the usual 100K) caused the Rethink Behavioral Health site outage. The root cause was determined to be a faulty MS Azure server added to the Behavioral Health app, which took in most of the site traffic and generated high CPU usage and a high number of Redis cache hits.

Impact: All BH customers were affected

Mitigation: To bypass the faulty node allocation, a new web app plan and app service were created, tested, and launched. In addition, to keep Redis Cache from being a single point of failure, additional safeguards were built in: 1) moving a large amount of heavily accessed data into application in-memory caching to reduce the load on the distributed Redis cache, and 2) an enhanced fallback mechanism that retrieves data from our back-end data store in case of Redis connection problems.
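
The two safeguards above can be illustrated with a minimal sketch. It is not the production implementation: the class name TieredCache, the callables remote_get and store_get, and the 60-second TTL are all assumptions made for illustration. It assumes the Redis read raises Python's built-in ConnectionError on connection problems; a real Redis client may raise its own exception type, which would need to be caught instead.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TieredCache:
    """In-memory cache in front of a distributed cache, with a data-store fallback.

    All names here are hypothetical; this only illustrates the lookup order.
    """

    def __init__(
        self,
        remote_get: Callable[[str], Optional[str]],  # stand-in for a Redis read
        store_get: Callable[[str], str],             # stand-in for a back-end read
        ttl_seconds: float = 60.0,                   # assumed TTL, not from the report
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._remote_get = remote_get
        self._store_get = store_get
        self._ttl = ttl_seconds
        self._clock = clock
        self._local: Dict[str, Tuple[float, str]] = {}  # key -> (expiry, value)

    def get(self, key: str) -> str:
        now = self._clock()
        entry = self._local.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # served from process memory, no Redis call
        try:
            value = self._remote_get(key)
        except ConnectionError:
            value = None  # treat a Redis connection problem as a cache miss
        if value is None:
            value = self._store_get(key)  # fall back to the back-end data store
        self._local[key] = (now + self._ttl, value)
        return value
```

In this sketch, repeated reads of a hot key within the TTL never reach Redis, and a Redis connection error degrades to a back-end read instead of a failed request.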

Takeaways: Tangible actions to be taken going forward include:

Short term:

  1. Add monitoring to detect and alert on faulty nodes added to the web app plan. The goal is to catch and address the issue proactively

  2. Add a failover mechanism for the Redis Cache service, so that degradation in the app service due to Redis triggers a failover to the West coast while the issue is being addressed on the East coast
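
As one possible shape for the monitoring in short-term item 1, the sketch below flags any node that receives a disproportionate share of traffic, which is the symptom described under root causes. The function name find_overloaded_nodes, the per-node request counts as input, and the tolerance of 2.0 are assumptions for illustration, not an existing alert rule.

```python
from typing import Dict, List


def find_overloaded_nodes(request_counts: Dict[str, int], tolerance: float = 2.0) -> List[str]:
    """Return nodes whose traffic share exceeds `tolerance` times an even split.

    `request_counts` maps a (hypothetical) node name to its request count
    over some monitoring window.
    """
    total = sum(request_counts.values())
    if total == 0 or len(request_counts) < 2:
        return []  # nothing to compare against
    fair_share = 1.0 / len(request_counts)
    return sorted(
        node
        for node, count in request_counts.items()
        if count / total > tolerance * fair_share
    )
```

For example, find_overloaded_nodes({"node-a": 900, "node-b": 50, "node-c": 50}) returns ["node-a"], since node-a carries 90% of the traffic against an even share of about 33%. With only two nodes the default tolerance can never be exceeded, so a lower value would be needed there.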

Long term:

  1. Identify any additional single points of failure, reduce dependency on them, and add fallbacks platform-wide