[High Availability Content and Resilient Systems with Redundancy]

Tags: #availability #resilience #tolerance #redundancy


Encourage greater use of existing shared software libraries (e.g. those that implement ‘graceful shutdown’ for in-flight requests).
Create shared libraries that support resilience patterns (e.g. circuit breakers, retries with back-off).
Monitor dependencies for new releases and notify service owners/channels (with configurable notification frequency).
Make canary deployments more flexible and introduce blue-green deployments.
Use chaos-engineering tools such as Chaos Monkey or Gremlin.
Improve our synthetic testing (e.g. smoke testing).
Improve service isolation (e.g. prevent problems from propagating to other parts of our system).
Create feature flags to help disable broken features within a service quickly.
Automate failover across multiple regions † (remove the manual process for app-west).
Implement some form of ‘adaptive capacity’ adjustment (software and infrastructure).
Implement prioritized load shedding (although effective caching might be simpler/easier).
Set up ‘traffic mirroring’ (which can help verify service performance for dark launches).
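
To make the resilience patterns mentioned above (circuit breakers, back-off, retries) more concrete, here is a minimal Python sketch of what a shared library could offer. It is an illustration only, not an existing library of ours: the names `CircuitBreaker`, `CircuitOpenError` and `retry_with_backoff`, and all default thresholds and delays, are assumptions chosen for the example.

```python
import random
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected without running."""


class CircuitBreaker:
    """Hypothetical breaker: opens after N consecutive failures, retries after a timeout."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock          # injectable so the behaviour can be tested
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; call rejected")
            # Timeout elapsed: "half-open", so one trial call is let through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def retry_with_backoff(fn, attempts=4, base_delay=0.1, max_delay=2.0,
                       sleep=time.sleep, rng=random.random):
    """Retry fn with capped exponential back-off and random jitter."""
    for attempt in range(attempts):
        try:
            return fn()
        except CircuitOpenError:
            raise  # do not keep retrying a dependency the breaker has given up on
        except Exception:
            if attempt == attempts - 1:
                raise
            # Delay ceiling doubles each attempt; jitter spreads out retrying clients.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            sleep(rng() * ceiling)
```

A caller might combine the two as `retry_with_backoff(lambda: breaker.call(fetch))`, where `fetch` is whatever function talks to the dependency. Retries then absorb brief glitches, while the breaker stops traffic to a dependency that keeps failing, which also supports the service-isolation goal.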

† Software libraries might need updating to reflect the dynamic nature of the regions they’re interacting with.
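
One way a library could reflect that dynamic nature is to look up the available regions on every request instead of hard-coding one. The sketch below is a minimal illustration under stated assumptions: `Region`, `pick_region`, `RegionAwareClient`, the `app-east` region name and the `example.internal` hostname are all hypothetical, and the `discover` callable stands in for whatever health or discovery source is actually used.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Region:
    name: str
    healthy: bool


def pick_region(regions: Iterable[Region]) -> Optional[str]:
    """Return the first healthy region in preference order, or None if there is none."""
    for region in regions:
        if region.healthy:
            return region.name
    return None


class RegionAwareClient:
    """Hypothetical client that re-evaluates region health on each request."""

    def __init__(self, discover: Callable[[], Iterable[Region]]):
        # `discover` is called per request so that failover needs no manual step.
        self._discover = discover

    def endpoint(self) -> str:
        region = pick_region(self._discover())
        if region is None:
            raise RuntimeError("no healthy region available")
        # Placeholder hostname scheme, assumed for illustration only.
        return f"https://{region}.example.internal"
```

With regions listed in preference order (for example `app-west` first, then the hypothetical `app-east`), the client keeps using `app-west` while it is healthy and moves to the next healthy region as soon as the discovery source reports it as unhealthy.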