« Back to Index
[High Availability Content and Resilient Systems with Redundancy]
View original Gist on GitHub
Tags: #availability #resilience #tolerance #redundancy
High Availability Content and Resilient Systems with Redundancy.md
- Encourage greater use of existing shared software libraries (e.g. those that implement ‘graceful shutdown’ for in-flight requests).
- Create shared libs supporting resilience patterns (e.g. circuit breakers, back-off, retries).
- Monitor dependencies and identify new releases, notifying service owners/channels (allow frequency control).
- More flexible canary deployments and introduce blue-green deployments.
- Utilize tools such as Chaos Monkey or Gremlin.
- Improve our synthetic testing (e.g. smoke testing).
- Better service isolation (e.g. prevent problems propagating to other parts of our system).
- Create feature flags to help disable broken features within a service quickly.
- Automatic failover to multiple regions † (remove the manual process for app-west).
- Implement some form of ‘adaptive capacity’ adjustment (software and infrastructure).
- Prioritized load shedding (effective caching might be simpler/easier).
- Setup ‘traffic mirroring’ (which can help verify service performance for dark launches).
† Software libraries might need updating to reflect the dynamic nature of the regions they’re interacting with.