Scientific and Safe Chaos Engineering
Brian Wilcox
2020-03-01
- 1.
- 2.
- 3.Steady State
- 4.Stressed Event Deviation
- 5.SEO “Reset” Requests, Cache Purge, Hardware Failure
- 6.Resilience (Amplitude), Reliability (Frequency)
- 7.Assertion, Change, Validation, Operation
- 8.Resilience Engineering: attempts to improve how a system reacts to a stressed state. Chaos Engineering: attempts to prove how a system reacts to a stressed state.
- 9.What are you trying to s/(im)?prove/ Steady state only matters if you can define what is good vs. bad.
  1. Operability
  2. User Interactions
  3. Durability
  Releases Dependency Tree
  • Standard deployments
  • Compatibility problems
  • Unrealized dependencies
  • Slow pipelines == bad app
  Regional Error Rates
  • Noisy Dependencies
  • Noisy Operation
  • Invalid Input
  Supportability/Deprecation Availability
  • Unrealized dependencies
  • Graceful Degradation
  Consistency
  • Consensus isn’t free
  • Natural Disasters
  • Governmental Instability
  • Hardware Failure
  Tooling
  • Metrics and Alerting Pipeline
  • AutoRemediation Tools
  • Fault Detection
- 10.The Process
  • Establish Steady State
  • Observe bad results
  • Observe good results
  • Consider the control planes
  • Plan for safety
  • Execute the test
  • Record, correct, repeat
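The steps above can be sketched as a small experiment runner. This is only an illustration under stated assumptions: the `Experiment` class, the `run` function, and every field name are hypothetical and do not come from the talk, and the "system" is a plain dictionary standing in for real metrics.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Experiment:
    """Hypothetical container for one chaos experiment (not from the talk)."""
    name: str
    measure: Callable[[], float]      # reads the steady-state metric
    is_good: Callable[[float], bool]  # the definition of "good" for that metric
    inject: Callable[[], None]        # applies the stress (keep it small)
    rollback: Callable[[], None]      # undoes the stress
    log: List[str] = field(default_factory=list)


def run(exp: Experiment) -> bool:
    """Return True if the system stayed 'good' while stressed."""
    baseline = exp.measure()
    exp.log.append(f"baseline={baseline}")
    if not exp.is_good(baseline):
        # No steady state to compare against, so nothing is injected.
        exp.log.append("aborted: system not in steady state")
        return False
    try:
        exp.inject()
        stressed = exp.measure()
        exp.log.append(f"stressed={stressed}")
        return exp.is_good(stressed)
    finally:
        # Safety: roll back even if measuring raised an error.
        exp.rollback()
        exp.log.append("rolled back")


# Toy usage: the "system" is a dict holding an error rate.
state = {"error_rate": 0.01}
exp = Experiment(
    name="cache purge",
    measure=lambda: state["error_rate"],
    is_good=lambda rate: rate < 0.05,
    inject=lambda: state.update(error_rate=0.20),
    rollback=lambda: state.update(error_rate=0.01),
)
print(run(exp), exp.log)
```

The rollback sits in a `finally` block so that "Plan for safety" holds even when the test itself fails, and the log is the "Record" part of "Record, correct, repeat".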
- 11.Steady State
  • Availability
  • Number of Units Shipped
  • Rate of Failure
  • Meters under Water
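One way to turn a steady-state metric into something testable is to derive a band from baseline samples and flag values that fall outside it. The sketch below assumes a mean plus or minus `k` standard deviations band; that choice, the function names, and the sample numbers are illustrative assumptions, not something the slides prescribe.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def steady_state_band(samples: Sequence[float], k: float = 3.0) -> Tuple[float, float]:
    """Return (low, high) bounds around the mean of the baseline samples."""
    m = mean(samples)
    s = stdev(samples)  # sample standard deviation; needs at least two samples
    return m - k * s, m + k * s


def deviates(value: float, band: Tuple[float, float]) -> bool:
    """True if the value falls outside the steady-state band."""
    low, high = band
    return value < low or value > high


# Made-up baseline: availability in percent over five intervals.
baseline = [99.9, 99.8, 99.9, 100.0, 99.9]
band = steady_state_band(baseline)
print(deviates(99.9, band), deviates(97.0, band))
```

The same check works for any of the metrics listed, as long as enough baseline samples exist to make the band meaningful.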
- 12.The Bad
  • What do failures look like?
  • What are the common categories?
  • Do you have a definition of bad?
  • Site Issues? Helpdesk tickets?
- 13.The Good
  • Definition of Good (SLA)
  • IR&M Processes
  • How often does the system take care of itself?
  • Lineage-driven fault injection
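If "good" is defined by an availability SLA, the target can be converted into a downtime budget that an experiment must not exceed. The sketch below is a plain arithmetic illustration; the function names and the 30-day period are assumptions, not values from the talk.

```python
def allowed_downtime_minutes(target: float, period_days: int = 30) -> float:
    """Minutes of downtime permitted by an availability target given as a fraction."""
    if not 0.0 <= target <= 1.0:
        raise ValueError("target must be a fraction between 0 and 1")
    return (1.0 - target) * period_days * 24 * 60


def within_sla(downtime_minutes: float, target: float, period_days: int = 30) -> bool:
    """True if the observed downtime fits inside the budget."""
    return downtime_minutes <= allowed_downtime_minutes(target, period_days)


# A 99.9% target over 30 days leaves roughly 43 minutes of budget.
print(round(allowed_downtime_minutes(0.999), 1))
print(within_sla(30, 0.999), within_sla(60, 0.999))
```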
- 14.Control Plane
  • When can your applications make routing decisions?
  • How long can you hold on to a request?
  • How do applications report stress?
  • Where can you effect change?
  • Development lifecycle
  • Operational burden
  • Control of resources
- 15.Safety Third
  • Build confidence
  • Minimize the blast radius
  • Have a backup/rollback plan
  • Assume missing information
  • What’s the least you have to do to complete the experiment?
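Minimizing the blast radius can be as simple as capping how many targets an experiment is allowed to touch. The sketch below is one possible approach under assumed names: `pick_targets`, its parameters, and the host names are hypothetical, and a seeded random sample is used only so that a run can be repeated.

```python
import random
from typing import List, Sequence


def pick_targets(hosts: Sequence[str], fraction: float,
                 max_targets: int, seed: int = 0) -> List[str]:
    """Choose a small, capped sample of hosts to stress."""
    if not hosts:
        return []
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    # At least one host, but never more than the hard cap or the fleet size.
    count = max(1, int(len(hosts) * fraction))
    count = min(count, max_targets, len(hosts))
    # A seeded generator makes the selection reproducible for a rerun.
    return random.Random(seed).sample(list(hosts), count)


# Made-up fleet of 100 hosts: 5% would be five hosts, the cap keeps it to three.
fleet = [f"host-{i}" for i in range(100)]
print(pick_targets(fleet, fraction=0.05, max_targets=3))
```

Starting with the smallest selection that can still answer the question matches "What’s the least you have to do to complete the experiment?".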
- 16.Assertion, Change, Validation, Operation
- 17.
- 18.Questions?