Scientific and Safe Chaos Engineering (Brian Wilcox)

2020-03-01 · 143 views

  • 1.
  • 2.
  • 3.Steady State
  • 4.Stressed Event Deviation
  • 5.SEO “Reset” Requests Cache Purge Hardware Failure
  • 6.Resilience (Amplitude); Reliability (Frequency)
  • 7.Assertion Change Validation Operation
  • 8.Resilience Engineering: attempts to improve how a system reacts to a stressed state. Chaos Engineering: attempts to prove how a system reacts to a stressed state.
      Photo by Michael Fenton on Unsplash
  • 9.What are you trying to s/(im)?prove/
      Steady state only matters if you can define what is good vs. bad.
      1. Operability 2. User Interactions 3. Durability
      • Releases; Dependency Tree: Standard deployments, Compatibility problems, Unrealized dependencies, Slow pipelines == bad app
      • Regional Error Rates: Noisy Dependencies, Noisy Operation, Invalid Input
      • Supportability/Deprecation; Availability: Unrealized dependencies, Graceful Degradation
      • Consistency: Consensus isn’t free, Natural Disasters, Governmental Instability, Hardware Failure
      • Tooling: Metrics and Alerting Pipeline, AutoRemediation Tools, Fault Detection
  • 10.The Process
      • Establish Steady State
      • Observe bad results
      • Observe good results
      • Consider the control planes
      • Plan for safety
      • Execute the test
      • Record, correct, repeat
      Photo by SpaceX on Unsplash
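
The process on this slide can be read as a loop around a single experiment. The sketch below is one possible reading, not the speaker's tooling: `run_experiment`, `ExperimentResult`, and the `measure`, `inject_fault`, `rollback`, and `tolerance` parameters are all hypothetical names introduced here, and the "system" is a toy dictionary rather than a real service.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    baseline: float      # steady-state metric measured before the fault
    under_stress: float  # same metric measured while the fault is active
    passed: bool         # True if the deviation stayed within tolerance


def run_experiment(
    measure: Callable[[], float],
    inject_fault: Callable[[], None],
    rollback: Callable[[], None],
    tolerance: float,
) -> ExperimentResult:
    """Run one hypothetical chaos experiment and always roll back."""
    baseline = measure()          # establish steady state
    try:
        inject_fault()            # execute the test
        under_stress = measure()  # observe the stressed state
    finally:
        rollback()                # plan for safety: undo the fault no matter what
    passed = abs(under_stress - baseline) <= tolerance
    return ExperimentResult(baseline, under_stress, passed)


# Toy stand-in for a real system: a single availability number (assumption).
state = {"availability": 0.999}
result = run_experiment(
    measure=lambda: state["availability"],
    inject_fault=lambda: state.update(availability=0.95),
    rollback=lambda: state.update(availability=0.999),
    tolerance=0.01,
)
```

Here the injected fault moves availability by more than the tolerance, so `result.passed` is false; that outcome would feed the "record, correct, repeat" step. The `finally` clause is the point of the design: the rollback runs even if measuring under stress raises.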
  • 11.Steady State
      • Availability
      • Number of Units Shipped
      • Rate of Failure
      • Meters under Water
      Photo by Miguel A. Amutio on Unsplash
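
Whichever metric is chosen, steady state becomes testable once it is expressed as a numeric band. The sketch below assumes one simple way to do that, a mean plus or minus `k` standard deviations over baseline samples. The talk does not prescribe this rule; `steady_state_band`, `is_deviation`, and the sample values are illustrative.

```python
import statistics
from typing import Sequence, Tuple


def steady_state_band(samples: Sequence[float], k: float = 3.0) -> Tuple[float, float]:
    """Return (low, high) bounds around the mean of baseline samples."""
    mean = statistics.fmean(samples)
    spread = statistics.pstdev(samples)  # population standard deviation
    return (mean - k * spread, mean + k * spread)


def is_deviation(value: float, band: Tuple[float, float]) -> bool:
    """True if a measurement falls outside the steady-state band."""
    low, high = band
    return not (low <= value <= high)


# Hypothetical availability percentages collected during normal operation.
baseline_samples = [99.9, 99.8, 99.9, 100.0, 99.9]
band = steady_state_band(baseline_samples)
```

With these samples, a reading of 99.9 sits inside the band while 97.0 falls outside it and would count as a deviation.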
  • 12.The Bad
      • What do failures look like?
      • What are the common categories?
      • Do you have a definition of bad?
      • Site Issues? Helpdesk tickets?
      Photo by Hayden Walker on Unsplash
  • 13.The Good
      • Definition of Good (SLA)
      • IR&M Processes
      • How often does the system take care of itself?
      • Lineage-driven fault injection
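
The question "How often does the system take care of itself?" can be turned into a number. A minimal sketch, assuming incidents are recorded as dictionaries with a hypothetical `auto_resolved` flag (the field name and the records are made up for illustration):

```python
from typing import Dict, List, Optional


def self_healing_rate(incidents: List[Dict]) -> Optional[float]:
    """Fraction of incidents resolved without human intervention.

    Returns None when there are no incidents, since the rate is undefined.
    """
    if not incidents:
        return None
    auto = sum(1 for incident in incidents if incident["auto_resolved"])
    return auto / len(incidents)


# Hypothetical incident log.
incidents = [
    {"id": 1, "auto_resolved": True},
    {"id": 2, "auto_resolved": True},
    {"id": 3, "auto_resolved": False},
    {"id": 4, "auto_resolved": True},
]
rate = self_healing_rate(incidents)
```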
  • 14.Control Plane
      • When can your applications make routing decisions?
      • How long can you hold on to a request?
      • How do applications report stress?
      • Where can you effect change?
      • Development lifecycle
      • Operational burden
      • Control of resources
      Photo by Crew on Unsplash
  • 15.Safety Third
      • Minimize the blast radius
      • Have a backup/rollback plan
      • Assume missing information
      • What’s the least you have to do to complete the experiment?
      • Build confidence
      Photo by Pop & Zebra on Unsplash
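
Two of these points, minimizing the blast radius and building confidence, lend themselves to a small sketch: cap the share of hosts an experiment may touch, and widen that share only after a healthy run. Everything named here (`pick_targets`, `next_fraction`, the host names, the 50% ceiling) is an assumption for illustration, not something stated in the talk.

```python
import math
import random
from typing import List


def pick_targets(hosts: List[str], max_fraction: float, rng: random.Random) -> List[str]:
    """Select at most max_fraction of hosts (but at least one) for the experiment."""
    if not hosts:
        return []
    limit = max(1, math.floor(len(hosts) * max_fraction))
    return rng.sample(hosts, limit)


def next_fraction(current: float, healthy: bool, ceiling: float = 0.5) -> float:
    """Double the blast radius after a healthy run; drop to zero otherwise."""
    if not healthy:
        return 0.0  # stop and investigate before trying again
    return min(current * 2, ceiling)


# Hypothetical fleet of eight hosts; a seeded RNG keeps the choice repeatable.
hosts = [f"host-{n}" for n in range(8)]
targets = pick_targets(hosts, max_fraction=0.25, rng=random.Random(0))
```

With eight hosts and a 25% cap, two hosts are selected. A healthy run would raise the next cap to 50%, while an unhealthy one resets it to zero.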
  • 16.Assertion Change Validation Operation
  • 17.
  • 18.Questions?