LinkedIn 夏婧姝 - "Automated Capacity Measurement and Performance Bottleneck Analysis Using Live Production Traffic"

2020-02-27

  • 1.Detecting Capacity Limits and Performance Bottlenecks Using Live Traffic • Susie Xia • Jeff Weiner • Christopher Coleman • Chief Executive Officer • 2018 QCon Beijing
  • 2.Agenda: 1. Introduction 2. Meet Redliner 3. Use Cases 4. Future Plans
  • 3.LinkedIn Engagement & Growth: 546M Members • 20M Companies • 14M+ Open Jobs • 29K+ Schools • 11B+ Endorsements
      20% Sessions Growth (YoY): 5th straight quarter of this growth • Record levels of engagement • 60% (YoY) growth in viral actions, such as likes, comments, shares, and messages sent
      200+ Countries & Territories: Available in 24 languages • 70% of members outside the US • 2+ new users join per second
  • 4.Our Dilemma WHY IS SERVER GROWTH OUTPACING PAGE VIEW GROWTH?
  • 5.Over Provisioning 31% Wasted in 2016 • Organic Growth • Unexpected Events • New Products & Features • Emergency Uplifts
  • 6.Motivations • Resource Efficiently • Capacity Plan Effortlessly • Increase Throughput Reliably
  • 7.Challenges • External Interferences • Evolving Product Landscapes • Complex Downstream Dynamics
  • 8.Load Testing Journey: Synthetic Load in Lab • Synthetic Load in Prod • Isolated Host • Record & Replay • Anything Else? Learnings: [one advantage (+) and two drawbacks (-) listed per approach; the overlapping slide text is not recoverable]
  • 9.Goals • Use Live Production Traffic • Minimize Impact to Users • Require Low Operational Overhead
  • 10.Hello, Redliner
  • 11.Workflow: Live Production Traffic → Load Balancer → App Instances. Redliner sends Traffic Shift Requests to the Load Balancer and Health Check Requests to the Service Health Evaluator, which answers PASS / FAIL. The Service Health Evaluator draws on the Metric Collection Framework: • Errors & Error Rates • Latency Percentiles • System Stats
  • 12.Health Evaluations • Variety of health checks evaluated at a fixed interval • Evaluations at the host, cluster, and data center levels • Incorporates signals from the operational alerting system • Performance comparisons between the target and the cluster
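A minimal sketch of the target-versus-cluster comparison described on this slide. The 20% threshold is borrowed from the failure messages on the "Detect and Diagnose Regressions" slide; the function name, the sample values, and the choice of median latency as the only signal are illustrative assumptions, not Redliner's actual implementation.

```python
from statistics import median

def latency_check(target_ms, control_ms, max_change=0.20):
    """Hypothetical health check: PASS while the target's median latency
    stays within max_change of the control group's median latency.

    Returns (passed, relative_change).
    """
    target_median = median(target_ms)
    control_median = median(control_ms)
    change = (target_median - control_median) / control_median
    return change <= max_change, change

# Made-up latency samples: the target host under extra load vs. the
# rest of the cluster acting as the control group.
passed, change = latency_check([120, 130, 150], [100, 105, 110])
print("PASS" if passed else "FAIL", f"{change:.1%}")
```

A real evaluator would combine several such checks (error rates, latency percentiles, system stats) and run them at each evaluation interval.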
  • 13.Health Checks
  • 14.Dynamic Ramping Slow, Steady Ramp Fast, Aggressive Ramp
  • 15.Complete Automation • Manipulation of traffic between nodes in the cluster • Determination of the node’s and service’s health • Identification of potential bottlenecks under stress • Remediation of any issues encountered during test
  • 16.Use Cases
  • 17.1. Find Single Instance Max Throughput • Gradually stresses the service until it cannot safely handle any additional load • Simplifies resource provisioning • Provides starting point for tuning and optimizations
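The "gradually stress until it cannot safely handle more" loop can be sketched as follows. This is a toy model under stated assumptions: `is_healthy` is a hypothetical callback standing in for "shift this much traffic to the target instance and ask the health evaluator for PASS / FAIL", and the multiplicative step is only one possible way to express the slow and fast ramps from the Dynamic Ramping slide.

```python
def find_redline(is_healthy, start_qps=100.0, step=0.10, max_qps=10_000.0):
    """Raise load in multiplicative steps until the health check fails.

    Returns the last load level that passed, or None if even the
    starting load failed. All parameters are illustrative.
    """
    qps, last_good = start_qps, None
    while qps <= max_qps and is_healthy(qps):
        last_good = qps
        qps *= 1 + step
    return last_good

def healthy_up_to_500(qps):
    # Toy service that stays healthy up to 500 QPS.
    return qps <= 500

slow = find_redline(healthy_up_to_500, step=0.05)  # slow, steady ramp
fast = find_redline(healthy_up_to_500, step=0.50)  # fast, aggressive ramp
print(slow, fast)
```

In this model a smaller step lands closer to the true limit at the cost of more iterations, while a larger step finishes sooner but can stop well short of it.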
  • 18.2. Improve Service Throughput • Investigate health check failures from increased traffic • Discover that error rates for APIs “A”, “B”, and “C” jumped • These errors caused API “D” latency to double • Resolve issues one by one • Repeat the Redliner test
  • 19.Before Investigation After Investigation
  • 20.3. Detect and Diagnose Regressions
      Test 1: Date 2017-11-19 09:01:11 • Version v1.0.0 • Redline 2536.33 • Health Check Failures in Latency: N/A
      Test 2: Date 2017-11-19 23:58:09 • Version v1.0.1 • Redline 534.19 • Health Check Failures in Latency: EndpointA: Median latency exceeded 20% change in comparison to control target. EndpointB: Median latency exceeded 20% change in comparison to control target.
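To make the size of this regression concrete, the two redline values can be compared directly: v1.0.1 sustains roughly 79% less load than v1.0.0. The helper name and the 10% tolerance below are assumptions for illustration; the slide does not state how Redliner flags a regression.

```python
def redline_change(baseline, candidate):
    """Relative change in redline throughput from baseline to candidate."""
    return (candidate - baseline) / baseline

# Test 1 (v1.0.0) -> Test 2 (v1.0.1), values taken from the slide.
change = redline_change(2536.33, 534.19)
print(f"{change:.1%}")

# Assumed rule: flag a regression when the redline drops by more than 10%.
is_regression = change < -0.10
```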
  • 21.The Smiley Curve
  • 22.4. A/B Load Testing: Live Requests from Service Clients → Proxy / Load Balancer → Production v1.0.0 (three Service Instances) and Canary v1.0.1 (one Service Instance) • Run Redliner test side-by-side on canary and production versions • Code comparisons • Configuration comparisons • OS comparisons • Security updates
  • 23.A/B Load Test Example • Same load on both canary and prod instances until one or both fail a health check • Prod instance hits health check failure before canary instance • v1.0.1 on canary has better throughput, so deploying the new version is recommended
  • 24.5. Identify Surplus Capacity ([…] marks formula terms lost in extraction) • When […] < […], […] # […] = […] + […] • When […] ≥ […], […] # […] = 1 + […] • If […] # […] > […] # […], the service is over-provisioned.
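The slide's formula is not legible in this copy, so the sketch below does not reconstruct it. It shows one common sizing rule that is consistent with the visible fragments (a "1 +" case and an over-provisioning comparison): divide peak traffic by the single-instance redline, round up, and add headroom. The names `peak_qps`, `redline_qps`, and `headroom`, as well as the rule itself, are assumptions.

```python
import math

def required_instances(peak_qps, redline_qps, headroom=1):
    """Assumed sizing rule: enough instances to serve peak traffic at the
    measured single-instance redline, plus spare instances as headroom."""
    return max(1, math.ceil(peak_qps / redline_qps)) + headroom

def is_over_provisioned(deployed, peak_qps, redline_qps, headroom=1):
    """True when more instances are deployed than the rule asks for."""
    return deployed > required_instances(peak_qps, redline_qps, headroom)

# Made-up numbers: 10,000 QPS at peak, 2,500 QPS redline per instance.
needed = required_instances(10_000, 2_500, headroom=2)
print(needed, is_over_provisioned(10, 10_000, 2_500, headroom=2))
```

When a single instance can already absorb the whole peak, the rule collapses to one instance plus headroom.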
  • 25.Server Cap Ex Trend for Service
  • 26.Future Work
  • 27.1. Dynamic Provisioning • Auto Scaling – Scale predictably to handle natural changes in traffic throughout the day • Efficient Host Packing – Create models for throughput based on resource allocations and deploy most efficient container size
  • 28.2. Simulating Downstream Behavior • Latency – Reproduce peak-hour downstream response times at any time of day • Errors & Failures – Test service behavior when downstream services respond unreliably • Connectivity – Test resiliency and recovery when dependencies are unavailable
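One simple way to picture this future-work item is a wrapper that injects latency and failures around a downstream call. This is a generic fault-injection sketch, not a description of how LinkedIn planned to build it; every name here is hypothetical, and `rng` and `sleep` are injectable only so the behavior can be tested deterministically.

```python
import random
import time

def simulate_downstream(call, added_latency_s=0.0, error_rate=0.0,
                        rng=random.random, sleep=time.sleep):
    """Wrap a downstream call so it responds slower and fails sometimes."""
    def wrapped(*args, **kwargs):
        sleep(added_latency_s)                 # emulate peak-hour latency
        if rng() < error_rate:                 # emulate unreliable results
            raise ConnectionError("injected downstream failure")
        return call(*args, **kwargs)
    return wrapped

def lookup_profile(member_id):
    # Stand-in for a real downstream dependency.
    return {"id": member_id}

# 50 ms of extra latency and a 10% injected failure rate.
flaky_lookup = simulate_downstream(lookup_profile, added_latency_s=0.05,
                                   error_rate=0.10)
```

Setting `error_rate=1.0` approximates the connectivity case, where the dependency is unavailable for every call.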
  • 29.3. Stateful Redlining • Source Node – Storage node to test • Dark Node – Exact replica of source node • Tee Traffic – Copy the live traffic arriving at the source node to the dark node • Multiply Traffic – Generate extra load on dark node based on incoming traffic
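The tee-and-multiply idea can be sketched in a few lines. This is an in-memory illustration under assumptions: `send_source` and `send_dark` are hypothetical callables standing in for the two storage nodes, and discarding the dark node's responses is an assumed design choice that the slide does not state.

```python
def tee_and_multiply(requests, send_source, send_dark, multiplier=1):
    """Serve each live request from the source node and replay a copy to
    the dark node `multiplier` times. Only source responses are returned."""
    responses = []
    for request in requests:
        responses.append(send_source(request))  # real answer for the caller
        for _ in range(multiplier):
            send_dark(request)                  # extra load, result ignored
    return responses

source_log, dark_log = [], []

def send_source(request):
    source_log.append(request)
    return f"ok:{request}"

def send_dark(request):
    dark_log.append(request)

out = tee_and_multiply(["get:k1", "get:k2"], send_source, send_dark,
                       multiplier=3)
print(out, len(dark_log))
```

Because the dark node is a replica, raising `multiplier` increases its load without touching what callers of the source node see.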
  • 30.Key Takeaways
  • 31.Reflection • Don’t Be Afraid of Risk • Prepare for the Surprises • Build Performance Mindset
  • 32.Don’t count servers. Make servers count.
  • 33.Thank you • https://engineering.linkedin.com/blog • chinajobs@linkedin.com