Java at Speed (QCon, March 2018)

2020-02-27

  • 1.©2017 Azul Systems, Inc.
  • 2.Java at Speed: getting the most out of modern hardware. Gil Tene, CTO & co-Founder, Azul Systems
  • 3.High level agenda
      • Intro & Motivation
      • Some hardware trends and new features
      • Some compiler stuff
      • A microbenchmark detour
      • Some more compiler stuff
      • Warmup, and what we can do about it
      • Putting it all together (and maybe some bragging)
  • 4.About me: Gil Tene
      • Co-founder, CTO @Azul Systems
      • Have been working on “think different” GC and runtime approaches since 2002
      • A long history building Virtual & Physical Machines, Operating Systems, Enterprise apps, etc…
      • At Azul we make JVMs that dramatically improve response time and latency behaviors
      • I also depress people by demonstrating how terribly wrong their latency measurements are…
      • * working on real-world trash compaction issues, circa 2004
  • 5.Speed What is it good for?
  • 6.Are you fast?
  • 7.Are you fast when new code rolls out?
  • 8.Are you fast when it matters?
  • 9.Are you fast at Market Open?
  • 10.Are you fast when you actually trade?
  • 11.Are you reliably fast?
  • 12.What do you mean by “fast”? [Chart: Speed vs. Time, curve marked “??”]
  • 13.What do you mean by “fast”? [Chart: Speed vs. Time]
  • 14.Speed in the Java world…
  • 15.Code distribution (by optimization level) [Chart: fraction of code (0 to 1) that is Interpreted, Tier 1 (profiling), or Optimized, plotted over a 0 to 100 horizontal axis]
  • 16.Response time (with contribution by optimization level) [Chart: response time (0 to 25) over a 0 to 100 horizontal axis, split into Interpreted, Tier 1 (profiling), Optimized, and GC Pause]
  • 17.Speed (with contribution by optimization level) [Chart: speed (0 to 1.4) over a 0 to 100 horizontal axis, split into Interpreted, Tier 1 (profiling), and Optimized]
  • 18.Some notes on modern servers
  • 19.
      Code name         Model          Intro date    New features       Cores/chip
      Nehalem EP        Xeon 5500      March 2009                       4
      Westmere EP       Xeon 5600      June 2010                        6
      Sandy Bridge EP   E5-2600        March 2012    AVX                8
      Ivy Bridge EP     E5-2600 V2     Sep. 2013                        12
      Haswell EP        E5-2600 V3     Sep. 2014     AVX2, BMI, BMI2    18
      Broadwell EP      E5-2600 V4     March 2016    TSX, HLE           22
      Skylake SP        Silver/Gold/…  July 2017     AVX512             32
  • 20.
  • 21.
  • 22.
  • 23.
  • 24.Caches
      • L1: 32KB I$, 32KB D$
      • L2: 256KB
      • L3: ~2.5MB/core shared L3 (up to 55MB per socket)
    TLBs
      • 2MB page support improved with E5-v3 & E5-v4
      • L2 DTLB used to be 4KB page only. Now 4KB/2MB
      • 1024 entry 4KB/2MB L2 in E5-v3. 1536 entry in E5-v4
  • 25.System topology: 2 sockets
  • 26.
  • 27.
  • 28.
  • 29.
  • 30.
  • 31.
  • 32.Some machine code zoom-in
  • 33.
  • 34.
  • 35.A simple array summing loop
  • 36.
  • 37.
  • 38.This is on X5690 (Westmere). Uses SSE (128-bit)
  • 39.This is on E5-2690 v4 (Broadwell). Uses AVX2 (256-bit)
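The loop on these slides was shown as an image and did not survive extraction. As a rough stand-in, a simple array summing loop in Java might look like the sketch below. The class and method names are hypothetical, and whether a given JIT actually emits SSE or AVX2 code for it depends on the JVM and the hardware it runs on.

```java
// Hypothetical stand-in for the slide's loop (the original code was an image).
final class ArraySum {
    // A plain counted loop over an int[]: the kind of shape a JIT may auto-vectorize.
    static int sum(int[] a) {
        int total = 0;
        for (int i = 0; i < a.length; i++) {
            total += a[i];
        }
        return total;
    }

    public static void main(String[] args) {
        int[] data = new int[1024];
        for (int i = 0; i < data.length; i++) {
            data[i] = i;
        }
        System.out.println(sum(data));
    }
}
```

The source code is identical on both machines; only the machine code the JIT generates differs.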
  • 40.A conditional array cell addition loop
  • 41.
  • 42.Traditional JVM JITs: per-element jumps, 2 elements per iteration
  • 43.This is on E5-2690 v4 (Broadwell). Vectorized with AVX2, 32 elements per iteration
  • 44.This is on Skylake SP. Vectorized with AVX512, 64 elements per iteration
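The conditional loop itself was also an image. One plausible shape for "a conditional array cell addition loop" is sketched below; the method name, the threshold test, and the int element type are assumptions made for illustration, not the code from the talk.

```java
// Hypothetical sketch: the per-element condition is what forces "per-element jumps"
// unless the JIT can turn the branch into masked vector operations.
final class ConditionalAdd {
    static void addWhereGreater(int[] a, int threshold, int delta) {
        for (int i = 0; i < a.length; i++) {
            if (a[i] > threshold) {
                a[i] += delta;
            }
        }
    }

    public static void main(String[] args) {
        int[] cells = {1, 5, 10, 20};
        addWhereGreater(cells, 5, 100);
        System.out.println(java.util.Arrays.toString(cells));
    }
}
```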
  • 45.Better JIT’ing is basically about speed [Chart: Speed (with contribution by optimization level) over a 0 to 100 horizontal axis, split into Interpreted, Tier 1 (profiling), and Optimized, with “Improved optimization” marked]
  • 46.Compiler Stuff
  • 47.Some simple compiler tricks
  • 48.Code can be reordered...
      int doMath(int x, int y, int z) {
        int a = x + y;
        int b = x - y;
        int c = z + x;
        return a + b;
      }
    Can be reordered to:
      int doMath(int x, int y, int z) {
        int c = z + x;
        int b = x - y;
        int a = x + y;
        return a + b;
      }
  • 49.Dead code can be removed
      int doMath(int x, int y, int z) {
        int a = x + y;
        int b = x - y;
        int c = z + x;
        return a + b;
      }
    Can be reduced to:
      int doMath(int x, int y, int z) {
        int a = x + y;
        int b = x - y;
        return a + b;
      }
  • 50.Values can be propagated
      int doMath(int x, int y, int z) {
        int a = x + y;
        int b = x - y;
        int c = z + x;
        return a + b;
      }
    Can be reduced to:
      int doMath(int x, int y, int z) {
        return x + y + x - y;
      }
  • 51.Math can be simplified
      int doMath(int x, int y, int z) {
        int a = x + y;
        int b = x - y;
        int c = z + x;
        return a + b;
      }
    Can be reduced to:
      int doMath(int x, int y, int z) {
        return x + x;
      }
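To make the chain of reductions concrete, the sketch below puts the original doMath next to its fully simplified form and compares them on a few inputs. The class name and the sample values are illustrative. Because Java int arithmetic wraps modulo 2^32, the two forms agree even when x + y overflows.

```java
// Sketch: the slide's doMath next to the form a compiler may reduce it to.
final class DoMathDemo {
    static int doMath(int x, int y, int z) {
        int a = x + y;
        int b = x - y;
        int c = z + x; // dead: never read afterwards
        return a + b;
    }

    static int doMathSimplified(int x, int y, int z) {
        return x + x;
    }

    public static void main(String[] args) {
        int[][] samples = { {1, 2, 3}, {-7, 40, 0}, {Integer.MAX_VALUE, 5, 9} };
        for (int[] s : samples) {
            boolean same = doMath(s[0], s[1], s[2]) == doMathSimplified(s[0], s[1], s[2]);
            System.out.println(same);
        }
    }
}
```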
  • 52.Some more compiler tricks
  • 53.Propagation can affect flow: constants can be propagated to pre-compute results
      int computeBias() {
        int bias, val = 5;
        if (val > 10) {
          bias = computeComplicatedBias(val);
        } else {
          bias = 1;
        }
        return bias;
      }
    Can be reduced to:
      int computeBias() {
        return 1;
      }
  • 54.Reads can be cached
      class Point {
        int x, y;
      }
      int distanceRatio(Point a) {
        int distanceTo = a.x - start;
        int distanceAfter = end - a.x;
        return distanceTo/distanceAfter;
      }
    Is (semantically) the same as
      int distanceRatio(Point a) {
        int x = a.x;
        int distanceTo = x - start;
        int distanceAfter = end - x;
        return distanceTo/distanceAfter;
      }
  • 55.Reads can be cached
      class Trigger {
        boolean flag;
      }
      void loopUntilFlagSet(Trigger a) {
        while (!a.flag) {
          loopcount++;
        }
      }
    Is the same as:
      void loopUntilFlagSet(Trigger a) {
        boolean flagIsSet = a.flag;
        while (!flagIsSet) {
          loopcount++;
        }
      }
    That’s what volatile is for...
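A minimal sketch of the volatile fix, assuming the flag is set from another thread: declaring the field volatile means each loop iteration must perform a real read, so the compiler may not hoist it out of the loop. The demo class, the sleep duration, and the join timeout are illustrative choices.

```java
final class Trigger {
    volatile boolean flag; // volatile: the read inside the loop cannot be cached
    long loopCount;

    void loopUntilFlagSet() {
        while (!flag) {
            loopCount++;
        }
    }
}

final class VolatileFlagDemo {
    // Returns true if the spinning thread observed the flag and stopped.
    static boolean runOnce() {
        Trigger t = new Trigger();
        Thread spinner = new Thread(t::loopUntilFlagSet);
        spinner.setDaemon(true); // do not keep the JVM alive if something goes wrong
        spinner.start();
        try {
            Thread.sleep(10);    // let the loop spin for a moment
            t.flag = true;       // volatile write, visible to the spinning thread
            spinner.join(5000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return !spinner.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("spinner stopped: " + runOnce());
    }
}
```

Without volatile, the hoisted form on the slide is a legal compilation, and the loop may never terminate.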
  • 56.Writes can be eliminated: intermediate values might never be visible
      void updateDistance(Point a) {
        int distance = 100;
        a.x = distance;
        a.x = distance * 2;
        a.x = distance * 3;
      }
    Is the same as
      void updateDistance(Point a) {
        a.x = 300;
      }
  • 57.Writes can be eliminated: intermediate values might never be visible
      void updateDistance(SomeObject a) {
        a.visibleValue = 0;
        for (int i = 0; i < 1000000; i++) {
          a.internalValue = i;
        }
        a.visibleValue = a.internalValue;
      }
    Is the same as (the last value the loop stores is 999999)
      void updateDistance(SomeObject a) {
        a.internalValue = 999999;
        a.visibleValue = 999999;
      }
  • 58.Inlining...
      public class Thing {
        private int x;
        public final int getX() { return x; }
      }
      ...
      myX = thing.getX();
    Is the same as
      class Thing {
        int x;
      }
      ...
      myX = thing.x;
  • 59.Inlining is very powerful: inlining exposes other optimizations
      int computeBias(int val) {
        int bias;
        if (val > 10) {
          bias = computeComplicatedBias(val);
        } else {
          bias = 1;
        }
        return bias;
      }
      …
      myBias = computeBias(5);
    Can be reduced to:
      myBias = 1;
  • 60.A uBenchmark sidetrack
  • 61.A simple loop uBenchmark (0): Turns out this is “really fast”. As in: when count = 1,000,000 we complete ~500,000,000 calls per second (for ~500,000,000,000,000 iterations/sec)
  • 62.A simple loop uBenchmark (1): Still “impossibly fast”. It’s all “provably dead code”. Compiler translates the method to a no-op
  • 63.A simple loop uBenchmark (2): Better? No. Still “impossibly fast”. Compiler returns count. No loop.
  • 64.A simple loop uBenchmark (3): Better? Depends. On HotSpot and Zing C2, yes. But Zing’s new Falcon compiler is smart enough to recognize arithmetic series
  • 65.A simple loop uBenchmark (4): How about this? Zing’s Falcon will even figure out this one. (it returns zero)
  • 66.A simple loop uBenchmark (5): Seems to be complicated enough to defeat *current* compilers…
  • 67.uBenchmarking Takeaways
      • uBenchmarking is “hard”. As in “very tricky”
      • You may not be measuring what you think you are
      • “Trickiness” can change over time, between versions
      • Sanity check EVERYTHING
      • Use jmh
      • Use jmh
      • Use jmh
      • And even then, suspect everything
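The benchmark loops on the preceding slides were images, so the code below is a hypothetical reconstruction of the loop shapes they describe, not the code from the talk. It shows a loop whose result is unused, a loop that merely counts, and a loop that sums an arithmetic series, next to the closed form a sufficiently smart compiler could substitute for the last one.

```java
// Hypothetical reconstructions of the loop shapes discussed above.
final class LoopShapes {
    // No result and no side effects: provably dead, may be compiled to a no-op.
    static void deadLoop(int count) {
        for (int i = 0; i < count; i++) {
            // intentionally empty
        }
    }

    // Counts iterations: for count >= 0 a compiler may simply return count.
    static int countingLoop(int count) {
        int sum = 0;
        for (int i = 0; i < count; i++) {
            sum++;
        }
        return sum;
    }

    // Sums 0 + 1 + ... + (count - 1): an arithmetic series.
    static long seriesLoop(int count) {
        long sum = 0;
        for (int i = 0; i < count; i++) {
            sum += i;
        }
        return sum;
    }

    // Closed form of seriesLoop: no loop needed at all.
    static long seriesClosedForm(int count) {
        return count <= 0 ? 0L : (long) count * (count - 1) / 2;
    }

    public static void main(String[] args) {
        deadLoop(1_000_000);
        System.out.println(countingLoop(1_000_000));
        System.out.println(seriesLoop(1_000_000) == seriesClosedForm(1_000_000));
    }
}
```

Timing any of these by hand with System.nanoTime() would mostly measure how much of the loop the compiler removed, which is the reason the takeaways point to jmh.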
  • 68.Back to compiler stuff
  • 69.Speculative compiler tricks: JIT compilers can do things that static compilers can have a hard time with…
  • 70.Untaken path example: “Never taken” paths can be optimized away with benefits
      int computeMagnitude(int val) {
        int bias;
        if (val > 10) {
          bias = computeBias(val);
        } else {
          bias = 1;
        }
        return (int) Math.log10(bias + 99);
      }
    When all values so far were <= 10, could be compiled to:
      int computeMagnitude(int val) {
        if (val > 10) uncommonTrap();
        return 2;
      }
  • 71.Implicit Null Check example
      • All field and array access in Java is null checked
      • x = foo.x; is (in equivalent required machine code):
          if (foo == null) throw new NullPointerException();
          x = foo.x;
      • But compiler can “hope” for non-nulls, and handle SEGV:
          x = foo.x;
      • This is faster *IF* no nulls are encountered…
  • 72.Class Hierarchy Analysis (CHA)
      • Can perform global analysis on currently loaded code
      • Deduce stuff about inheritance, method overrides, etc.
      • Can make optimization decisions based on assumptions
      • Re-evaluate assumptions when loading new classes
      • Throw away code that conflicts with assumptions before class loading makes them invalid
  • 73.Inlining works without “final”
      public class Animal {
        private int color;
        public int getColor() { return color; }
      }
      ...
      myColor = animal.getColor();
    Is the same as
      class Animal {
        int color;
      }
      ...
      myColor = animal.color;
    As long as only one implementer of getColor() exists.
    *THIS* (CHA) is why Java field accessors are free & clean
  • 74.Inlining monomorphic sites
      public class Animal {
        private int color;
        public int getColor() { return color; }
      }
      ...
      myColor = animal.getColor();
    Can be converted to:
      ...
      if (animal.type != Dog) uncommonTrap();
      myColor = animal.color;
    Even if we have multiple conflicting implementors…
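The guarded form above is machine-level pseudocode, but its logic can be emulated in plain Java. In the sketch below, the Dog and Chameleon subclasses, the trap counter, and the package-private color field are assumptions made so the example compiles on its own. The fallback virtual call stands in for uncommonTrap(), which in a real JVM would deoptimize the compiled code instead.

```java
class Animal {
    int color; // package-private in this sketch so the "inlined" read below compiles
    Animal(int color) { this.color = color; }
    public int getColor() { return color; }
}

class Dog extends Animal {
    Dog(int color) { super(color); }
}

class Chameleon extends Animal {
    Chameleon(int color) { super(color); }
    @Override public int getColor() { return 42; } // a conflicting implementation
}

final class GuardedInlineDemo {
    static int uncommonTraps = 0;

    // Emulates a call site that has only ever seen Dog receivers.
    static int colorOf(Animal animal) {
        if (animal.getClass() != Dog.class) {
            uncommonTraps++;          // stand-in for uncommonTrap()
            return animal.getColor(); // slow path: ordinary virtual dispatch
        }
        return animal.color;          // fast path: the accessor body, inlined
    }

    public static void main(String[] args) {
        System.out.println(colorOf(new Dog(3)));
        System.out.println(colorOf(new Chameleon(3)));
        System.out.println("traps taken: " + uncommonTraps);
    }
}
```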
  • 75.Deoptimization
  • 76.Code distribution (by optimization level) [Chart: fraction of code (0 to 1) that is Interpreted, Tier 1 (profiling), or Optimized, plotted over a 0 to 100 horizontal axis]
  • 77.Deoptimization: Adaptive compilation is… adaptive
      • Micro-benchmarking is a black art
      • So is the art of the Warmup
      • Running code long enough to compile is just the start…
      • Deoptimizations can occur at any time
      • They often occur after you *think* the code is warmed up
      • Many potential causes
  • 78.Warmup often doesn’t cut it…
      • Common example:
          • Trading system wants to have the first trade be fast
          • So run 20,000 “fake” messages through the system to warm up
          • Let JIT compilers optimize, learn, and deopt before actual trades
      • But…
          • Code is written to do different things “if this is a fake message”
          • e.g. “Don’t send to the exchange if this is a fake message”
      • What really happens
          • JITs optimize for fake path, including speculatively assuming “fake”
          • First real message through causes a deopt...
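The warmup trap can be sketched with a toy message handler. Everything here (the OrderHandler class, its counters, the boolean fake parameter) is hypothetical and only illustrates the shape of the problem: 20,000 fake messages exercise the shared path but never reach the send path, so a JIT has no profile for it and may have compiled the method on the assumption that fake is always true.

```java
// Hypothetical handler: warmup traffic never reaches sendToExchange().
final class OrderHandler {
    int validated;
    int sentToExchange;

    void handle(String message, boolean fake) {
        validated++;             // shared work: exercised by warmup and real traffic
        if (fake) {
            return;              // every warmup message leaves here
        }
        sendToExchange(message); // first executed by the first real message
    }

    private void sendToExchange(String message) {
        sentToExchange++;        // placeholder for real exchange I/O
    }

    public static void main(String[] args) {
        OrderHandler handler = new OrderHandler();
        for (int i = 0; i < 20_000; i++) {
            handler.handle("warmup-" + i, true);
        }
        System.out.println("sent after warmup: " + handler.sentToExchange);
        handler.handle("real-order", false);
        System.out.println("sent after first real message: " + handler.sentToExchange);
    }
}
```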
  • 79.Java at Market Open [Chart with “Market Open” marked]
  • 80.Java’s “Just In Time” Reality
      • Starts slow, learns fast
      • Lazy loading & initialization
      • Aggressively optimized for the common case
      • (Temporarily) reverts to slower execution to adapt
      • [Chart labels: Warmup, Deoptimization]
  • 81.Logging and “replaying” optimizations
      • Log optimization information
          • Record ongoing optimization decisions and stats
          • Record optimization dependencies
          • Establish “stable optimization state” at end of previous run
      • Read prior logs at startup
          • “Prime” JVM with knowledge of prior stable optimizations
          • Apply optimizations as their dependencies get resolved
      • Build workflow to promote confidence
          • Let you know if/when all optimizations have been applied
          • If some optimizations haven’t been applied, let you know why…
  • 82.Avoid deoptimization [Chart: Java at “Load Start”, with “Load Start” and “Deoptimization” marked]
  • 83.Java at “Load Start”, with de-optimization avoided [Chart with “Load Start” marked]
  • 84.Warmup? Be Fast From The Start [Chart: Java at “Load Start”, with “Load Start” marked]
  • 85.Java at “Load Start”, with pre-loading of prior optimizations [Chart with “Load Start” marked]
  • 86.Speed improvements [Chart: Speed (with contribution by optimization level) over a 0 to 100 horizontal axis, split into Interpreted, Tier 1 (profiling), and Optimized]
  • 87.Speed (with contribution by optimization level) [Chart over a 0 to 100 horizontal axis, split into Interpreted, Tier 1 (profiling), Optimized, and Optimized (Zing), with “Optimization Replay” and “Better JIT’ing” marked]
  • 88.Speed (with contribution by optimization level) [Chart over a 0 to 100 horizontal axis, split into Interpreted, Tier 1 (profiling), Optimized, and Optimized (Zing), with “Optimization Replay”, “Better JIT’ing”, and “GC (without the pauses)” marked]
  • 89.C4 Garbage Collector: ELIMINATES Garbage Collection as a concern for enterprise applications
  • 90.A simple visual summary [Two charts: “This is on HotSpot” and “This is on Zing”] Any Questions?
  • 91.GC Tuning
  • 92.Java GC tuning is “hard”… Examples of actual command line GC tuning parameters:
      java -Xmx12g -XX:MaxPermSize=64M -XX:PermSize=32M -XX:MaxNewSize=2g -XX:NewSize=1g -XX:SurvivorRatio=128 -XX:+UseParNewGC