Zhouchen Lin (Peking University): First-Order Optimization Methods in Machine Learning
2020-02-27
- 1. First-Order Optimization Methods in Machine Learning, Zhouchen Lin (林宙辰), Peking University, Aug. 27, 2016
- 2. Outline. Nonlinear Optimization: • Past (–1990s) • Present (1990s–now) • Future (now–). The generic problem: $\min_x f(x), \ \mathrm{s.t.} \ x \in \mathcal{C}$.
- 3. Nonlinear Optimization, Past (–1990s) • Major theories and techniques of optimization were completed by the 1960s • 1960s–1990s: a boom due to the invention of computers • Zeroth-order methods: Interpolation Methods, Pattern Search Methods, …… • First-order methods: Coordinate Descent, Conjugate Gradient, (Stochastic) Gradient/Subgradient Descent, Ellipsoid Method, Quasi-Newton Methods, (Augmented) Lagrangian Method of Multipliers, ……
- 4. Nonlinear Optimization, Past (–1990s) • Second-order methods: Newton's Methods, Sequential Quadratic Programming, Interior Point Methods, ……
- 5. Nonlinear Optimization: Why first-order methods? • They converge relatively fast (in terms of #iterations) • They reach acceptable accuracy for machine learning • They are relatively cheap in storage and computation (per-iteration complexity) • This matters for the big data era.
- 6. Nonlinear Optimization, Present (1990s–now) • Revival and refinement of existing techniques: a spiral ascent • Application-driven: Support Vector Machines, Deep Learning, Big Data.
- 7. Present (1990s–now): Advances in first-order methods • Smooth → Nonsmooth • Convex → Nonconvex • Deterministic → Stochastic • One/Two Blocks → Multiple Blocks • Synchronous → Asynchronous • Convergence & Convergence Rate.
- 8. Present (1990s–now): Smooth → Nonsmooth • Smooth: use the gradient $f'(x_k)$ • Nonsmooth (motivated by sparsity and low-rankness): use a subgradient $g_k \in \partial f(x_k)$, the proximal operator $\min_x f(x) + \frac{\alpha}{2}\|x - w\|^2$, or linearization $f(x) \le f(x_k) + \langle f'(x_k), x - x_k \rangle + \frac{L}{2}\|x - x_k\|^2$; a sketch of the resulting proximal gradient step follows below.
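The nonsmooth machinery above combines into the proximal gradient step: linearize the smooth part and keep the nonsmooth part intact. A minimal Python sketch for the $\ell_1$-regularized case, where the proximal operator has the closed-form soft-thresholding solution; `grad_f`, `L` (the gradient Lipschitz constant), and `lam` are placeholders the caller must supply:

```python
import numpy as np

def soft_threshold(w, t):
    """Proximal operator of t*||.||_1: argmin_x t*||x||_1 + 0.5*||x - w||^2."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def proximal_gradient_step(x_k, grad_f, L, lam):
    # Minimize the linearized upper bound
    #   f(x_k) + <f'(x_k), x - x_k> + (L/2)*||x - x_k||^2 + lam*||x||_1,
    # whose minimizer is soft-thresholding applied to a plain gradient step.
    return soft_threshold(x_k - grad_f(x_k) / L, lam / L)
```

Iterating this step is the classical ISTA scheme; the linearization inequality on the slide is exactly the upper bound being minimized.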
- 9. Present (1990s–now): Convex → Nonconvex • Convex: the global optimum is attainable • Nonconvex: one instead proves a nonincreasing objective, that cluster points are KKT points, or convergence to a KKT point under the Kurdyka-Łojasiewicz (KL) condition $\varphi'\big(f(x) - f(x^*)\big)\,\mathrm{dist}\big(0, \partial f(x)\big) \ge 1$; see the restatement below.
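For concreteness, the KL condition can be restated with its desingularizing function made explicit; the power form of $\varphi$ below is an illustrative assumption (it holds, e.g., for semi-algebraic functions), not part of the slide:

```latex
\varphi'\big(f(x)-f(x^*)\big)\,\operatorname{dist}\big(0,\partial f(x)\big) \ge 1,
\qquad \varphi(s) = c\,s^{1-\theta},\quad c > 0,\ \theta \in [0,1),
```

where $\varphi$ is concave with $\varphi(0)=0$; the exponent $\theta$ then governs how fast the iterates converge.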
- 10. Present (1990s–now): Deterministic → Stochastic • Deterministic: use the full $f(x_k)$, $f'(x_k)$ • Stochastic: Stochastic Gradient/ADMM replaces the full gradient $\frac{1}{n}\sum_{i=1}^{n} f_i'(x)$ by a sampled $f_{i_k}'(x)$, as in the sketch below • Variance Reduction • Stochastic Matrix Computation, e.g., Randomized SVD.
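A minimal sketch of the gradient substitution just described, using least squares purely as an illustrative finite sum (the problem sizes and the diminishing step size are arbitrary assumptions):

```python
import numpy as np

# Finite sum f(x) = (1/n) * sum_i 0.5 * (a_i^T x - b_i)^2.
rng = np.random.default_rng(0)
n, d = 500, 20
A = rng.standard_normal((n, d))
x_true = rng.standard_normal(d)
b = A @ x_true + 0.01 * rng.standard_normal(n)

x = np.zeros(d)
for k in range(1, 20001):
    i = rng.integers(n)                # sample one component f_{i_k}
    g = (A[i] @ x - b[i]) * A[i]       # f_{i_k}'(x): unbiased estimate of the full gradient
    x -= (1.0 / (k + 100)) * g         # diminishing O(1/k) step size

print("distance to x_true:", np.linalg.norm(x - x_true))
```

Each iteration touches one data point instead of all $n$, which is what makes the stochastic variant attractive at scale.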
- 11. Present (1990s–now): One/Two Blocks → Multiple Blocks (ADMM) • One/two blocks, serial update: $\min_x f_1(x_1) + f_2(x_2), \ \mathrm{s.t.} \ A_1 x_1 + A_2 x_2 = b$ • Multiple blocks, parallel update: $\min_x f_1(x_1) + \cdots + f_n(x_n), \ \mathrm{s.t.} \ A_1 x_1 + \cdots + A_n x_n = b$ $(n > 2)$; a two-block sketch follows below.
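A minimal two-block ADMM sketch on the lasso, $\min_{x,z} \frac{1}{2}\|Ax-b\|^2 + \lambda\|z\|_1$ s.t. $x - z = 0$, with serial x/z updates and a multiplier step; `rho` and `lam` are untuned illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, lam, rho = 100, 30, 0.1, 1.0
A = rng.standard_normal((m, d))
b = rng.standard_normal(m)

x = z = u = np.zeros(d)                      # u is the scaled multiplier
AtA, Atb = A.T @ A, A.T @ b
M = np.linalg.inv(AtA + rho * np.eye(d))     # factor once, reuse every iteration
for _ in range(200):
    x = M @ (Atb + rho * (z - u))            # x-update: quadratic subproblem
    w = x + u
    z = np.sign(w) * np.maximum(np.abs(w) - lam / rho, 0.0)   # z-update: l1 prox
    u = u + x - z                            # multiplier (dual ascent) update

print("nonzeros in z:", np.count_nonzero(np.abs(z) > 1e-6))
```

With more than two blocks, the naive serial extension of this scheme can fail to converge, which is exactly why the multi-block, parallel-update variants on this slide required new analysis.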
- 12. Present (1990s–now): Synchronous → Asynchronous • Synchronous: all blocks share one iteration counter, $x_1^k, x_2^k, \cdots, x_n^k \to x_1^{k+1}, x_2^{k+1}, \cdots, x_n^{k+1}$ • Asynchronous: each block advances its own counter, $x_1^{k_1}, x_2^{k_2}, \cdots, x_n^{k_n} \to x_1^{k_1+1}, x_2^{k_2+1}, \cdots, x_n^{k_n+1}$ (superscripts are iteration numbers); a simulated illustration follows below.
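A loose simulation of the asynchronous regime, assuming bounded staleness (the delay bound, step size, and least-squares objective are all illustrative assumptions; real implementations use lock-free shared memory rather than an explicit history):

```python
import numpy as np

# Asynchronous-style coordinate gradient descent on 0.5*||A x - b||^2:
# each update reads a possibly stale copy of x, mimicking workers that
# write without synchronization.
rng = np.random.default_rng(0)
n, d = 200, 10
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)
lr = 1.0 / np.square(A).sum(axis=0).max()    # safe coordinate-wise step size

x = np.zeros(d)
history = [x.copy()]                         # past iterates to draw stale reads from
for _ in range(3000):
    j = rng.integers(d)                      # coordinate picked by some "worker"
    delay = rng.integers(min(3, len(history)))   # bounded staleness
    stale = history[-1 - delay]              # the worker's outdated view of x
    g_j = A[:, j] @ (A @ stale - b)          # partial gradient at the stale point
    x[j] -= lr * g_j                         # unsynchronized in-place write
    history.append(x.copy())

print("least-squares residual:", np.linalg.norm(A @ x - b))
```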
- 13. Present (1990s–now): Convergence & Convergence Rate • Improved convergence for nonconvex optimization (see before) • Weaker convergence conditions for convex optimization • Accelerated convergence: Nesterov extrapolation, $O(k^{-1}) \to O(k^{-2})$, and variance reduction, $O(k^{-1/2}) \to O(k^{-1})$; a sketch of Nesterov extrapolation follows below.
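A minimal sketch of Nesterov extrapolation on a smooth convex quadratic (the test function and the step size $1/L$ are illustrative assumptions): the gradient is evaluated at an extrapolated point $y_k$ rather than at $x_k$, which lifts the rate from $O(k^{-1})$ to $O(k^{-2})$:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50
B = rng.standard_normal((d, d))
Q = B.T @ B                                   # f(x) = 0.5 * x^T Q x, minimized at 0
L = np.linalg.eigvalsh(Q).max()               # gradient Lipschitz constant

x = y = np.ones(d)
t = 1.0
for _ in range(500):
    x_new = y - (Q @ y) / L                   # gradient step at the extrapolated point
    t_new = (1 + np.sqrt(1 + 4 * t * t)) / 2  # Nesterov's momentum schedule
    y = x_new + ((t - 1) / t_new) * (x_new - x)   # extrapolation
    x, t = x_new, t_new

print("f(x):", 0.5 * x @ Q @ x)
```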
- 14. Nonlinear Optimization, Future (now–) • Super-Large Scale Optimization • Fully Randomized Algorithms with O(n·polylog(n)) complexity • Quantum Optimization • Quantumization of Classical Algorithms • Purely Quantum Algorithms.
- 15.Thanks!