Zhouchen Lin (Peking University): First-Order Optimization Algorithms in Machine Learning

2020-02-27

  • 1. First-Order Optimization Methods in Machine Learning. Zhouchen Lin (林宙辰), Peking University, Aug. 27, 2016
  • 2. Outline. Nonlinear Optimization:
      • Past (–1990s)
      • Present (1990s–now)
      • Future (now–)
    The general problem: $\min_x f(x), \;\text{s.t.}\; x \in \mathcal{C}$.
  • 3. Nonlinear Optimization: Past (–1990s)
      • Major theories and techniques of optimization completed – 1960s
      • 1960s – 1990s: boom due to the invention of computers
      • Zeroth-order methods
          • Interpolation Methods
          • Pattern Search Methods
          • ……
      • First-order methods
          • Coordinate Descent
          • Conjugate Gradient
          • (Stochastic) Gradient/Subgradient Descent
          • Ellipsoid Method
          • Quasi-Newton Methods
          • (Augmented) Lagrangian Method of Multipliers
          • ……
  • 4. Nonlinear Optimization: Past (–1990s)
      • Second-order methods
          • Newton's Methods
          • Sequential Quadratic Programming
          • Interior Point Methods
          • ……
  • 5. Nonlinear Optimization: Why first-order methods?
      • Converge relatively fast (in #iterations)
      • Acceptable accuracy for machine learning
      • Relatively cheap in storage and computation (per-iteration complexity)
      • Important for the big-data era
  • 6. Nonlinear Optimization: Present (1990s–now)
      • Revival and refinement of existing techniques
          • Spiral Ascent
      • Application driven
          • Support Vector Machines
          • Deep Learning
          • Big Data
  • 7. Present (1990s–now): Advances in first-order methods
      • Smooth → Nonsmooth
      • Convex → Nonconvex
      • Deterministic → Stochastic
      • One/Two Blocks → Multiple Blocks
      • Synchronous → Asynchronous
      • Convergence & Convergence Rate
  • 8. Present (1990s–now): Smooth → Nonsmooth
      • Smooth: gradient $f'(x_k)$
      • Nonsmooth (e.g., penalties promoting sparsity & low-rankness):
          • Subgradient: $g_k \in \partial f(x_k)$
          • Proximal operator: $\min_x f(x) + \frac{\alpha}{2}\|x - w\|^2$
          • Linearization: $f(x) \le f(x_k) + \langle f'(x_k), x - x_k \rangle + \frac{\alpha}{2}\|x - x_k\|^2$
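The proximal operator on this slide has a closed form for common nonsmooth regularizers. As a minimal sketch (not from the slides), here it is for $f = t\|\cdot\|_1$, where it reduces to soft-thresholding; the function name `prox_l1` and the test vector are illustrative:

```python
import numpy as np

def prox_l1(w, t):
    """Proximal operator of t*||x||_1, i.e. argmin_x t*||x||_1 + 0.5*||x - w||^2.
    Closed form: entrywise soft-thresholding (illustrative example)."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

w = np.array([3.0, -0.5, 1.2])
print(prox_l1(w, 1.0))  # each entry is shrunk toward zero by the threshold 1.0
```

Entries with magnitude below the threshold are set exactly to zero, which is how proximal methods produce sparse iterates.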
  • 9. Present (1990s–now): Convex → Nonconvex
      • Convex: global optimum
      • Nonconvex:
          • Nonincreasing objective
          • Cluster points are KKT points
          • Converges to a KKT point under the Kurdyka–Łojasiewicz (KL) condition:
            $\varphi'\big(f(x) - f(x^*)\big)\,\mathrm{dist}\big(0, \partial f(x)\big) \ge 1$
  • 10. Present (1990s–now): Deterministic → Stochastic
      • Deterministic: $f(x_k)$, $f'(x_k)$
      • Stochastic:
          • Stochastic Gradient/ADMM: $\frac{1}{n}\sum_{i=1}^{n} f_i'(x) \;\to\; f_{i_k}'(x)$
          • Variance Reduction
          • Stochastic Matrix Computation
          • Randomized SVD
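The substitution on this slide, replacing the full gradient $\frac{1}{n}\sum_i f_i'(x)$ with a single sampled term $f_{i_k}'(x)$, can be sketched on a least-squares problem. The problem instance below is illustrative (a noiseless, consistent system, so stochastic gradient descent can reach the exact solution):

```python
import numpy as np

# min_x (1/n) * sum_i 0.5*(a_i^T x - b_i)^2, solved with one sampled
# component gradient f_{i_k}'(x) per iteration instead of the full sum.
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 5))
x_true = rng.standard_normal(5)
b = A @ x_true                           # consistent system (illustrative)

x = np.zeros(5)
for k in range(5000):
    i = rng.integers(100)                # uniformly sampled index i_k
    x -= 0.05 * (A[i] @ x - b[i]) * A[i] # stochastic gradient step

print(np.linalg.norm(x - x_true))        # distance to the true solution
```

Each iteration touches one row of `A` instead of all 100, which is the per-iteration saving that makes stochastic methods attractive for big data.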
  • 11. Present (1990s–now): One/Two Blocks → Multiple Blocks (ADMM)
      • One/two blocks, serial update:
        $\min_x f_1(x_1) + f_2(x_2), \;\text{s.t.}\; A_1(x_1) + A_2(x_2) = b$
      • Multiple blocks, parallel update:
        $\min_x f_1(x_1) + \cdots + f_n(x_n), \;\text{s.t.}\; A_1(x_1) + \cdots + A_n(x_n) = b \quad (n > 2)$
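A standard two-block instance of this formulation is the lasso: split the smooth loss and the $\ell_1$ term into $f_1(x) = \frac{1}{2}\|Ax-b\|^2$ and $f_2(z) = \lambda\|z\|_1$ under the constraint $x - z = 0$. A minimal sketch, with illustrative function names and parameters:

```python
import numpy as np

def soft(v, t):
    """Entrywise soft-thresholding: the prox of t*||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_admm(A, b, lam=0.1, rho=1.0, iters=200):
    """Two-block ADMM sketch for min 0.5*||Ax-b||^2 + lam*||z||_1, s.t. x = z."""
    n = A.shape[1]
    x = z = u = np.zeros(n)
    AtA, Atb = A.T @ A, A.T @ b
    M = np.linalg.inv(AtA + rho * np.eye(n))  # factor reused across x-updates
    for _ in range(iters):
        x = M @ (Atb + rho * (z - u))         # smooth block: ridge-type solve
        z = soft(x + u, lam / rho)            # nonsmooth block: L1 prox
        u = u + x - z                         # scaled dual (multiplier) update
    return z
```

Caching the matrix inverse outside the loop reflects the usual ADMM trade: one expensive factorization, then cheap per-iteration updates of each block in turn.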
  • 12. Present (1990s–now): Synchronous → Asynchronous (superscripts are iteration numbers)
      • Synchronous: $x_1^k, x_2^k, \dots, x_n^k \;\to\; x_1^{k+1}, x_2^{k+1}, \dots, x_n^{k+1}$
      • Asynchronous: $x_1^{k_1}, x_2^{k_2}, \dots, x_n^{k_n} \;\to\; x_1^{k_1+1}, x_2^{k_2+1}, \dots, x_n^{k_n+1}$
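Asynchronous updates with per-worker iteration counters can be sketched in the lock-free Hogwild! style: several workers share one parameter vector, each reading a possibly stale copy and writing without synchronization. The problem instance and step sizes below are illustrative:

```python
import threading
import numpy as np

# Shared parameter vector updated lock-free by several workers on a
# least-squares problem (illustrative setup, not from the slides).
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 10))
b = A @ rng.standard_normal(10)          # consistent system
x = np.zeros(10)                         # shared state, no lock

def worker(seed, steps=3000, lr=0.01):
    r = np.random.default_rng(seed)
    for _ in range(steps):
        i = r.integers(len(b))
        g = (A[i] @ x - b[i]) * A[i]     # read may observe a stale x
        np.subtract(x, lr * g, out=x)    # in-place write, races tolerated

threads = [threading.Thread(target=worker, args=(s,)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(np.linalg.norm(A @ x - b))         # residual after asynchronous updates
```

No worker waits for the others, so each coordinate of the shared state effectively carries its own iteration counter, exactly the $x_j^{k_j}$ picture on the slide.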
  • 13. Present (1990s–now): Convergence & Convergence Rate
      • Improved convergence for nonconvex optimization (see above)
      • Weaker convergence conditions for convex optimization
      • Accelerated convergence: $O(k^{-1}) \to O(k^{-2})$, $O(k^{-1/2}) \to O(k^{-1})$
          • Nesterov Extrapolation
          • Variance Reduction
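Nesterov extrapolation is the mechanism behind the $O(k^{-1}) \to O(k^{-2})$ upgrade for smooth convex problems. A minimal comparison on a deterministic ill-conditioned quadratic (the instance is illustrative; the $t_k$ update is the standard accelerated-gradient sequence):

```python
import numpy as np

# Minimize f(x) = 0.5 * x^T H x with diagonal H (condition number 100),
# comparing plain gradient descent with Nesterov's accelerated variant.
h = np.linspace(0.01, 1.0, 20)       # Hessian eigenvalues (illustrative)
f = lambda x: 0.5 * np.sum(h * x * x)
grad = lambda x: h * x
L = h.max()                          # Lipschitz constant of the gradient

def gd(k=100):
    x = np.ones(20)
    for _ in range(k):
        x = x - grad(x) / L          # plain gradient step: O(1/k)
    return f(x)

def agd(k=100):
    x = y = np.ones(20)
    t = 1.0
    for _ in range(k):
        x_new = y - grad(y) / L                      # step at extrapolated point
        t_new = (1 + np.sqrt(1 + 4 * t * t)) / 2
        y = x_new + (t - 1) / t_new * (x_new - x)    # Nesterov extrapolation
        x, t = x_new, t_new
    return f(x)

print(gd(), agd())   # the accelerated method reaches a much smaller value
```

The extrapolation step costs one extra vector addition per iteration, which is why acceleration is essentially free for first-order methods.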
  • 14. Nonlinear Optimization: Future (now–)
      • Super-large-scale optimization
          • Fully randomized algorithms, O(n·polylog(n))
      • Quantum optimization
          • Quantumization of classic algorithms
          • Purely quantum algorithms
  • 15. Thanks!