Zhouchen Lin (Peking University): First-Order Optimization Algorithms in Machine Learning

2020-02-27

  • 1. First-Order Optimization Methods in Machine Learning. Zhouchen Lin (林宙辰), Peking University, Aug. 27, 2016
  • 2. Outline. Nonlinear Optimization:
      • Past (–1990s)
      • Present (1990s–now)
      • Future (now–)
    The general problem: $\min_x f(x), \;\text{s.t.}\; x \in \mathcal{C}$.
  • 3. Nonlinear Optimization: Past (–1990s)
      • Major theories and techniques of optimization completed – 1960s
      • 1960s – 1990s: boom due to the invention of computers
      • Zeroth-order methods
          • Interpolation Methods
          • Pattern Search Methods
          • ……
      • First-order methods
          • Coordinate Descent
          • Conjugate Gradient
          • (Stochastic) Gradient/Subgradient Descent
          • Ellipsoid Method
          • Quasi-Newton Methods
          • (Augmented) Lagrangian Method of Multipliers
          • ……
  • 4. Nonlinear Optimization: Past (–1990s)
      • Second-order methods
          • Newton's Methods
          • Sequential Quadratic Programming
          • Interior Point Methods
          • ……
  • 5. Nonlinear Optimization: Why first-order methods?
      • Converge relatively fast (in #iterations)
      • Acceptable accuracy for machine learning
      • Relatively cheap in storage and computation (per-iteration complexity)
      • Important for the big-data era
  • 6. Nonlinear Optimization: Present (1990s–now)
      • Revival and refinement of existing techniques
          • Spiral Ascent
      • Application driven
          • Support Vector Machines
          • Deep Learning
          • Big Data
  • 7. Present (1990s–now): Advances in first-order methods
      • Smooth → Nonsmooth
      • Convex → Nonconvex
      • Deterministic → Stochastic
      • One/Two Blocks → Multiple Blocks
      • Synchronous → Asynchronous
      • Convergence & Convergence Rate
  • 8. Present (1990s–now): Smooth → Nonsmooth
      • Smooth: gradient $f'(x_k)$
      • Nonsmooth (e.g., penalties promoting sparsity & low-rankness):
          • Subgradient: $g_k \in \partial f(x_k)$
          • Proximal operator: $\min_x f(x) + \frac{\alpha}{2}\|x - w\|^2$
          • Linearization: $f(x) \le f(x_k) + \langle f'(x_k), x - x_k \rangle + \frac{\alpha}{2}\|x - x_k\|^2$
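The proximal operator on this slide has a closed form for common nonsmooth regularizers. As a minimal sketch (not from the slides), here it is for $f = t\|\cdot\|_1$, where it reduces to soft-thresholding; the function name `prox_l1` and the test vector are illustrative:

```python
import numpy as np

def prox_l1(w, t):
    """Proximal operator of t*||x||_1, i.e. argmin_x t*||x||_1 + 0.5*||x - w||^2.
    Closed form: entrywise soft-thresholding (illustrative example)."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

w = np.array([3.0, -0.5, 1.2])
print(prox_l1(w, 1.0))  # each entry is shrunk toward zero by the threshold 1.0
```

Entries with magnitude below the threshold are set exactly to zero, which is how proximal methods produce sparse iterates.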
  • 9. Present (1990s–now): Convex → Nonconvex
      • Convex: global optimum
      • Nonconvex:
          • Nonincreasing objective
          • Cluster points are KKT points
          • Converges to a KKT point under the Kurdyka–Łojasiewicz (KL) condition:
            $\varphi'\big(f(x) - f(x^*)\big)\,\mathrm{dist}\big(0, \partial f(x)\big) \ge 1$
  • 10. Present (1990s–now): Deterministic → Stochastic
      • Deterministic: $f(x_k)$, $f'(x_k)$
      • Stochastic:
          • Stochastic Gradient/ADMM: $\frac{1}{n}\sum_{i=1}^{n} f_i'(x) \;\to\; f_{i_k}'(x)$
          • Variance Reduction
          • Stochastic Matrix Computation
          • Randomized SVD
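The substitution on this slide, replacing the full gradient $\frac{1}{n}\sum_i f_i'(x)$ with a single sampled term $f_{i_k}'(x)$, can be sketched on a least-squares problem. The problem instance below is illustrative (a noiseless, consistent system, so stochastic gradient descent can reach the exact solution):

```python
import numpy as np

# min_x (1/n) * sum_i 0.5*(a_i^T x - b_i)^2, solved with one sampled
# component gradient f_{i_k}'(x) per iteration instead of the full sum.
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 5))
x_true = rng.standard_normal(5)
b = A @ x_true                           # consistent system (illustrative)

x = np.zeros(5)
for k in range(5000):
    i = rng.integers(100)                # uniformly sampled index i_k
    x -= 0.05 * (A[i] @ x - b[i]) * A[i] # stochastic gradient step

print(np.linalg.norm(x - x_true))        # distance to the true solution
```

Each iteration touches one row of `A` instead of all 100, which is the per-iteration saving that makes stochastic methods attractive for big data.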
  • 11. Present (1990s–now): One/Two Blocks → Multiple Blocks (ADMM)
      • One/two blocks, serial update:
        $\min_x f_1(x_1) + f_2(x_2), \;\text{s.t.}\; A_1(x_1) + A_2(x_2) = b$
      • Multiple blocks, parallel update:
        $\min_x f_1(x_1) + \cdots + f_n(x_n), \;\text{s.t.}\; A_1(x_1) + \cdots + A_n(x_n) = b \quad (n > 2)$
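A standard two-block instance of this formulation is the lasso: split the smooth loss and the $\ell_1$ term into $f_1(x) = \frac{1}{2}\|Ax-b\|^2$ and $f_2(z) = \lambda\|z\|_1$ under the constraint $x - z = 0$. A minimal sketch, with illustrative function names and parameters:

```python
import numpy as np

def soft(v, t):
    """Entrywise soft-thresholding: the prox of t*||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_admm(A, b, lam=0.1, rho=1.0, iters=200):
    """Two-block ADMM sketch for min 0.5*||Ax-b||^2 + lam*||z||_1, s.t. x = z."""
    n = A.shape[1]
    x = z = u = np.zeros(n)
    AtA, Atb = A.T @ A, A.T @ b
    M = np.linalg.inv(AtA + rho * np.eye(n))  # factor reused across x-updates
    for _ in range(iters):
        x = M @ (Atb + rho * (z - u))         # smooth block: ridge-type solve
        z = soft(x + u, lam / rho)            # nonsmooth block: L1 prox
        u = u + x - z                         # scaled dual (multiplier) update
    return z
```

Caching the matrix inverse outside the loop reflects the usual ADMM trade: one expensive factorization, then cheap per-iteration updates of each block in turn.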
  • 12. Present (1990s–now): Synchronous → Asynchronous (superscripts are iteration numbers)
      • Synchronous: $x_1^k, x_2^k, \dots, x_n^k \;\to\; x_1^{k+1}, x_2^{k+1}, \dots, x_n^{k+1}$
      • Asynchronous: $x_1^{k_1}, x_2^{k_2}, \dots, x_n^{k_n} \;\to\; x_1^{k_1+1}, x_2^{k_2+1}, \dots, x_n^{k_n+1}$
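Asynchronous updates with per-worker iteration counters can be sketched in the lock-free Hogwild! style: several workers share one parameter vector, each reading a possibly stale copy and writing without synchronization. The problem instance and step sizes below are illustrative:

```python
import threading
import numpy as np

# Shared parameter vector updated lock-free by several workers on a
# least-squares problem (illustrative setup, not from the slides).
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 10))
b = A @ rng.standard_normal(10)          # consistent system
x = np.zeros(10)                         # shared state, no lock

def worker(seed, steps=3000, lr=0.01):
    r = np.random.default_rng(seed)
    for _ in range(steps):
        i = r.integers(len(b))
        g = (A[i] @ x - b[i]) * A[i]     # read may observe a stale x
        np.subtract(x, lr * g, out=x)    # in-place write, races tolerated

threads = [threading.Thread(target=worker, args=(s,)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(np.linalg.norm(A @ x - b))         # residual after asynchronous updates
```

No worker waits for the others, so each coordinate of the shared state effectively carries its own iteration counter, exactly the $x_j^{k_j}$ picture on the slide.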
  • 13. Present (1990s–now): Convergence & Convergence Rate
      • Improved convergence for nonconvex optimization (see above)
      • Weaker convergence conditions for convex optimization
      • Accelerated convergence: $O(k^{-1}) \to O(k^{-2})$, $O(k^{-1/2}) \to O(k^{-1})$
          • Nesterov Extrapolation
          • Variance Reduction
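Nesterov extrapolation is the mechanism behind the $O(k^{-1}) \to O(k^{-2})$ upgrade for smooth convex problems. A minimal comparison on a deterministic ill-conditioned quadratic (the instance is illustrative; the $t_k$ update is the standard accelerated-gradient sequence):

```python
import numpy as np

# Minimize f(x) = 0.5 * x^T H x with diagonal H (condition number 100),
# comparing plain gradient descent with Nesterov's accelerated variant.
h = np.linspace(0.01, 1.0, 20)       # Hessian eigenvalues (illustrative)
f = lambda x: 0.5 * np.sum(h * x * x)
grad = lambda x: h * x
L = h.max()                          # Lipschitz constant of the gradient

def gd(k=100):
    x = np.ones(20)
    for _ in range(k):
        x = x - grad(x) / L          # plain gradient step: O(1/k)
    return f(x)

def agd(k=100):
    x = y = np.ones(20)
    t = 1.0
    for _ in range(k):
        x_new = y - grad(y) / L                      # step at extrapolated point
        t_new = (1 + np.sqrt(1 + 4 * t * t)) / 2
        y = x_new + (t - 1) / t_new * (x_new - x)    # Nesterov extrapolation
        x, t = x_new, t_new
    return f(x)

print(gd(), agd())   # the accelerated method reaches a much smaller value
```

The extrapolation step costs one extra vector addition per iteration, which is why acceleration is essentially free for first-order methods.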
  • 14. Nonlinear Optimization: Future (now–)
      • Super-large-scale optimization
          • Fully randomized algorithms, O(n·polylog(n))
      • Quantum optimization
          • Quantumization of classic algorithms
          • Purely quantum algorithms
  • 15. Thanks!