Week 2

Topic: Gradient Flow

Keynote Speaker: Yicheng Wu

Time: Jul 29, 20:00 - 21:30

Venue: Room 341, School of Business

Tencent Meeting: #907-2153-6929

Compendium

Introduction

Suppose $\phi\in\mathbb{R}^D$ and $L(\phi):\mathbb{R}^D\to\mathbb{R}$ is smooth. A gradient flow is a smooth curve $\phi(t):\mathbb{R}\to\mathbb{R}^D$ such that

$$\frac{d\phi}{dt}=-\frac{\partial L}{\partial\phi}$$
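
As a concrete illustration (not part of the original notes), take the quadratic loss $L(\phi)=\tfrac{1}{2}\|\phi\|^2$, for which the gradient flow ODE can be solved in closed form: $$\frac{d\phi}{dt}=-\phi \quad\Longrightarrow\quad \phi(t)=\phi(0)\,e^{-t},$$ so the flow decays exponentially toward the minimizer $\phi^\star=\mathbf{0}$.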

1. Problem Setup

Suppose now that $I$ pairs of independent sample points $\{\mathbf{x}_{i},y_{i}\}_{i=1}^{I}$ have been obtained. A model $f[\mathbf{x},\phi]$ needs to be used to fit the observed data.
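
For instance, with a least-squares criterion (one common choice, relevant to the linear-regression discussion in Section 3 below) the fitting problem amounts to minimizing $$L(\phi)=\sum_{i=1}^{I}\bigl(f[\mathbf{x}_{i},\phi]-y_{i}\bigr)^{2}$$ over the parameters $\phi$.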

2. Gradient Descent Algorithms

  • Gradient Descent Variants
    • Stochastic Gradient Descent
      • Mini-batch Stochastic Gradient Descent
  • Momentum Algorithm
    • Standard Momentum Algorithm
    • Nesterov Accelerated Gradient
  • Adaptive Subgradient Method
    • Adagrad
    • Adadelta
    • RMSprop
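
For reference, here is a minimal NumPy sketch of three of the update rules listed above (plain SGD, standard momentum, and Adagrad). The function names and hyperparameter values are chosen for illustration only and are not taken from the talk.

```python
import numpy as np

def sgd_step(phi, grad, lr=0.1):
    # Plain (stochastic) gradient descent: phi <- phi - lr * grad
    return phi - lr * grad

def momentum_step(phi, grad, v, lr=0.1, beta=0.9):
    # Standard momentum: accumulate an exponentially decaying velocity,
    # then move the parameters along that velocity.
    v = beta * v - lr * grad
    return phi + v, v

def adagrad_step(phi, grad, accum, lr=0.1, eps=1e-8):
    # Adagrad: per-coordinate step sizes shrink as squared gradients accumulate.
    accum = accum + grad ** 2
    return phi - lr * grad / (np.sqrt(accum) + eps), accum

# Tiny demo on L(phi) = 0.5 * ||phi||^2, whose gradient is phi itself.
phi = np.array([2.0, -3.0])
v = np.zeros_like(phi)
for _ in range(200):
    grad = phi                      # gradient of the quadratic loss
    phi, v = momentum_step(phi, grad, v)
print(phi)                          # approaches the minimizer at the origin
```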

3. Gradient Flow in Linear Regression

Gradient Descent Update Rule: $$\phi_{t+1}=\phi_t-\alpha\cdot\frac{\partial L}{\partial\phi},$$ where $\phi_t$ represents the parameters at time $t$ and $\alpha$ is termed the learning rate. Rearranging gives $$\frac{\phi_{t+1}-\phi_t}{\alpha}=-\frac{\partial L}{\partial\phi},$$ and when an infinitesimally small learning rate $\alpha$ is employed, the left-hand side becomes a time derivative: $$\frac{d\phi}{dt}=-\frac{\partial L}{\partial\phi}.$$ This ordinary differential equation (ODE) is known as gradient flow.
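
To make the "infinitesimally small learning rate" limit concrete, the sketch below (a minimal NumPy example with invented data and hyperparameters, not code from the talk) runs discrete gradient descent on a one-dimensional least-squares problem and compares the result with the closed-form solution of the corresponding gradient-flow ODE.

```python
import numpy as np

# Toy 1-D linear regression data (illustrative values only).
x = np.array([0.5, 1.0, 1.5, 2.0])
y = np.array([1.1, 1.9, 3.2, 3.9])

# Least-squares loss L(phi) = 0.5 * sum((phi * x_i - y_i)^2)
# has gradient dL/dphi = a * phi - b with:
a = np.sum(x ** 2)
b = np.sum(x * y)
phi_star = b / a              # minimizer of L

alpha = 1e-3                  # small learning rate
phi0 = 0.0
steps = 2000

# Discrete gradient descent: phi_{t+1} = phi_t - alpha * dL/dphi.
phi = phi0
for _ in range(steps):
    phi -= alpha * (a * phi - b)

# Exact gradient-flow solution phi(t) = phi* + (phi0 - phi*) * exp(-a t),
# evaluated at the continuous time t = steps * alpha covered by the iterates.
t = steps * alpha
phi_flow = phi_star + (phi0 - phi_star) * np.exp(-a * t)

print(f"gradient descent : {phi:.6f}")
print(f"gradient flow    : {phi_flow:.6f}")
print(f"minimizer        : {phi_star:.6f}")
```

With a small $\alpha$, the gradient descent iterate after $k$ steps closely tracks the gradient-flow trajectory at time $t=k\alpha$, which is exactly the limiting argument above.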
