Why Momentum Really Works. The math of gradient descent with momentum. (distill.pub)
posted 12mo ago with 3 replies

I keep coming back to this classic every time I end up thinking about gradient descent optimizers. Unfortunately we're still left with this pesky "learning rate" parameter that has to be set empirically by what causes convergence vs divergence. ...
12mo ago

Do you know if there is any predictability to the effect? I would guess it's related to catastrophic forgetting, aka the reason stochastic gradient descent has to be stochastic. Basically if you update on one set of evidence without locking in those learnin...
11mo ago
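
To make the learning-rate point from the first comment concrete, here is a minimal sketch of momentum on a toy 1-D quadratic. It is not from the thread: the function name `momentum_descent`, its parameters, and the chosen values are all hypothetical. For this simple setting the linked article gives a stable range of 0 < lr * curvature < 2 + 2 * beta, so at least on a quadratic the convergence/divergence boundary is predictable. Real loss surfaces are not this tidy, which is presumably why the rate still gets tuned by hand.

```python
def momentum_descent(lr, beta=0.9, curvature=1.0, steps=200, w0=1.0):
    """Run momentum on f(w) = 0.5 * curvature * w**2 and return the final |w|.

    All names and defaults here are illustrative assumptions.
    """
    w, z = w0, 0.0
    for _ in range(steps):
        grad = curvature * w   # gradient of the toy quadratic
        z = beta * z + grad    # velocity: decayed sum of past gradients
        w = w - lr * z         # step along the velocity
    return abs(w)


# With beta = 0.9 and curvature = 1.0 the predicted boundary is lr = 3.8.
for lr in (0.1, 1.0, 3.7, 3.9):
    print(f"lr={lr}: final |w| = {momentum_descent(lr):.3e}")
```

Rates just below 3.8 shrink |w| toward zero and rates just above it blow up, so the "empirical" threshold matches the formula in this toy case.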