Gradient descent is the workhorse behind many machine learning algorithms. It’s easy to memorise the update rule, but understanding how it actually behaves is far more valuable than knowing the formula.
At its core, gradient descent is a search strategy (a minimal sketch in code follows this list):
- Compute how the loss changes with respect to each parameter.
- Move in the direction that reduces the loss the most (the negative gradient).
- Repeat until changes become small or performance plateaus.
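To make that loop concrete, here is a minimal sketch of plain gradient descent on a toy quadratic loss. The function, learning rate, and stopping tolerance are illustrative choices, not the exact setup from my experiments.

```python
import numpy as np

def gradient_descent(grad_fn, w0, lr=0.1, tol=1e-6, max_steps=1000):
    """Minimal gradient descent loop: measure the gradient, step against it, repeat."""
    w = np.asarray(w0, dtype=float)
    for _ in range(max_steps):
        g = grad_fn(w)                        # how the loss changes w.r.t. each parameter
        w_new = w - lr * g                    # move along the negative gradient
        if np.linalg.norm(w_new - w) < tol:   # stop once the updates become tiny
            return w_new
        w = w_new
    return w

# Toy example: minimise f(w) = ||w - 3||^2, whose gradient is 2 * (w - 3).
w_star = gradient_descent(lambda w: 2 * (w - 3), w0=[0.0, 0.0])
print(w_star)  # close to [3.0, 3.0]
```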
During my MSc, I implemented several variants from scratch (a sketch follows this list):
- Batch Gradient Descent: Uses the full dataset each step; stable but slow.
- Stochastic Gradient Descent (SGD): Updates per sample; noisy but often finds minima that generalise better.
- Mini-batch Gradient Descent: A balance between the two and the de facto standard in deep learning.
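The three variants differ only in how much data feeds each gradient estimate. The sketch below illustrates this on a mean-squared-error linear regression; it is a hypothetical example, not my original MSc code, and setting `batch_size` to the dataset size or to 1 recovers the batch and stochastic cases respectively.

```python
import numpy as np

def minibatch_gd(X, y, w, lr=0.01, batch_size=32, epochs=10):
    """Mini-batch gradient descent on MSE linear regression.
    batch_size=len(X) gives batch GD; batch_size=1 gives SGD."""
    n = len(X)
    for _ in range(epochs):
        idx = np.random.permutation(n)                      # reshuffle each epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            grad = 2 * Xb.T @ (Xb @ w - yb) / len(batch)    # MSE gradient on the batch
            w = w - lr * grad
    return w
```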
I experimented with different learning rates and saw firsthand how:
- Too large a learning rate leads to divergence or oscillation.
- Too small a learning rate leads to painfully slow convergence.
- Schedules (step decay, cosine annealing, etc.) can significantly improve training; a sketch of two common schedules follows this list.
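These two schedule functions are only an illustration of the idea; the base rate, decay factor, and horizon are arbitrary example values.

```python
import math

def step_decay(step, base_lr=0.1, drop=0.5, every=30):
    """Step decay: halve the learning rate every `every` steps."""
    return base_lr * (drop ** (step // every))

def cosine_schedule(step, base_lr=0.1, total_steps=100):
    """Cosine annealing: decay smoothly from base_lr to 0 over total_steps."""
    return 0.5 * base_lr * (1 + math.cos(math.pi * min(step, total_steps) / total_steps))
```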
These experiments helped me see why optimisers like Adam, RMSProp, and momentum methods exist: they adapt the step size and direction using history, making training more robust to poorly scaled gradients.
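As a sketch of how that history enters the update, here is the classic momentum rule alongside a bare-bones Adam step. The hyperparameter defaults follow the commonly cited Adam values (β1 = 0.9, β2 = 0.999); the code is illustrative rather than a production optimiser.

```python
import numpy as np

def momentum_step(w, g, v, lr=0.01, beta=0.9):
    """Momentum: accumulate a velocity from past gradients and step along it."""
    v = beta * v + g
    return w - lr * v, v

def adam_step(w, g, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """Adam: per-parameter step sizes from running means of the gradient and its square.
    t is the 1-based step count, used for bias correction."""
    m = b1 * m + (1 - b1) * g            # first moment: direction history
    v = b2 * v + (1 - b2) * g**2         # second moment: scale history
    m_hat = m / (1 - b1**t)              # correct the bias from zero initialisation
    v_hat = v / (1 - b2**t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```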
For someone coming from a systems and engineering background, gradient descent feels like an iterative feedback loop: measure, adjust, repeat. That mental model has helped me reason about both ML training and optimisation problems in other domains.