What is Gradient Descent?
Gradient descent is the fundamental optimization algorithm used to train ML models. It iteratively adjusts model parameters in the direction that reduces the loss function, guided by the gradient (slope) of the loss with respect to each parameter.
Gradient descent is the engine that powers almost all neural network training. The idea is simple: compute how the loss changes with respect to each parameter (the gradient), then adjust parameters in the opposite direction to reduce the loss. This process repeats for many iterations until the model converges.
Three main variants differ in how much data they use per update. Batch gradient descent computes gradients on the entire dataset, giving accurate but expensive updates. Stochastic gradient descent (SGD) uses a single example per update, introducing noise but enabling faster iteration. Mini-batch gradient descent, the most common approach, uses a small batch of examples, balancing gradient accuracy with computational efficiency.
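The three variants can be sketched as a single training loop parameterized by batch size. This is a minimal illustration on a one-parameter linear model; the function and variable names are illustrative, not from any particular library.

```python
import random

# Sketch of the three gradient descent variants on a 1-D linear model w*x,
# minimizing mean squared error. `data` is a list of (x, y) pairs.

def grad_mse(w, batch):
    # d/dw of mean((w*x - y)^2) over the batch
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def train(w, data, lr=0.01, epochs=200, batch_size=None):
    n = len(data)
    bs = batch_size or n            # None -> full-batch gradient descent
    for _ in range(epochs):
        random.shuffle(data)        # SGD / mini-batch rely on a random order
        for i in range(0, n, bs):   # bs=1 -> SGD; 1 < bs < n -> mini-batch
            w -= lr * grad_mse(w, data[i:i + bs])
    return w
```

On data generated from y = 2x, all three settings converge toward w ≈ 2; the smaller the batch, the noisier the path to get there.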
Advanced optimizers build on gradient descent with momentum, adaptive learning rates, or both. SGD with momentum accumulates past gradients to smooth updates and accelerate convergence. Adam (Adaptive Moment Estimation) maintains per-parameter adaptive learning rates based on first and second moment estimates of gradients. AdamW adds decoupled weight decay for better regularization. These optimizers significantly improve convergence speed and final model quality.
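The Adam update for a single parameter can be sketched as follows. This follows the standard first/second-moment formulation with the commonly cited hyperparameter defaults; the function name is illustrative.

```python
import math

# Sketch of one Adam update step for a single scalar parameter.

def adam_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad          # first moment: running mean of gradients
    v = b2 * v + (1 - b2) * grad ** 2     # second moment: running mean of squared gradients
    m_hat = m / (1 - b1 ** t)             # bias correction (t starts at 1)
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v
```

Dividing by the square root of the second moment gives each parameter its own effective step size, which is why Adam tolerates a single global learning rate across parameters with very different gradient scales.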
The learning rate is the most critical hyperparameter: too high causes divergence; too low causes slow convergence. Learning rate schedules (cosine annealing, warm-up, step decay) adjust the rate during training. Learning rate warm-up, where training starts with a very small rate that gradually increases, has become standard for training large Transformers.
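A common combination of the schedules above is linear warm-up followed by cosine annealing. The shape below is standard; the function and parameter names are illustrative.

```python
import math

# Sketch of a learning-rate schedule: linear warm-up, then cosine decay to 0.

def lr_at(step, total_steps, base_lr=1e-3, warmup_steps=100):
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps             # linear warm-up
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))  # cosine annealing
```

The rate ramps from near zero up to `base_lr` over the warm-up steps, then decays smoothly toward zero by the end of training.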
How Gradient Descent Works
At each step, the gradient of the loss function with respect to model parameters is computed (using backpropagation in neural networks). Parameters are then updated by subtracting the gradient multiplied by the learning rate. This moves parameters in the direction that reduces the loss.
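The update rule above can be shown in a few lines. Here the gradient is computed analytically for a simple quadratic loss rather than via backpropagation; all names are illustrative.

```python
# Minimal gradient descent sketch: minimize f(w) = (w - 3)^2,
# whose gradient is f'(w) = 2 * (w - 3).

def gradient(w):
    return 2.0 * (w - 3.0)

def gradient_descent(w0, lr=0.1, steps=100):
    w = w0
    for _ in range(steps):
        w -= lr * gradient(w)  # step in the direction opposite the gradient
    return w
```

Starting from w = 0, the iterates move toward the minimum at w = 3, with each step shrinking the remaining error by a constant factor.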
Career Relevance
Gradient descent and its variants are the most fundamental optimization concepts in ML. Every practitioner must understand how different optimizers work, how to set learning rates, and how to diagnose training issues. This is a core interview topic.
Frequently Asked Questions
What is the difference between SGD and Adam?
SGD uses a fixed learning rate for all parameters, optionally with momentum. Adam adapts the learning rate for each parameter based on gradient history. Adam converges faster initially but SGD with momentum sometimes achieves better final performance, especially for large models.
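The momentum term mentioned above can be sketched in the same single-parameter style as the Adam example: a velocity variable accumulates past gradients, smoothing the update direction. Names and defaults are illustrative.

```python
# Sketch of one SGD-with-momentum update step for a single scalar parameter.

def momentum_step(w, grad, velocity, lr=0.01, beta=0.9):
    velocity = beta * velocity + grad  # accumulate gradient history
    w = w - lr * velocity              # step along the smoothed direction
    return w, velocity
```

Unlike Adam, every parameter shares the same learning rate here; only the momentum history differs per parameter.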
How do I choose the right learning rate?
Start with common defaults (1e-3 for Adam, 1e-1 for SGD). Use learning rate finders to sweep rates and find a good starting point. Apply schedules like cosine annealing. Monitor training loss for divergence (rate too high) or stagnation (rate too low).
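A learning-rate sweep in its simplest form: try several rates on the same problem and keep the one with the lowest final loss. This toy version uses the quadratic f(w) = (w - 3)^2 as the objective; real learning rate finders plot loss against rate over a single pass of real data.

```python
# Hypothetical learning-rate sweep on f(w) = (w - 3)^2.

def loss_after(lr, steps=20, w=0.0):
    for _ in range(steps):
        w -= lr * 2 * (w - 3.0)   # gradient descent step
    return (w - 3.0) ** 2         # loss remaining after `steps` updates

rates = [1e-3, 1e-2, 1e-1, 1.0]
best = min(rates, key=loss_after)
```

The sweep exposes both failure modes from the answer above: 1.0 oscillates without progress (too high), while 1e-3 barely moves (too low).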
Is gradient descent understanding required for AI jobs?
Absolutely. It is the most fundamental optimization concept in ML, tested in virtually every technical interview. Understanding optimizers, learning rates, and convergence behavior is essential for any role involving model training.
Related Terms
- Backpropagation
Backpropagation is the algorithm used to compute gradients of a loss function with respect to each weight in a neural network. It enables efficient training by propagating error signals backward through the network layers.
- Loss Function
A loss function (or cost function) measures how far a model's predictions are from the true values. It provides the signal that guides model training through gradient descent, making its design one of the most important decisions in ML.
- Hyperparameter Tuning
Hyperparameter tuning is the process of finding optimal configuration settings for ML models that are set before training begins. Unlike model parameters learned from data, hyperparameters like learning rate, batch size, and network depth must be chosen by the practitioner.
- Deep Learning
Deep learning is a subset of machine learning that uses neural networks with multiple layers to learn hierarchical representations of data. It has driven breakthroughs in computer vision, natural language processing, speech recognition, and generative AI.