Gradient Descent
Last updated June 11, 2026
What is Gradient Descent in simple terms?
In simple terms, gradient descent is how AI learns by taking small steps toward fewer mistakes. Picture finding the bottom of a foggy valley by always stepping downhill — each step lowers the error a little.
What is Gradient Descent?
Gradient descent is the optimization method most AI models use to learn: they repeatedly nudge their internal settings in the direction that reduces error, taking small steps downhill until the error stops improving.
When a model trains, it starts out useless — its internal settings are essentially random, so its predictions are wrong. Learning means adjusting those settings, called weights, until the predictions get good. Gradient descent is the method that does the adjusting. The model measures how wrong it currently is using a score called the loss (lower is better), and gradient descent works out, for each weight, which direction to nudge it to make the loss a little smaller. Then it takes a small step in that direction for every weight at once. Do this over and over — across enormous amounts of data and millions or billions of weights — and the model gradually descends toward settings that make few mistakes.
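The loop described above can be sketched in a few lines. This is a minimal illustration, not how production training code is written: the one-weight model, the toy data (chosen so the ideal weight is 3), and the names `loss`, `gradient`, and `train` are all assumptions made for the example.

```python
# Toy data: the "true" relationship is y = 3 * x (an assumption for this sketch).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]

def loss(w):
    """Mean squared error of the one-weight model y_hat = w * x (lower is better)."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gradient(w):
    """Derivative of the loss with respect to the weight w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def train(w=0.0, learning_rate=0.01, steps=200):
    """Repeatedly take a small step against the gradient, i.e. downhill on the loss."""
    for _ in range(steps):
        w -= learning_rate * gradient(w)
    return w

w_final = train()
print(w_final, loss(w_final))
```

Starting from a weight of 0, the loop moves the weight toward 3 and the loss toward 0. A real model does the same thing for millions or billions of weights at once.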
The standard way to picture it is a landscape. Imagine the model's error as the height of terrain: high ground is lots of mistakes, low ground is few, and the goal is to reach a valley bottom. You're standing somewhere on this landscape in fog so thick you can only feel the slope right under your feet. Gradient descent is the strategy of always stepping in the steepest downhill direction. The "gradient" is the mathematical word for the slope: strictly, it points in the steepest uphill direction and its size says how steep the ground is, so gradient descent steps the opposite way. The size of each step matters: too big and you overshoot the valley and bounce around; too small and learning crawls. That step size is one of the key dials, called the learning rate, that practitioners tune. In practice, models use a fast variant, commonly called stochastic or mini-batch gradient descent, that estimates the slope from small batches of data at a time rather than all of it at once.
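The effect of the step size can be seen on the simplest possible landscape. The sketch below assumes a single weight and a bowl-shaped loss, `w**2`, whose lowest point is at 0; the function name `descend` and the three learning rates are illustrative choices, not recommended settings.

```python
def descend(learning_rate, w=1.0, steps=50):
    """Run gradient descent on loss(w) = w**2, whose gradient is 2 * w."""
    for _ in range(steps):
        w -= learning_rate * 2 * w
    return w

too_small = descend(0.001)   # crawls: still close to the starting point
reasonable = descend(0.1)    # settles very near the bottom at 0
too_big = descend(1.1)       # overshoots further on every step and diverges

print(too_small, reasonable, too_big)
```

With the same number of steps, the tiny learning rate barely moves, the moderate one reaches the bottom, and the oversized one bounces across the valley with growing swings.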
Gradient descent is the engine under the hood of nearly all modern machine learning: deep learning, large language models, and image generators all learn essentially this way, paired with backpropagation, the technique that efficiently calculates which way is downhill for every weight. It isn't flawless. Across the millions of dimensions of a real model, the foggy-valley path can get stuck on flat stretches where no direction feels clearly downhill, or settle into a dip that isn't the deepest one available, and a poorly chosen step size can stall or destabilize training. But it scales remarkably well to huge models, and decades of refinements have made it reliable enough that this one simple idea, keep stepping toward less error, underpins the whole field.
Real-world example of Gradient Descent
Picture tuning an old radio with a single dial to land on a station, but you're blindfolded and can only hear how much static there is. You turn the dial a touch and the static drops, so you keep turning that way; it starts rising again, so you back off slightly — homing in on the clearest signal by always moving in whatever direction reduces the noise. Gradient descent does the same thing, except instead of one dial there are millions of them (the model's weights) and instead of static it's the model's error. Each training step turns every dial a little in the direction that lowers the error, and after enough steps the model lands on its clearest signal.
Frequently asked questions about Gradient Descent
What is the difference between gradient descent and backpropagation?
They're partners in training but do different jobs. Backpropagation is the method that efficiently calculates, for every weight in the network, which direction and how much it should change to reduce the error. Gradient descent is what then uses those calculations to actually update the weights — taking a small step in the downhill direction. In short, backpropagation works out which way is downhill for each weight, and gradient descent takes the step. Together they form the core learning loop repeated millions of times during training.
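The division of labor can be made concrete with a deliberately tiny network. Everything in this sketch is an assumption for illustration: a two-weight model `prediction = w2 * tanh(w1 * x)`, a squared-error loss, a single training example, and the function names. `backprop` only computes gradients using the chain rule; `gradient_descent_step` only applies them.

```python
import math

def forward(w1, w2, x):
    """Tiny two-weight network: prediction = w2 * tanh(w1 * x)."""
    hidden = math.tanh(w1 * x)
    return hidden, w2 * hidden

def squared_error(w1, w2, x, y):
    return (forward(w1, w2, x)[1] - y) ** 2

def backprop(w1, w2, x, y):
    """Chain rule, applied from the loss back toward the input."""
    hidden, prediction = forward(w1, w2, x)
    d_prediction = 2 * (prediction - y)          # d loss / d prediction
    grad_w2 = d_prediction * hidden              # d loss / d w2
    d_hidden = d_prediction * w2                 # d loss / d hidden
    grad_w1 = d_hidden * (1 - hidden ** 2) * x   # tanh'(z) = 1 - tanh(z)**2
    return grad_w1, grad_w2

def gradient_descent_step(w1, w2, grads, learning_rate=0.1):
    """Use the gradients from backprop to move each weight downhill."""
    grad_w1, grad_w2 = grads
    return w1 - learning_rate * grad_w1, w2 - learning_rate * grad_w2

w1, w2 = 0.5, 0.5
x, y = 1.0, 0.8  # one hypothetical training example
loss_before = squared_error(w1, w2, x, y)
for _ in range(100):
    w1, w2 = gradient_descent_step(w1, w2, backprop(w1, w2, x, y))
loss_after = squared_error(w1, w2, x, y)
print(loss_before, loss_after)
```

Real frameworks automate the `backprop` part for networks of any size, but the loop has the same two halves: compute gradients, then step.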
How does gradient descent work?
The model measures how wrong it is with a loss score, then computes the gradient: for each of its weights, how the loss would change if that weight changed, which reveals the direction that would reduce the loss. It nudges every weight a small amount in its downhill direction, lowering the error slightly, and repeats this across huge amounts of data. Over many iterations it descends toward settings with low error, like feeling your way to the bottom of a foggy valley by always stepping downhill. The step size, called the learning rate, controls how big each move is and has to be tuned carefully.
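Written as a formula, one step of this process is usually expressed as the update rule below. The symbols are notation introduced here rather than taken from the text above: $w_t$ stands for the weights after step $t$, $\eta$ for the learning rate, and $\nabla L(w_t)$ for the gradient of the loss at the current weights.

```latex
w_{t+1} = w_t - \eta \, \nabla L(w_t)
```

As a small worked case under assumed numbers: with a single weight, a loss of $L(w) = w^2$ (so the gradient is $2w$), a starting weight of $1.0$, and $\eta = 0.1$, one step gives $1.0 - 0.1 \times 2.0 = 0.8$, a little closer to the minimum at $0$.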
What is gradient descent used for?
It's the optimization method behind training almost all modern AI: neural networks, deep learning systems, large language models, and image generators all learn by some form of gradient descent. Any time a model improves by reducing its mistakes over a dataset, gradient descent — usually a fast variant that works on small batches of data at a time — is the process doing the improving. It's one of the most fundamental algorithms in machine learning, the engine that turns raw data and a starting point into a trained model.
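The batch-at-a-time variant mentioned above is commonly called mini-batch stochastic gradient descent. Here is a rough sketch of the idea, assuming a made-up noise-free dataset that follows y = 2x + 1 and a two-parameter model; the names `batch_gradient` and `sgd`, the batch size, and the learning rate are illustrative choices only.

```python
import random

# Hypothetical toy dataset: y = 2 * x + 1 with no noise, x in [0, 1).
DATA = [(i / 100.0, 2.0 * (i / 100.0) + 1.0) for i in range(100)]

def batch_gradient(w, b, batch):
    """Gradient of mean squared error over one mini-batch for y_hat = w * x + b."""
    n = len(batch)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in batch) / n
    return grad_w, grad_b

def sgd(data, epochs=100, batch_size=10, learning_rate=0.1, seed=0):
    """Estimate the slope from small shuffled batches instead of the full dataset."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    examples = list(data)      # copy, so the caller's list is not reordered
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(examples)
        for start in range(0, len(examples), batch_size):
            batch = examples[start:start + batch_size]
            grad_w, grad_b = batch_gradient(w, b, batch)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b

w, b = sgd(DATA)
print(w, b)
```

Each update looks at only 10 examples, so it is cheap and slightly noisy, yet the weights still drift toward the values that fit all the data (here, close to 2 and 1).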