Stochastic Gradient Descent (SGD)

Advanced · Deep Learning

Last updated June 14, 2026

What is Stochastic Gradient Descent in simple terms?

In simple terms, stochastic gradient descent improves a model in lots of quick, rough steps instead of a few slow, careful ones. It is like finding the bottom of a foggy hill by checking the slope often rather than surveying the whole hill first.

What is Stochastic Gradient Descent?

Stochastic gradient descent (SGD) is a method for training machine learning models that updates the model's settings using the error from one small, randomly chosen batch of data at a time, rather than the whole dataset at once, making training far faster and more practical at large scale.

Training a model means searching for the settings that make its errors smallest — and the standard way to search is gradient descent: at each step, work out which direction reduces the error, take a small step that way, repeat. The catch is the cost of working out that direction. The thorough way is to check the model's error across *every* example in the dataset before taking a single step. With millions of examples, each step becomes enormously expensive, so the model improves agonizingly slowly. Stochastic gradient descent (SGD) is the practical fix. Instead of consulting the whole dataset for each step, it estimates the downhill direction from just one small, randomly drawn batch of examples, takes a step, then draws another batch and steps again.
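In symbols, the two approaches can be written as the update rules below. The notation is an assumption made for illustration and does not come from the text above: $\theta_t$ stands for the model's settings at step $t$, $\eta$ for the step size, $\ell_i$ for the error on example $i$, $N$ for the dataset size, and $B_t$ for the small random batch drawn at step $t$.

```latex
% Full-dataset (batch) gradient descent: every step averages over all N examples
\theta_{t+1} = \theta_t - \eta \, \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta \ell_i(\theta_t)

% Stochastic (mini-batch) gradient descent: every step averages over a small random batch B_t
\theta_{t+1} = \theta_t - \eta \, \frac{1}{|B_t|} \sum_{i \in B_t} \nabla_\theta \ell_i(\theta_t)
```

The two rules differ only in the set being averaged over, and the cost of a step grows with the size of that set.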

"Stochastic" simply means "involving randomness" — here, the randomness of which small batch you happen to look at each time. Each step is therefore a rougher, noisier estimate of the true downhill direction than the full-dataset version would give. That sounds like a weakness, and step-for-step it is: any single step might wobble slightly off the ideal path. But it's overwhelmingly worth it, because you can take hundreds or thousands of cheap, quick steps in the time one thorough step would cost — and all those quick steps, averaged out, still march reliably toward lower error. It's the difference between carefully surveying an entire foggy valley before each move versus just feeling the slope under your feet every few paces and pressing on. The second gets you down far faster, even if your path zig-zags a little.

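The claim that noisy steps still point the right way on average can be checked on a toy problem. The sketch below is illustrative only: the one-parameter model, the synthetic data, and names such as `gradient` and `TRUE_W` are assumptions, not part of any particular library. It compares the full-dataset gradient with many estimates that each use a random batch of 10 examples.

```python
import random

random.seed(0)

# Hypothetical toy problem: fit a single slope w so that w * x matches y.
TRUE_W = 3.0
xs = [random.uniform(1.0, 2.0) for _ in range(1000)]
data = [(x, TRUE_W * x) for x in xs]


def gradient(w, examples):
    """Average gradient of the squared error 0.5 * (w*x - y)**2 over `examples`."""
    return sum((w * x - y) * x for x, y in examples) / len(examples)


w = 0.0  # current (deliberately wrong) guess for the slope
full_grad = gradient(w, data)  # the "thorough" direction: uses all 1000 examples

# 2000 cheap estimates, each computed from a random batch of 10 examples.
batch_grads = [gradient(w, random.sample(data, 10)) for _ in range(2000)]
mean_batch_grad = sum(batch_grads) / len(batch_grads)

print(f"full-dataset gradient:      {full_grad:.3f}")
print(f"one batch estimate:         {batch_grads[0]:.3f}")
print(f"range of batch estimates:   {min(batch_grads):.3f} to {max(batch_grads):.3f}")
print(f"average of batch estimates: {mean_batch_grad:.3f}")
```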
There's an unexpected bonus to the noise. Because each step is a slightly random nudge, SGD tends *not* to get permanently stuck in shallow dips that aren't truly the best spot — the jitter can knock it out and let it keep descending, where a perfectly smooth descent might have settled too early. This is part of why SGD and its refined variants are the standard engines for training neural networks, including the very large ones behind modern AI. The honest caveats: the batch size and the step size (the learning rate) need tuning, because too much noise or too large a step can stop the search from settling; and like all optimization, SGD only minimizes the error measure you give it, which had better reflect what you actually want.
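The effect of an oversized step can be seen in a small experiment. This is a minimal sketch under stated assumptions: a one-parameter model, noise-free synthetic data, and a hypothetical helper `sgd`. The learning rates 0.1 and 3.0 are picked only to make the contrast obvious and are not recommendations.

```python
import random

random.seed(0)

# Hypothetical toy data: y = 3x exactly, with x between 1 and 2.
TRUE_W = 3.0
data = [(x, TRUE_W * x) for x in (random.uniform(1.0, 2.0) for _ in range(1000))]


def sgd(learning_rate, steps, batch_size=10):
    """Run mini-batch SGD on the loss 0.5 * (w*x - y)**2 and return the final w."""
    w = 0.0
    for _ in range(steps):
        batch = random.sample(data, batch_size)
        grad = sum((w * x - y) * x for x, y in batch) / batch_size
        w -= learning_rate * grad
    return w


w_small = sgd(learning_rate=0.1, steps=100)  # modest step: settles near 3
w_large = sgd(learning_rate=3.0, steps=20)   # oversized step: overshoots further each time

print(f"learning rate 0.1 -> w = {w_small:.4f}")
print(f"learning rate 3.0 -> w = {w_large:.3e}")
```

With the larger rate, every update jumps past the target by more than the previous error, so the estimate runs away instead of settling.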

Real-world example of Stochastic Gradient Descent

A team is training an image model on ten million photos. Doing it the thorough way — having the model look at all ten million photos to compute one improvement step — would mean waiting hours for each single nudge, and thousands of nudges are needed. Hopeless. With stochastic gradient descent, the model instead grabs a random handful of photos (say 64), computes a quick, rough sense of how to improve from just those, adjusts itself, then grabs the next random 64 and repeats. Each step is based on a tiny, slightly noisy sample, so the path it takes wobbles — but it fires off updates constantly instead of stalling, and over millions of these cheap steps the model steadily sharpens. The team gets a trained model in a reasonable time precisely because SGD trades perfect steps for fast ones.
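The arithmetic behind this example is short enough to spell out. The sketch assumes each photo is used exactly once per pass over the dataset, which is a common setup but is not stated in the example itself.

```python
import math

dataset_size = 10_000_000  # photos, from the example above
batch_size = 64            # photos per SGD step, from the example above

# Updates the model receives in one full pass over the photos.
full_batch_updates_per_pass = 1
sgd_updates_per_pass = math.ceil(dataset_size / batch_size)

print(sgd_updates_per_pass)  # 156250
```

For the same amount of data read, the full-dataset approach adjusts the model once, while SGD adjusts it 156,250 times.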

Frequently asked questions about Stochastic Gradient Descent

What is the difference between stochastic gradient descent and (batch) gradient descent?

Both search for the model settings that minimize error by stepping downhill, but they differ in how much data each step uses. Plain (batch) gradient descent computes each step from the *entire* dataset: accurate per step, but painfully slow at large scale. Stochastic gradient descent computes each step from just one small, random batch, so each step is rougher and noisier, but you can take vastly more of them in the same time. (Strictly speaking, "stochastic" gradient descent uses a single random example per step and the small-batch form is called mini-batch gradient descent, but everyday usage of "SGD" usually covers the mini-batch version.) In practice SGD wins for large datasets because many fast, approximate steps beat a few slow, exact ones, and its built-in noise even helps it avoid getting stuck in poor spots.
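A rough way to see the trade-off is to give both methods the same compute budget and compare where they end up. The sketch below is an illustration under assumptions, not a benchmark: the toy one-parameter model, the synthetic data with a little label noise, the budget of five passes, and the helper names `gradient` and `train` are all hypothetical.

```python
import random

random.seed(0)

TRUE_W = 3.0
# Hypothetical data: y = 3x plus a little bounded label noise.
data = [(x, TRUE_W * x + random.uniform(-0.05, 0.05))
        for x in (random.uniform(1.0, 2.0) for _ in range(1000))]

LEARNING_RATE = 0.1
BUDGET = 5 * len(data)  # total per-example gradient evaluations allowed


def gradient(w, examples):
    """Average gradient of 0.5 * (w*x - y)**2 over `examples`."""
    return sum((w * x - y) * x for x, y in examples) / len(examples)


def train(batch_size):
    """Spend the whole budget on steps that each use `batch_size` examples."""
    w = 0.0
    steps = BUDGET // batch_size
    for _ in range(steps):
        batch = data if batch_size == len(data) else random.sample(data, batch_size)
        w -= LEARNING_RATE * gradient(w, batch)
    return w, steps


w_gd, steps_gd = train(batch_size=len(data))  # batch gradient descent
w_sgd, steps_sgd = train(batch_size=10)       # mini-batch SGD

print(f"batch GD: {steps_gd} steps, |w - 3| = {abs(w_gd - TRUE_W):.4f}")
print(f"SGD:      {steps_sgd} steps, |w - 3| = {abs(w_sgd - TRUE_W):.4f}")
```

With the same number of examples processed, batch gradient descent gets 5 precise steps while SGD gets 500 rough ones, and on this toy problem the rough steps end much closer to the true slope.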

How does stochastic gradient descent work?

It repeatedly improves the model using small random samples. Each round, it draws a small batch of training examples, measures how wrong the model is on just that batch, estimates which direction would reduce that error, and nudges the model's settings a small step in that direction. Then it draws a fresh random batch and repeats, cycling through the data many times. Because each step uses only a sample, it's quick but noisy, yet across thousands of steps the noise averages out and the model reliably descends toward lower error. The step size (learning rate) and batch size are key tuning choices.
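The loop described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the two-parameter linear model, the synthetic data, and the chosen learning rate, batch size, and number of passes are all assumptions. Batches are drawn by shuffling the data at the start of each pass and slicing it, which is one common way to get random batches.

```python
import random

random.seed(0)

# Hypothetical data: y = 2x + 1, with x between -1 and 1.
xs = [random.uniform(-1.0, 1.0) for _ in range(500)]
data = [(x, 2.0 * x + 1.0) for x in xs]

w, b = 0.0, 0.0        # the model's settings
learning_rate = 0.1    # step size
batch_size = 20
epochs = 50            # full passes over the data

for epoch in range(epochs):
    random.shuffle(data)                          # fresh random order each pass
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]    # 1. draw a small batch
        grad_w = grad_b = 0.0
        for x, y in batch:
            error = (w * x + b) - y               # 2. how wrong is the model here?
            grad_w += error * x                   # 3. direction that reduces the error
            grad_b += error
        w -= learning_rate * grad_w / len(batch)  # 4. small step in that direction
        b -= learning_rate * grad_b / len(batch)

print(f"w = {w:.3f}, b = {b:.3f}")
```

On this toy data the printed values should land close to the slope 2 and intercept 1 used to generate it.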

What is stochastic gradient descent used for?

It's the standard method for training machine learning models on large datasets — most importantly, it (and its refined variants) is how essentially all modern neural networks are trained, including the large models behind today's AI. Without it, training on the huge datasets these models need would be far too slow to be practical. Anywhere a model must learn from more data than could be processed all at once per step, SGD's approach of learning from quick random samples is the workhorse that makes training feasible.