Reinforcement Learning (RL)
Reinforcement learning (RL) is a type of machine learning in which a system learns by trial and error, taking actions and adjusting its behavior based on rewards or penalties it receives, rather than being shown the correct answers in advance.
What is Reinforcement Learning (RL)?
Most machine learning learns from an answer key. You hand the system thousands of examples already labeled with the right answer — this email is spam, this photo is a dog — and it learns to match them. Reinforcement learning works without that answer key. Instead, the system (usually called the agent) is dropped into a situation, allowed to act, and given a score afterward: a reward when things go well, a penalty when they go badly. Each round is the same loop — the agent looks at the current situation, picks an action, and gets back both a new situation and a score — repeated over and over. Nobody tells it the single correct move at each step. It has to discover, through many rounds of trying, which sequences of actions tend to lead to higher rewards and which lead to trouble, building up a strategy — what researchers call a policy — for choosing actions that pay off in the long run.
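The observe, act, score loop described above can be sketched in a few lines of code. This is a minimal illustration, not a reference implementation: the `Corridor` environment, the `train` function, and all the numeric settings are hypothetical, and tabular Q-learning is just one standard way to learn a policy from rewards.

```python
import random

class Corridor:
    """Hypothetical toy environment: cells 0..4, with the goal at the right end."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Action 0 moves left, action 1 moves right; walls clamp the position.
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else -0.01  # small penalty per step, reward at the goal
        return self.state, reward, done


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    env = Corridor()
    # q[state][action] is the agent's estimate of long-run reward.
    q = [[0.0, 0.0] for _ in range(env.length)]
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Look at the situation and pick an action (occasionally at random).
            if rng.random() < epsilon:
                action = rng.randrange(2)
            else:
                best = max(q[state])
                action = rng.choice([a for a in (0, 1) if q[state][a] == best])
            # Get back a new situation and a score.
            next_state, reward, done = env.step(action)
            # Nudge the estimate toward reward now plus discounted reward later.
            future = 0.0 if done else gamma * max(q[next_state])
            q[state][action] += alpha * (reward + future - q[state][action])
            state = next_state
    return q


q_table = train()
# The policy: in each non-goal cell, choose the action with the higher estimate.
policy = [max((0, 1), key=lambda a: q_table[s][a]) for s in range(4)]
print(policy)
```

Nothing in the code tells the agent that "right" is correct. The learned policy ends up choosing action 1 (right) in every cell only because that is what the rewards favor over many episodes.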
The interesting part is that the best move right now is not always the one with the biggest immediate reward. A good RL system learns to think ahead, sometimes accepting a small loss early to set up a much bigger payoff later, which is what makes reinforcement learning well suited to problems that unfold over a series of decisions rather than a single yes-or-no call. A related central challenge is the trade-off between exploitation (cashing in on what already works) and exploration (trying untested options that might turn out better): an agent that only exploits may never discover a better strategy, while one that only explores never benefits from what it has learned.
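The exploration and exploitation trade-off can be seen in a toy "slot machine" problem. The sketch below is illustrative only: the `PAYOFFS` values and the `run_bandit` function are hypothetical, the payoffs are fixed rather than noisy to keep the example easy to follow, and epsilon-greedy is just one simple way to mix the two behaviors.

```python
import random

# Hypothetical machines with fixed payoffs. The agent does not know these values.
PAYOFFS = [1.0, 5.0, 2.0]

def run_bandit(epsilon, steps=1000, seed=0):
    rng = random.Random(seed)
    estimates = [0.0] * len(PAYOFFS)  # running average reward seen per machine
    pulls = [0] * len(PAYOFFS)
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(len(PAYOFFS))      # explore: try a random machine
        else:
            arm = estimates.index(max(estimates))  # exploit: best so far (ties go to the first)
        reward = PAYOFFS[arm]
        pulls[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / pulls[arm]
        total += reward
    return total, pulls

# Pure exploitation: never tries anything new.
greedy_total, greedy_pulls = run_bandit(epsilon=0.0)
# Mostly exploitation, with a random try 10% of the time.
explorer_total, explorer_pulls = run_bandit(epsilon=0.1)
print(greedy_total, greedy_pulls)
print(explorer_total, explorer_pulls)
```

The purely greedy agent settles on the first machine it tries and never learns that another pays five times as much. The agent that spends a small share of its pulls exploring gives up a little reward on those pulls but finds the better machine and comes out well ahead.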
Reinforcement learning has a reputation for being finicky and expensive — it often needs enormous numbers of attempts to learn anything, which is why much of it happens in simulations where an agent can practice millions of times safely and cheaply before touching the real world. But when it works, it can find strategies that surprise even the people who built it. It is also a key ingredient in the AI assistants many people now use daily: a method called reinforcement learning from human feedback (RLHF) turns human preferences into the reward. People compare the AI's responses to show which they favor, a separate reward model learns to predict those preferences, and the system is then trained to produce more of what scores well. That technique is a big part of why today's chatbots feel more cooperative than the raw language models underneath them.
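The step where a reward model learns from comparisons can be sketched as follows. This is a deliberately tiny illustration under stated assumptions: the response labels and comparison data are made up, the "reward model" is just one number per response rather than a neural network that reads text, and the Bradley-Terry-style formula (the chance that one response is preferred grows with the gap between the two scores) is a common choice, not something the description above specifies.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical human judgments, each written as (preferred, rejected).
comparisons = [
    ("polite", "curt"),
    ("polite", "rude"),
    ("curt", "rude"),
    ("polite", "curt"),
    ("curt", "rude"),
]

def fit_reward_scores(comparisons, steps=200, lr=0.1):
    # Start with every response scored equally.
    scores = {name: 0.0 for pair in comparisons for name in pair}
    for _ in range(steps):
        for winner, loser in comparisons:
            # Model's current probability that the human picks the winner.
            p = sigmoid(scores[winner] - scores[loser])
            # Gradient step on the log-likelihood of the observed preference.
            scores[winner] += lr * (1.0 - p)
            scores[loser] -= lr * (1.0 - p)
    return scores

scores = fit_reward_scores(comparisons)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Once scores like these exist, they can play the role of the reward in the usual reinforcement learning loop: the system is nudged toward producing responses that the reward model scores highly.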
Real-world example
Picture a busy city intersection where an AI system controls the traffic lights. Nobody hands it a rulebook for the perfect light timing — traffic is far too unpredictable for that. Instead, it experiments: it tries holding the green a few seconds longer on the main road, then watches what happens to the length of the queues. Shorter queues and smoother flow earn it a reward; long tailbacks and gridlock count against it. Run that loop thousands of times — at first mostly in a simulation of the intersection — and the system gradually works out timing patterns that keep cars moving better than a fixed schedule ever could, adapting to rush hour, quiet evenings, and the surge after a nearby event lets out. It was never told the right answer. People defined what counts as success — shorter queues, smoother flow — and the system learned how to achieve it purely from how it was scored.
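How "success" gets turned into a score can be made concrete with a small reward function. This is only a sketch: the function name, the gridlock threshold, and the penalty size are hypothetical choices, and a real traffic system would likely measure more than queue lengths.

```python
def intersection_reward(queue_lengths, gridlock_threshold=20, gridlock_penalty=10.0):
    """Score one moment at the intersection from the cars waiting on each approach."""
    # Fewer waiting cars is better, so the reward is the negative total queue.
    reward = -float(sum(queue_lengths))
    # Any single approach backing up badly counts as gridlock and costs extra.
    if max(queue_lengths) >= gridlock_threshold:
        reward -= gridlock_penalty
    return reward

light_traffic = intersection_reward([3, 2, 4, 1])
jammed = intersection_reward([25, 6, 9, 4])
print(light_traffic, jammed)
```

The function says nothing about how long any light should stay green. It only scores outcomes, and the learning system is left to work out which timings produce the better scores.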
Frequently asked questions
What is the difference between reinforcement learning and supervised learning?
Supervised learning trains on data that already carries the correct answers — every example comes labeled, and the system learns to reproduce those labels. Reinforcement learning gets no labeled answers. It learns from rewards and penalties earned by acting in an environment, figuring out good behavior through trial and error rather than copying a provided answer key. Put simply: supervised learning studies a solved exam, while reinforcement learning learns by playing the game and keeping score.
Is reinforcement learning how ChatGPT and other AI assistants are trained?
Partly. The bulk of an AI assistant's knowledge comes from training on huge amounts of text, which is a different process. But a finishing step called reinforcement learning from human feedback (RLHF) uses reinforcement learning to refine how the model responds: human raters compare answers to show which they prefer, a reward model learns that preference, and reinforcement learning then steers the main model toward responses that score well. So reinforcement learning is not the whole story, but it is an important part of why modern assistants behave the way they do.
Why does reinforcement learning need so many attempts to learn?
Because it starts out knowing nothing about which actions are good and has to discover that purely from rewards. With no answer key to copy, the only way to find out whether a choice was wise is to try it and see the score — often much later, once the consequences play out. Sorting the genuinely good moves from lucky ones takes a great many repetitions, which is why so much reinforcement learning is done in fast, cheap simulations where the system can practice millions of times before it is trusted with anything real.