Transformer
A Transformer is a type of neural network architecture that processes language by learning which parts of a text are most relevant to each other, and is the foundation on which most modern AI language systems are built.
What is a Transformer?
Before the Transformer arrived in 2017, most AI language systems processed text the way a person with a very short memory might read a sentence: word by word, in sequence, with earlier words fading in importance by the time the end of the sentence arrived. This worked well enough for short, simple text but struggled with longer passages where the meaning of a word depended on something said much earlier. The Transformer addressed this with a mechanism called attention, which allows the model to look at all the words in a passage simultaneously and work out which ones are most relevant to each other, regardless of how far apart they appear. A pronoun near the end of a paragraph can be connected back to the noun it refers to at the beginning, in a single step.
This architectural shift turned out to be one of the most significant in the history of AI. The Transformer did not just improve language processing. Because it handles all the words in a passage in parallel rather than one at a time, it made it practical to train AI systems on vastly larger amounts of text than had been possible before, a change that owed as much to advances in hardware and computational scale as to the architecture itself. That combination made it possible to build the large language models that power today's AI assistants. The models behind ChatGPT, Claude, Gemini, and most other modern language systems use a Transformer architecture or a direct descendant of it.
What makes the Transformer particularly significant is that its usefulness turned out not to be limited to language. Researchers have since applied the same attention-based architecture to images, audio, video, and scientific data, with strong results in each area. The architecture that most people associate with AI writing and conversation turns out to be a general-purpose pattern for learning from almost any data that can be broken into a sequence of pieces. That breadth is a large part of why the Transformer has become the dominant architectural choice across so many areas of modern AI research and development.
Real-world example
When you ask an AI assistant to summarize a ten-page document and it correctly identifies which paragraphs are most important to the overall argument, even when those paragraphs are scattered throughout the text, that is the Transformer's attention mechanism at work. The model is not reading the document line by line and hoping key points appear near the end. It is weighing every part of the text against every other part simultaneously to work out what matters most.
Frequently asked questions
What does the T in ChatGPT stand for?
Transformer. GPT stands for Generative Pre-trained Transformer, a name that describes what the model does (generates text), how it is prepared (pre-trained on large amounts of text), and the architecture it is built on (the Transformer). The Transformer is the underlying structure that makes the model capable of understanding and producing language at scale.
What is self-attention?
Self-attention is the core mechanism inside a Transformer that allows it to weigh the relevance of every word in a sequence against every other word, all at once. When a model reads the sentence 'The trophy didn't fit in the suitcase because it was too big,' self-attention is what lets it link 'it' to the trophy rather than the suitcase. It is a large part of the reason Transformers handle context and meaning so much more effectively than the language models that came before them.
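For readers who want to see the idea in code, here is a minimal sketch of self-attention in plain Python. It uses the scaled dot-product form: each word's vector is compared with every other word's vector, the comparison scores are turned into weights that sum to 1, and those weights are used to blend the word vectors together. Everything specific here is an assumption made for illustration: the function names, the three-word sequence, and the hand-picked two-number vectors are hypothetical, and the same vectors stand in for queries, keys, and values. A real Transformer learns separate projections for each of those roles and works with far larger vectors.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product: larger when two vectors point the same way."""
    return sum(x * y for x, y in zip(a, b))


def self_attention(queries, keys, values):
    """Scaled dot-product attention over one sequence.

    Returns (outputs, weights). weights[i][j] is how much
    position i attends to position j.
    """
    scale = math.sqrt(len(keys[0]))
    width = len(values[0])
    outputs, all_weights = [], []
    for q in queries:
        # Compare this position with every position, near or far.
        scores = [dot(q, k) / scale for k in keys]
        weights = softmax(scores)
        # Blend the value vectors using the attention weights.
        mixed = [
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(width)
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical, hand-picked vectors (not learned): "it" is placed
# close to "trophy" so the effect is easy to see.
tokens = ["trophy", "suitcase", "it"]
vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]

# Simplification: the same vectors serve as queries, keys, and values.
outputs, weights = self_attention(vectors, vectors, vectors)

for token, weight in zip(tokens, weights[2]):
    print(f"'it' attends to '{token}' with weight {weight:.2f}")
```

With these made-up vectors, the row of weights for 'it' favors 'trophy' over 'suitcase'. In a trained model, that preference comes from vectors learned during training rather than from values chosen by hand.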
Is the Transformer the same as a large language model?
Not exactly. The Transformer is an architecture — a way of structuring a neural network. A large language model is a system built using that architecture and trained on vast amounts of text. The relationship is similar to the difference between an engine design and the car built around it. Most large language models today use the Transformer architecture, but the Transformer itself is the underlying pattern, not the finished product.