Transformers, Simply Explained

Autobots or Decepticons?

All transformers really do is fill in the blanks and autocomplete.

Salman Naqvi


Tuesday, 28 February 2023

A picture of Transformers — the ones that transform from a robot into cars — posing as English alphabets.

At a High View

Transformers are all the rage right now. They’re what’s powering the current wave of chat bots. Here’s a high level view of how transformers work, so you know how these bots really work.

Simply put, a transformer is a type of architecture used for Natural Language Processing (NLP) tasks that either fills-in-the-blanks or autocompletes.

Transformers consist of either an encoder, decoder, or both. Encoders and decoders contain attention layers.

Language models need numbers to work. Text is given a numerical representation after breaking it down into smaller pieces. To keep this explanation simple, these pieces are words.

The numerical representation given to a word describes the word itself and its relation to the surrounding words.


Encoder-only transformers are good for “understanding” text, such as classifying sentences by sentiment or figuring out what parts of a sentence refers, for example, to a person or location.

When training encoders, words are given a numerical representation by the attention layers considering adjacent words. For example, let’s say we have the sentence, “I am really hungry.”. The attention layers consider the words ‘am’ and ‘hungry’ when giving the word ‘really’ a numerical representation.

The goal of training encoders is to predict words omitted from text (e.g., “I … really hungry.”). This is how encoders can “understand” text.


Decoder-only transformers are good for text generation. An example is the autocomplete feature on a smartphone’s keyboard.

Decoders similary give text a numerical representation, except that the attention layers consider only the previous words. When giving ‘am’ a numerical representation from “I am hungry.”, the attention layers will only consider the word ‘I’. When giving ‘hungry’ a numerical representation, the attention layers will consider the words ‘I’ and ‘am’.

The goal of training decoders is to predict the most likely word to continue a piece of text (e.g., “I am ….”). All generated words are used in conjunction to generate the next word.

Encoders and Decoders

Transformers that use both encoders and decoders are known as encoder-decoder models or sequence-to-sequence models. Such models are good for translation and summarization.

Encoder-decoder models are trained by first letting the encoder give the input text a numerical representation. Next, this representation is input to the decoder which generates text as described above. The encoder part of the model provides the “understanding”, while the decoder part of the model generates based off of this “understanding”.

Closing Words

And there you have it! It’s as simple as that!

If you have any comments, questions, suggestions, feedback, criticisms, or corrections, please do post them down in the comment section below!