The original Transformer is based on the encoder-decoder architecture, which suits tasks like machine translation, where a sequence of words is translated from one language to another.

Encoder: converts an input sequence of tokens into a sequence of embedding vectors, often called the hidden state or context. Each encoder layer receives a sequence of embeddings and feeds them through its sublayers.

Decoder: uses the encoder's hidden state to iteratively generate an output sequence of tokens, one token at a time.
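The two roles above can be sketched in plain Python. This is only an illustrative toy, not a real Transformer: the `encode` and `decode` functions are hypothetical stand-ins that show the data flow (encode once, then generate output tokens one step at a time from the hidden state), with no attention or learned weights.

```python
def encode(src_tokens):
    # Encoder: maps the input token sequence to a "hidden state".
    # Here the hidden state is just (token, position) pairs; a real
    # encoder would produce contextual embedding vectors.
    return [(tok, i) for i, tok in enumerate(src_tokens)]

def decode(hidden_state, max_len=10, eos="<eos>"):
    # Decoder: consumes the encoder's hidden state and generates
    # output tokens iteratively, one token per loop step.
    output = []
    for step in range(max_len):
        # Stand-in for a real prediction step: copy the source token.
        next_tok = hidden_state[step][0] if step < len(hidden_state) else eos
        if next_tok == eos:
            break
        output.append(next_tok)
    return output

hidden = encode(["ich", "mag", "katzen"])
print(decode(hidden))  # one output token produced per iteration
```

In a real model, each decoding step would attend over the full encoder hidden state and over the tokens generated so far, rather than reading positions off one by one.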
