The original Transformer is based on the encoder-decoder architecture, which suits tasks like machine translation, where a sequence of words is translated from one language to another.

Encoder: converts an input sequence of tokens into a sequence of embedding vectors, often called the hidden state or context. Each encoder layer receives a sequence of embeddings and feeds them through its sublayers.

Decoder: uses the encoder's hidden state to iteratively generate an output sequence of tokens, one token at a time.
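The two roles above can be sketched in plain Python. This is only an illustrative toy, not a real Transformer: the `encode` and `decode` functions are hypothetical stand-ins that show the data flow (encode once, then generate output tokens one step at a time from the hidden state), with no attention or learned weights.

```python
def encode(src_tokens):
    # Encoder: maps the input token sequence to a "hidden state".
    # Here the hidden state is just (token, position) pairs; a real
    # encoder would produce contextual embedding vectors.
    return [(tok, i) for i, tok in enumerate(src_tokens)]

def decode(hidden_state, max_len=10, eos="<eos>"):
    # Decoder: consumes the encoder's hidden state and generates
    # output tokens iteratively, one token per loop step.
    output = []
    for step in range(max_len):
        # Stand-in for a real prediction step: copy the source token.
        next_tok = hidden_state[step][0] if step < len(hidden_state) else eos
        if next_tok == eos:
            break
        output.append(next_tok)
    return output

hidden = encode(["ich", "mag", "katzen"])
print(decode(hidden))  # one output token produced per iteration
```

In a real model, each decoding step would attend over the full encoder hidden state and over the tokens generated so far, rather than reading positions off one by one.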
