01. Transformer Insight

Transformer Insights


Prerequisites

1. Transformer

This post assumes you already understand the basic structure of the Transformer.


Transformer Process

1. Space Transformation to Query, Key, Value?

Space Transformation

Attention has three main ingredients: Query, Key, and Value.

Let’s think about transforming the input into a query. For example, suppose the input has dimension $ [\text{tokens in the sentence}] \times [\text{embedding size}] $ and the query weight has dimension $ [\text{embedding size}] \times [\text{query size}] $.

transformer-query

Multiplying the input by the weight transforms the input from the input space (the embedding-size axis) to the query space (the query-size axis). The query weight is learnable, so the query space itself is tuned during training to fit the task.
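As a rough sketch of this projection, the NumPy snippet below multiplies a random input matrix by a random query weight and prints the shapes. The sizes (4 tokens, embedding size 8, query size 3) and the names `X`, `W_Q`, and `Q` are illustrative assumptions, not values from a real model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumed for this sketch).
n_tokens, d_embed, d_query = 4, 8, 3

X = rng.normal(size=(n_tokens, d_embed))    # input: [tokens] x [embedding size]
W_Q = rng.normal(size=(d_embed, d_query))   # query weight: [embedding size] x [query size]

Q = X @ W_Q                                 # queries: [tokens] x [query size]
print(X.shape, W_Q.shape, Q.shape)          # (4, 8) (8, 3) (4, 3)
```

The embedding-size axis disappears in the product, and each token ends up with coordinates along the query-size axis instead.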

transformer-query2

2. Attention Weight

Attention weights
Relationship of Key Query 

The attention weights produce contextualized embeddings. Without attention weights, the phrase “A large orange robot” is just four separate word embeddings. With attention weights, the embedding of “robot” becomes more contextualized.

transformer-attention

The attention weights are computed from the keys and queries.
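A minimal sketch of how the weights can be computed from queries and keys is shown below. It assumes the standard scaled dot-product form (dot products divided by the square root of the key size, then a row-wise softmax); the sizes and the random `Q` and `K` are placeholders for illustration.

```python
import numpy as np

def softmax(z, axis=-1):
    # Subtract the row maximum for numerical stability.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n_tokens, d_k = 4, 3                      # illustrative sizes (assumed)
Q = rng.normal(size=(n_tokens, d_k))      # one query per token
K = rng.normal(size=(n_tokens, d_k))      # one key per token

scores = Q @ K.T / np.sqrt(d_k)           # rows: queries, columns: keys
weights = softmax(scores, axis=-1)        # each row is a distribution over keys

print(weights.shape)                      # (4, 4)
print(weights.sum(axis=-1))               # every row sums to 1 (up to rounding)
```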

transformer-attention2

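When we use masked multi-head attention, we mask one triangle of the attention matrix: the entries where a query would attend to a later key. In a layout where keys index the rows and queries index the columns, this is the lower triangle. Now think about multiplying the masked matrix by the values. For each query, only the current and previous keys contribute to the weighted sum of values. In other words, the mask ensures that the output at each position is not affected by future words.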
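The effect of the mask can be checked with a small sketch. Note the layout assumed here: rows index queries and columns index keys, so the future positions sit in the upper triangle; with the transposed layout they would sit in the lower triangle. The sizes and random matrices are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_k, d_v = 4, 3, 2              # illustrative sizes (assumed)
Q = rng.normal(size=(n_tokens, d_k))
K = rng.normal(size=(n_tokens, d_k))
V = rng.normal(size=(n_tokens, d_v))

scores = Q @ K.T / np.sqrt(d_k)           # rows: queries, columns: keys

# True where the key position is later than the query position.
future = np.triu(np.ones((n_tokens, n_tokens), dtype=bool), k=1)
masked = np.where(future, -np.inf, scores)

# Row-wise softmax; masked entries become exactly 0.
masked = masked - masked.max(axis=-1, keepdims=True)
weights = np.exp(masked)
weights = weights / weights.sum(axis=-1, keepdims=True)

out = weights @ V

# Changing the value of the last token leaves all earlier outputs unchanged.
V_changed = V.copy()
V_changed[-1] += 100.0
out_changed = weights @ V_changed
print(np.allclose(out[:-1], out_changed[:-1]))   # True
```

The first token can only attend to itself, so its output equals its own value vector.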

transformer-attention3

3. Softmax with Temperature?

Temperature
\[\text{Softmax}_T(z_i) = \frac{e^{z_i / T}}{\sum_j e^{z_j / T}}\]

Where:

  • $T > 0$ is the temperature
  • $T = 1$ → standard softmax

🔥 $T < 1$ (Sharper Distribution)

  • Increases confidence
  • Makes the highest probability closer to 1
  • Approaches argmax behavior

❄ $T > 1$ (Softer Distribution)

  • Flattens the probability distribution
  • Increases uncertainty
  • Encourages exploration

Numerical Example

Let:

\[z = [2, 1, 0]\]
  • $T = 1$: $\approx [0.665, 0.245, 0.090]$
  • $T = 0.5$: $\approx [0.867, 0.117, 0.016]$
  • $T = 2$: $\approx [0.506, 0.307, 0.186]$
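The numbers above can be reproduced with a short sketch. The function name `softmax_t` is a hypothetical helper written for this example; it simply divides the logits by $T$ before applying the usual softmax.

```python
import numpy as np

def softmax_t(z, T=1.0):
    """Softmax of logits z at temperature T > 0."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max()            # subtract the maximum for numerical stability
    e = np.exp(z)
    return e / e.sum()

z = [2, 1, 0]
for T in (1.0, 0.5, 2.0):
    print(T, np.round(softmax_t(z, T), 3))
```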

Intuition

Temperature scales the logits before exponentiation:

  • Small $T$ → magnifies differences
  • Large $T$ → compresses differences

It directly controls the entropy of the distribution.
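To make the link to entropy concrete, the sketch below computes the Shannon entropy (in nats) of the same logits $z = [2, 1, 0]$ at several temperatures. The helpers `softmax_t` and `entropy` are hypothetical names defined only for this example.

```python
import numpy as np

def softmax_t(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    z = z - z.max()            # numerical stability
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    # Shannon entropy in nats; all probabilities here are strictly positive.
    p = np.asarray(p, dtype=float)
    return float(-(p * np.log(p)).sum())

z = [2, 1, 0]
entropies = {T: entropy(softmax_t(z, T)) for T in (0.5, 1.0, 2.0)}
for T, h in entropies.items():
    print(f"T = {T}: entropy = {h:.3f}")
```

The entropy grows with $T$ and stays below $\ln 3$, the entropy of the uniform distribution over three outcomes.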

4. Orthogonality and Information Capacity

Let the initial weight vectors be:

\[\mathbf{w}_i, \mathbf{w}_j \in \mathbb{R}^d\]

Orthogonality condition:

\[\mathbf{w}_i^\top \mathbf{w}_j = 0 \quad (i \neq j)\]

Information independence increases as:

\[\cos\theta_{ij} = \frac{\mathbf{w}_i^\top \mathbf{w}_j} {\|\mathbf{w}_i\|\|\mathbf{w}_j\|} \rightarrow 0\]

Therefore, with the $\mathbf{w}_i$ taken as unit-norm columns of $\mathbf{W}$, the information capacity is maximized when:

\[\mathbf{W}^\top \mathbf{W} = \mathbf{I}\]

The more orthogonal the initial weight vectors are,
the greater the independent information capacity.
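As an illustration of the condition $\mathbf{W}^\top \mathbf{W} = \mathbf{I}$, the sketch below builds a matrix with orthonormal columns from the QR decomposition of a random Gaussian matrix. This is one common way to obtain such a matrix and is used here as an assumption, not as the method of any particular model; the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 8, 4                                # illustrative sizes (assumed)

A = rng.normal(size=(d, k))                # random, generally non-orthogonal columns
W, _ = np.linalg.qr(A)                     # reduced QR: W is d x k with orthonormal columns

def max_abs_cosine(M):
    """Largest |cos(theta_ij)| over pairs of distinct columns of M."""
    U = M / np.linalg.norm(M, axis=0, keepdims=True)
    G = U.T @ U
    return float(np.abs(G - np.diag(np.diag(G))).max())

print(np.allclose(W.T @ W, np.eye(k)))     # True
print(max_abs_cosine(A), max_abs_cosine(W))
```

The random columns of `A` have nonzero pairwise cosines, while those of `W` are zero up to floating-point error.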

Mutual Information Perspective

Assume two representations:

\[h_i = \mathbf{w}_i^\top \mathbf{x}, \qquad h_j = \mathbf{w}_j^\top \mathbf{x}\]

If the weight vectors are correlated:

\[\cos\theta_{ij} \neq 0\]

then the outputs $h_i$ and $h_j$ are statistically dependent.

The mutual information between them is:

\[I(h_i ; h_j)\]

When the vectors become more aligned, the mutual information increases due to shared directional components.

However, the total usable information in a representation layer is maximized when:

\[I(h_i ; h_j) = 0\]

That is, when the features are statistically independent.

For inputs whose components are uncorrelated with equal variance, orthogonality enforces zero linear correlation between the outputs:

\[\mathbf{w}_i^\top \mathbf{w}_j = 0\]

which reduces redundancy and maximizes effective representational capacity.
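A quick empirical check of this relationship is sketched below. It assumes isotropic Gaussian inputs, for which the correlation between $h_i$ and $h_j$ equals $\cos\theta_{ij}$ and zero correlation also implies independence; for other input distributions, zero correlation is a weaker property than zero mutual information. The specific vectors are chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 16, 100_000                         # illustrative sizes (assumed)
X = rng.normal(size=(n, d))                # isotropic Gaussian inputs (assumed)

w_i = np.zeros(d)
w_i[0] = 1.0
w_orth = np.zeros(d)
w_orth[1] = 1.0                            # orthogonal to w_i, cosine = 0
w_aligned = (w_i + w_orth) / np.sqrt(2)    # cosine with w_i = 1/sqrt(2)

def output_corr(a, b):
    """Sample correlation between h_a = a^T x and h_b = b^T x."""
    return float(np.corrcoef(X @ a, X @ b)[0, 1])

corr_orth = output_corr(w_i, w_orth)
corr_aligned = output_corr(w_i, w_aligned)
print(f"orthogonal weights: corr = {corr_orth:.3f}")
print(f"aligned weights:    corr = {corr_aligned:.3f}")
```

The orthogonal pair gives a sample correlation near 0, and the partially aligned pair gives one near $1/\sqrt{2} \approx 0.707$.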

5. Dimension of Transformer?

Example: GPT-3

| Component | Dimensions | GPT-3 values | Percent of parameters |
|---|---|---|---|
| Embedding | embedding x vocab | 12288 x 50257 | 0.353% |
| Key | key x embedding x heads x layers | 128 x 12288 x 96 x 96 | 8.274% |
| Query | query x embedding x heads x layers | 128 x 12288 x 96 x 96 | 8.274% |
| Value | value x embedding x heads x layers | 128 x 12288 x 96 x 96 | 8.274% |
| Output | embedding x value x heads x layers | 12288 x 128 x 96 x 96 | 8.274% |
| Up-projection | neurons x embedding x layers | 49152 x 12288 x 96 | 33.096% |
| Down-projection | neurons x embedding x layers | 49152 x 12288 x 96 | 33.096% |
| Unembedding | vocab x embedding | 50257 x 12288 | 0.353% |
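The parameter counts behind the table can be recomputed from the listed dimensions. The sketch below counts only the weight matrices named in the table (no biases or normalization parameters), and the variable names are chosen for this example. The resulting percentages should agree with the table up to small differences in the last digits.

```python
# Dimensions taken from the table above.
d_embed = 12288      # embedding size
n_vocab = 50257      # vocabulary size
d_head = 128         # key / query / value size per head
n_heads = 96
n_layers = 96
n_neurons = 49152    # hidden size of the feed-forward block

counts = {
    "Embedding": d_embed * n_vocab,
    "Key": d_head * d_embed * n_heads * n_layers,
    "Query": d_head * d_embed * n_heads * n_layers,
    "Value": d_head * d_embed * n_heads * n_layers,
    "Output": d_embed * d_head * n_heads * n_layers,
    "Up-projection": n_neurons * d_embed * n_layers,
    "Down-projection": n_neurons * d_embed * n_layers,
    "Unembedding": n_vocab * d_embed,
}

total = sum(counts.values())
for name, count in counts.items():
    print(f"{name:<16}{count:>18,d}{100 * count / total:>9.3f}%")
print(f"{'Total':<16}{total:>18,d}")
```

The total comes to 175,181,291,520 weights, and the two feed-forward projections together account for roughly two thirds of it.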
This post is licensed under CC BY 4.0 by the author.