Softmax Bottleneck

Almost all modern LLMs map a relatively low-dimensional hidden state to a high-dimensional probability distribution over the vocabulary using a single matrix followed by a softmax. The rank of this linear map is bounded by the hidden size, so not every valid probability distribution over tokens can be represented. In particular, some mixtures of tokens cannot be expressed without also assigning spuriously high probability to other tokens, especially when such a mixture is rare in the training data. This has a number of consequences.
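The rank limit is easy to see numerically. In the sketch below (illustrative shapes and names, not any particular model), hidden states of size `d` are mapped through an unembedding matrix to logits over a vocabulary of size `V`. Stacking the resulting log-distributions for many contexts gives a matrix whose rank is at most `d + 1` (the `d` dimensions of the logits plus one for the per-row log-normalizer), far below the `V` dimensions a fully general family of distributions would need:

```python
import numpy as np

rng = np.random.default_rng(0)
d, V, N = 8, 50, 200  # hidden size, vocab size, number of contexts (toy values)

W = rng.standard_normal((V, d))  # unembedding matrix: hidden -> logits
H = rng.standard_normal((N, d))  # one hidden state per context

logits = H @ W.T  # (N, V)
# Row-wise log-softmax: subtract each row's log-normalizer.
log_probs = logits - np.logaddexp.reduce(logits, axis=1, keepdims=True)

# Each row is a valid log-distribution over V tokens, yet the rows jointly
# span at most a (d + 1)-dimensional subspace of R^V: rank(H @ W.T) <= d,
# plus one dimension contributed by the per-row normalizer.
rank = np.linalg.matrix_rank(log_probs)
print(rank, "<=", d + 1, "<<", V)
```

Because every achievable log-distribution lives in this low-dimensional subspace, a target distribution outside it (for example, high probability on an unusual mixture of tokens) can only be approximated, leaking probability mass onto unintended tokens.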

References: