softmax_bottleneck at 0bda0e4

Almost all modern LLMs map a relatively low-dimensional hidden state to a high-dimensional probability distribution over tokens using a single matrix multiplication followed by a softmax. The logits produced this way always lie in a subspace whose dimension is at most the hidden size, so when the hidden size is smaller than the vocabulary size, not every valid probability distribution can be represented. This has a number of consequences.
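The rank limit can be checked numerically. The following is a minimal sketch, not taken from any particular model: the sizes `d`, `vocab`, and `n_contexts`, and the random matrices `H` (hidden states) and `W` (output matrix), are all assumptions chosen for illustration. It stacks the logits for many contexts into one matrix and measures its rank.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: hidden size much smaller than vocabulary size.
d, vocab, n_contexts = 8, 100, 50

H = rng.standard_normal((n_contexts, d))  # one hidden state per context (assumed random)
W = rng.standard_normal((vocab, d))       # output (unembedding) matrix (assumed random)

# Logits for every context: shape (n_contexts, vocab).
logits = H @ W.T

# Numerically stable log-softmax along the vocabulary axis.
shifted = logits - logits.max(axis=1, keepdims=True)
log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

# The logit matrix is a product through a d-dimensional space, so its rank is at most d.
rank_logits = np.linalg.matrix_rank(logits)

# Log-softmax subtracts one scalar per row, which can add at most one to the rank.
rank_log_probs = np.linalg.matrix_rank(log_probs)

print("rank of logits:", rank_logits, "(bound:", d, ")")
print("rank of log-probs:", rank_log_probs, "(bound:", d + 1, ")")
print("rank an unconstrained matrix of this shape could reach:", min(n_contexts, vocab))
```

A matrix of log-probabilities for 50 arbitrary distributions over 100 tokens could have rank up to 50, while anything produced by this head stays within the `d + 1` bound, however the hidden states are chosen.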

References:

https://x.com/kalomaze/status/1776341569542431150
https://aclanthology.org/2022.acl-long.554/
