Large Language Model

A large language model (LLM) is a neural network model of language that is large, usually in the sense of parameter count or total training compute, which by scaling laws makes it good at text prediction. These models are usually autoregressive and pretrained on general text data with a next-token prediction loss, though neither property is strictly required. The largest LLMs known are around 2 trillion parameters; there is no agreed-upon threshold at the small end for what still counts as "large".
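
As a concrete illustration of the next-token prediction objective, here is a minimal sketch of the loss computation, assuming a toy PyTorch setup; the stand-in model and variable names are illustrative, not taken from any particular codebase.

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len, batch = 100, 16, 4

# Stand-in "model": an embedding plus a linear head. A real LLM would be a
# decoder-only Transformer, but the loss computation is the same.
embed = torch.nn.Embedding(vocab_size, 32)
head = torch.nn.Linear(32, vocab_size)

# A batch of token sequences (random here, purely for illustration).
tokens = torch.randint(0, vocab_size, (batch, seq_len))

# Predict token t+1 from tokens up to t: inputs are tokens[:, :-1],
# targets are tokens[:, 1:].
logits = head(embed(tokens[:, :-1]))   # (batch, seq_len - 1, vocab_size)
targets = tokens[:, 1:]                # (batch, seq_len - 1)

# Cross-entropy over the vocabulary, averaged over all positions: this is
# the next-token prediction loss used in pretraining.
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
print(loss.item())
```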

History

The era of LLMs is often taken to have begun with Attention Is All You Need, published in 2017, though that paper drew heavily on previous work in machine translation. It introduced the Transformer architecture used in most modern LLMs, but with only ~1e8 parameters and supervised (translation) training data, its models are not central examples.

The decoder-only architecture and self-supervised pretraining used in modern LLMs derive from OpenAI's GPT-1, which demonstrated that then-large amounts of compute (240 GPU-days, with unspecified GPUs) could be very effective across tasks without specialized training data or architecture. GPT-2, which changed little but scaled further, was widely regarded as a cool toy by those aware of it, though those within OpenAI apparently understood the promise of scaling. It was followed by GPT-3, which added several orders of magnitude more compute and resulted in enough capability to be both useful and fearsome, mostly via in-context learning.

Development slowed at this point due to design and compute limits. In the 2022 Chinchilla paper, DeepMind showed that the existing scaling laws were wrong: models could be trained substantially more efficiently by reducing parameter count and training on more data (a rough sketch of the implied arithmetic follows below). Mixture-of-Experts architectures granted further compute efficiency. It was not until March 2023, by which time several organizations had produced GPT-3-level models, that OpenAI announced GPT-4 (it had finished training in August 2022 but was kept secret for some eight months of "safety testing"). It represented a very significant advance and, combined with the instruction tuning that had made ChatGPT a user-friendly product, crushed all competitors. Following this, however, was the GPT-4 Wall.
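
As a rough sketch of the arithmetic behind the Chinchilla result, the snippet below combines the common C ≈ 6ND approximation for training compute with the roughly 20-tokens-per-parameter rule of thumb often quoted from the paper; both are approximations, not the paper's exact fitted scaling laws.

```python
# Rough sketch of Chinchilla-style compute-optimal sizing. Assumes training
# compute C ≈ 6 * N * D (N = parameters, D = training tokens) and a
# ~20 tokens-per-parameter ratio; both are approximations, not the paper's
# exact fitted scaling laws.

def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    # With C = 6 * N * D and D = r * N:  N = sqrt(C / (6 * r)),  D = r * N
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Example: a GPT-3-scale compute budget of ~3e23 FLOPs.
n, d = chinchilla_optimal(3e23)
print(f"~{n / 1e9:.0f}B parameters trained on ~{d / 1e12:.1f}T tokens")
```

Under these assumptions, a GPT-3-sized compute budget is better spent on a model of roughly 50B parameters trained on around a trillion tokens, rather than a 175B-parameter model trained on far fewer.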

Applications