Diff of Large Language Model at a93594a

@@ -6,3 +6,3 @@ The era of LLMs is often taken to have begun with [[https://arxiv.org/abs/1706.0
 
-The modern decoder-only architecture and [[self-supervised]] pretraining used now derives from [[https://openai.com/index/language-unsupervised/|OpenAI's GPT-1]], which demonstrated that then-large amounts of compute (240 GPU-days with unspecified GPUs) could be very effective across tasks with unspecialized training data and architecture. [[https://openai.com/index/better-language-models/|GPT-2]], which changed little but scaled further, was widely regarded as a [[cool toy]] by those aware of it, though those within [[OpenAI]] apparently understood the promise of scaling, as it was followed up by [[https://arxiv.org/abs/2005.14165|GPT-3]], which added several orders of magnitude more compute, resulting in [[https://gwern.net/gpt-3#what-benchmarks-miss-demos|enough capability]] to be both [[useful]] and [[fearsome]].
+The modern decoder-only architecture and [[self-supervised]] pretraining used now derives from [[https://openai.com/index/language-unsupervised/|OpenAI's GPT-1]], which demonstrated that then-large amounts of compute (240 GPU-days with unspecified GPUs) could be very effective across tasks with unspecialized training data and architecture. [[https://openai.com/index/better-language-models/|GPT-2]], which changed little but scaled further, was widely regarded as a [[cool toy]] by those aware of it, though those within [[OpenAI]] apparently understood the promise of scaling, as it was followed up by [[https://arxiv.org/abs/2005.14165|GPT-3]], which added several orders of magnitude more compute, resulting in [[https://gwern.net/gpt-3#what-benchmarks-miss-demos|enough capability]] to be both [[useful]] and [[fearsome]], mostly via [[in-context learning]].