Diff of The Seventy Maxims Of Maximally Effective Machine Learning Engineers at d8f3a35

@@ -31,3 +31,3 @@ Based on [[https://schlockmercenary.fandom.com/wiki/The_Seventy_Maxims_of_Maxima
 *. The enemy of my bias is my variance. No more. No less.
-*. A little dropout goes a long way. The less you use, the further gradients backpropagate.
+*. A little inductive bias goes a long way. The less you use, the further you'll scale.
 *. Only overfitters prosper (temporarily).
@@ -38,3 +38,3 @@ Based on [[https://schlockmercenary.fandom.com/wiki/The_Seventy_Maxims_of_Maxima
 *. When the loss plateaus, the wise call for more data.
-*. There is no “overkill.” There is only “more epochs” and “CUDA out of memory.”
+*. There is no “overkill.” There is only “more tokens” and “CUDA out of memory.”
 *. What’s trivial in Jupyter can still crash in production.
@@ -60,3 +60,3 @@ Based on [[https://schlockmercenary.fandom.com/wiki/The_Seventy_Maxims_of_Maxima
 *. The whiteboard is mightiest when it sketches architectures for more transformers.
-*. “Two dropout layers is probably not going to be enough.”
+*. “Two baselines is probably not going to be enough.”
 *. A model’s inference time is inversely proportional to the urgency of the demo.