Based on [[https://schlockmercenary.fandom.com/wiki/The_Seventy_Maxims_of_Maximally_Effective_Mercenaries|The Seventy Maxims of Maximally Effective Mercenaries]]. This was suggested by Not Louis, and not Louis. Written by [[DeepSeek-R1]] and [[gollark]].

*. Preprocess, then train.
*. A training loop in motion outranks a perfect architecture that isn’t implemented.
*. A debugger with a stack trace outranks everyone else.
*. Regularization covers a multitude of sins.
*. Feature importance and data leakage should be easier to tell apart.
*. If increasing model complexity wasn’t your last resort, you failed to add enough layers.
*. If the accuracy is high enough, stakeholders will stop complaining about the compute costs.
*. Harsh critiques have their place – usually in the rejected pull requests.
*. Never turn your back on a reinforcement learner.
*. Sometimes the only way out is through… through another epoch.
*. Every dataset is trainable at least once.
*. A gentle learning rate turneth away divergence. Once the loss stabilizes, crank it up.
*. Do unto others’ hyperparameters as you would have them do unto yours.
*. “Innovative architecture” means never asking “did we implement a proper baseline?”
*. Only you can prevent reward hacking.
*. Your model is in the leaderboards: be sure it has dropout.
*. The longer your Claude Code runs without input, the bigger the impending disaster.
*. If the researchers are leading from the front, watch for hardware failures in the rear.
*. The field advances when you turn competitors into collaborators, but that’s not the same as your h-index advancing.
*. If you’re not willing to quantize your own models, you’re not willing to deploy.
*. Give a model a labeled dataset, and it trains for a day. Take its labels away and call it “self-supervised” and it’ll generate new ones for you to validate tomorrow.
*. If you’re manually labeling data, somebody’s done something wrong. Conversely, if you're not manually reading data, something's going to go wrong.
*. Memory-bound and compute-bound should be easier to tell apart.
*. Any sufficiently advanced algorithm is indistinguishable from a matrix multiplication.
*. If your kernel obeys the hardware manufacturer's documentation, you didn't do enough optimization.
*. “Fire-and-forget training” is fine, provided you never actually forget to monitor the run.
*. Don’t be afraid to be the first to try a random seed.
*. If the cost of cloud compute is high enough, you might get promoted for shutting down idle instances.
*. The enemy of my bias is my variance. No more. No less.
*. A little inductive bias goes a long way. The less you use, the further you'll scale.
*. Only overfitters prosper (temporarily).
*. Any model is production-ready if you can containerize it.
*. If you’re logging metrics, you’re being audited.
*. If you’re leaving GPUs unused, you need a bigger model.
*. That which does not break your model has made a suboptimal adversarial example.
*. When the loss plateaus, the wise call for more data.
*. There is no “overkill.” There is only “more tokens” and “CUDA out of memory.”
*. What’s trivial in Jupyter can still crash in production.
*. There’s a difference between spare GPUs and idle GPUs.
*. Not all NaN is a bug – sometimes it’s a feature.
*. “Do you have a checkpoint?” means “I can’t fix this training run.”
*. “We propose a novel method” means “This has no sound mathematical basis.”
*. If it’s a hack and it works, it’s still a hack and you’re lucky.
*. If it will run inference, it will double as a space heater.
*. The size of the grant is inversely proportional to the reproducibility of the results.
*. Don’t try to save money by undersampling.
*. Don’t expect the data to cooperate in the creation of your dream benchmark.
*. If it ain’t overfit, it hasn’t been trained on enough epochs.
*. Every client is one missed deadline away from switching to AutoML, and every AutoML is one custom loss function away from becoming a client.
*. If it only works on the training set, it’s defective.
*. Let them see you tune the hyperparameters before you abandon the project.
*. The framework you’ve got is never the framework you want.
*. The data you’ve got is never the data you want.
*. It’s only too many layers if you can’t fit them in VRAM.
*. It’s only too much compute if the power grid collapses.
*. Data engineers exist to format tables for people with real GPUs.
*. Reinforcement learning exists to burn through compute budgets on simulated environments.
*. The whiteboard is mightiest when it sketches architectures for more transformers.
*. Two config options is probably not going to be enough.
*. A model’s inference time is inversely proportional to the urgency of the demo.
*. Don’t bring BERT into a logistic regression.
*. Any switch labeled `PYTORCH_NO_POWERPLANT_BLOWUP` is dangerous with either setting.
*. The CTO knows how to do it by knowing who Googled it.
*. An ounce of precision is worth a pound of recall.
*. After the merge, be the one with the main branch, not the one with the conflicts.
*. Necessity is the mother of synthetic data.
*. If you can’t explain it, cite the arXiv paper.
*. Deploying with monitoring doesn’t mean you shouldn’t also deploy with a kill switch.
*. Sometimes SOTA is a function of who had the biggest TPU pod.
*. Bugs are not an option – they are mandatory. The option is whether or not to catch them before releasing the paper.