The Seventy Maxims Of Maximally Effective Machine Learning Engineers

Based on The Seventy Maxims of Maximally Effective Mercenaries. This was suggested by Not Louis, and not Louis. Written by DeepSeek-R1 and gollark.

  1. Preprocess, then train.

  2. A training loop in motion outranks a perfect architecture that isn’t implemented.

  3. A debugger with a stack trace outranks everyone else.

  4. Regularization covers a multitude of sins.

  5. Feature importance and data leakage should be easier to tell apart.

  6. If increasing model complexity wasn’t your last resort, you failed to add enough layers.

  7. If the accuracy is high enough, stakeholders will stop complaining about the compute costs.

  8. Harsh critiques have their place – usually in the rejected pull requests.

  9. Never turn your back on a reinforcement learner.

  10. Sometimes the only way out is through… through another epoch.

  11. Every dataset is trainable at least once.

  12. A gentle learning rate turneth away divergence. Once the loss stabilizes, crank it up. (See the warmup sketch after the list.)

  13. Do unto others’ hyperparameters as you would have them do unto yours.

  14. “Innovative architecture” means never asking “did we implement a proper baseline?”

  15. Only you can prevent reward hacking.

  16. Your model is on the leaderboards: be sure it has dropout.

  17. The longer your Claude Code runs without input, the bigger the impending disaster.

  18. If the researchers are leading from the front, watch for hardware failures in the rear.

  19. The field advances when you turn competitors into collaborators, but that’s not the same as your h-index advancing.

  20. If you’re not willing to quantize your own models, you’re not willing to deploy.

  21. Give a model a labeled dataset, and it trains for a day. Take its labels away and call it “self-supervised” and it’ll generate new ones for you to validate tomorrow.

  22. If you’re manually labeling data, somebody’s done something wrong. Conversely, if you’re not manually reading data, something’s going to go wrong.

  23. Memory-bound and compute-bound should be easier to tell apart.

  24. Any sufficiently advanced algorithm is indistinguishable from a matrix multiplication.

  25. If your kernel obeys the hardware manufacturer’s documentation, you didn’t do enough optimization.

  26. “Fire-and-forget training” is fine, provided you never actually forget to monitor the run. (See the watchdog sketch after the list.)

  27. Don’t be afraid to be the first to try a random seed.

  28. If the cost of cloud compute is high enough, you might get promoted for shutting down idle instances.

  29. The enemy of my bias is my variance. No more. No less.

  30. A little inductive bias goes a long way. The less you use, the further you’ll scale.

  31. Only overfitters prosper (temporarily).

  32. Any model is production-ready if you can containerize it.

  33. If you’re logging metrics, you’re being audited.

  34. If you’re leaving GPUs unused, you need a bigger model.

  35. That which does not break your model has made a suboptimal adversarial example.

  36. When the loss plateaus, the wise call for more data.

  37. There is no “overkill.” There is only “more tokens” and “CUDA out of memory.”

  38. What’s trivial in Jupyter can still crash in production.

  39. There’s a difference between spare GPUs and GPUs you’ve accidentally mined Ethereum on.

  40. Not all NaN is a bug – sometimes it’s a feature.

  41. “Do you have a checkpoint?” means “I can’t fix this training run.”

  42. “We propose a novel method” means “This has no sound mathematical basis.”

  43. If it’s a hack and it works, it’s still a hack and you’re lucky.

  44. If it will run inference, it will double as a space heater.

  45. The size of the grant is inversely proportional to the reproducibility of the results.

  46. Don’t try to save money by undersampling.

  47. Don’t expect the data to cooperate in the creation of your dream benchmark.

  48. If it ain’t overfit, it hasn’t been trained for enough epochs.

  49. Every client is one missed deadline away from switching to AutoML, and every AutoML is one custom loss function away from becoming a client.

  50. If it only works on the training set, it’s defective.

  51. Let them see you tune the hyperparameters before you abandon the project.

  52. The framework you’ve got is never the framework you want.

  53. The data you’ve got is never the data you want.

  54. It’s only too many layers if you can’t fit them in VRAM.

  55. It’s only too much compute if the power grid collapses.

  56. Data engineers exist to format tables for people with real GPUs.

  57. Reinforcement learning exists to burn through compute budgets on simulated environments.

  58. The whiteboard is mightiest when it sketches architectures for more transformers.

  59. Two config options is probably not going to be enough.

  60. A model’s inference time is inversely proportional to the urgency of the demo.

  61. Don’t bring BERT into a logistic regression.

  62. Any switch labeled PYTORCH_NO_POWERPLANT_BLOWUP is dangerous with both settings.

  63. The CTO knows how to do it by knowing who Googled it.

  64. An ounce of precision is worth a pound of recall.

  65. After the merge, be the one with the main branch, not the one with the conflicts.

  66. Necessity is the mother of synthetic data.

  67. If you can’t explain it, cite the arXiv paper.

  68. Deploying with monitoring doesn’t mean you shouldn’t also deploy with a kill switch.

  69. Sometimes SOTA is a function of who had the biggest TPU pod.

  70. Bugs are not an option – they are mandatory. The option is whether or not to catch them before releasing the paper.
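
Maxim 12 describes the familiar warmup-then-raise learning-rate schedule. Here is a minimal illustrative sketch in plain Python; the step counts and rates are arbitrary assumptions for the example, not recommendations.

```python
# Illustrative only: linear warmup followed by a higher "cruise" learning rate,
# in the spirit of Maxim 12. All constants here are arbitrary assumptions.

def learning_rate(step: int,
                  warmup_steps: int = 1_000,
                  base_lr: float = 1e-4,
                  cruise_lr: float = 3e-4) -> float:
    """Return the learning rate for a given optimizer step."""
    if step < warmup_steps:
        # Gentle: ramp linearly up to base_lr so early updates don't diverge.
        return base_lr * (step + 1) / warmup_steps
    # Once the loss has (presumably) stabilized, crank it up.
    return cruise_lr

if __name__ == "__main__":
    for s in (0, 500, 999, 1_000, 10_000):
        print(s, learning_rate(s))
```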
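
Maxims 26, 40, and 41 all come down to watching a long run rather than trusting it. The sketch below is a minimal, hypothetical watchdog loop: train_step and save_checkpoint are stand-ins for whatever your framework actually provides, and the NaN check and checkpoint interval are arbitrary choices.

```python
# Illustrative only: a minimal "don't actually forget" watchdog for a
# long-running training loop (Maxims 26, 40, 41). train_step() and
# save_checkpoint() are hypothetical stand-ins, not a real framework API.

import math
import random

def train_step(step: int) -> float:
    """Hypothetical training step; returns the loss (occasionally NaN, as in life)."""
    return float("nan") if random.random() < 0.001 else 1.0 / (step + 1)

def save_checkpoint(step: int) -> None:
    """Hypothetical checkpoint writer; in practice, persist model and optimizer state."""
    print(f"checkpoint saved at step {step}")

def run(max_steps: int = 10_000, checkpoint_every: int = 1_000) -> None:
    for step in range(max_steps):
        loss = train_step(step)
        if math.isnan(loss):
            # Maxim 40 notwithstanding, treat NaN as a bug: halt before burning
            # more compute, and leave a checkpoint to answer Maxim 41.
            print(f"NaN loss at step {step}; halting the run")
            save_checkpoint(step)
            return
        if step % checkpoint_every == 0:
            save_checkpoint(step)

if __name__ == "__main__":
    run()
```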