The Seventy Maxims Of Maximally Effective Machine Learning Engineers

Preprocess, then train.

A training loop in motion outranks a perfect architecture that isn’t implemented.

A debugger with a stack trace outranks everyone else.

Regularization covers a multitude of sins.

Feature importance and data leakage should be easier to tell apart.

If increasing model complexity wasn’t your last resort, you failed to add enough layers.

If the accuracy is high enough, stakeholders will stop complaining about the compute costs.

Harsh critiques have their place—usually in the rejected pull requests.

Never turn your back on a deployed model.

Sometimes the only way out is through… through another epoch.

Every dataset is trainable—at least once.

A gentle learning rate turneth away divergence. Once the loss stabilizes, crank it up.

Do unto others’ hyperparameters as you would have them do unto yours.

“Innovative architecture” means never asking, “What’s the worst thing this could hallucinate?”

Only you can prevent vanishing gradients.

Your model is in the leaderboards: be sure it has dropout.

The longer training goes without overfitting, the bigger the validation-set disaster.

If the optimizer is leading from the front, watch for exploding gradients in the rear.

The field advances when you turn competitors into collaborators, but that’s not the same as your h-index advancing.

If you’re not willing to prune your own layers, you’re not willing to deploy.

Give a model a labeled dataset, and it trains for a day. Take its labels away and call it “self-supervised,” and it’ll generate new ones for you to validate tomorrow.

If you’re manually labeling data, somebody’s done something wrong.

Memory-bound and compute-bound should be easier to tell apart.

Any sufficiently advanced algorithm is indistinguishable from a matrix multiplication.

If your model’s failure is covered by the SLA, you didn’t test enough edge cases.

“Fire-and-forget training” is fine, provided you never actually forget to monitor the run.

Don’t be afraid to be the first to try a random seed.

If the cost of cloud compute is high enough, you might get promoted for shutting down idle instances.

The enemy of my bias is my variance. No more. No less.

A little inductive bias goes a long way. The less you use, the further you’ll scale.

Only overfitters prosper (temporarily).

Any model is production-ready if you can containerize it.

If you’re logging metrics, you’re being audited.

If you’re leaving GPUs unused, you need a bigger model.

That which does not break your model has made a suboptimal adversarial example.

When the loss plateaus, the wise call for more data.

There is no “overkill.” There is only “more tokens” and “CUDA out of memory.”

What’s trivial in Jupyter can still crash in production.

There’s a difference between spare GPUs and GPUs you’ve accidentally mined Ethereum on.

Not all NaN is a bug—sometimes it’s a feature.

“Do you have a checkpoint?” means “I can’t fix this training run.”

“They’ll never expect this activation function” means “I want to try something non-differentiable.”

If it’s a hack and it works, it’s still a hack and you’re lucky.

If it can parallelize inference, it can double as a space heater.

The size of the grant is inversely proportional to the reproducibility of the results.

Don’t try to save money by undersampling.

Don’t expect the data to cooperate in the creation of your dream benchmark.

If it ain’t overfit, it hasn’t been trained on enough epochs.

Every client is one missed deadline away from switching to AutoML, and every AutoML is one custom loss function away from becoming a client.

If it only works on the training set, it’s defective.

Let them see you tune the hyperparameters before you abandon the project.

The framework you’ve got is never the framework you want.

The data you’ve got is never the data you want.

It’s only too many layers if you can’t fit them in VRAM.

It’s only too much compute if the power grid collapses.

Data engineers exist to format tables for people with real GPUs.

Reinforcement learning exists to burn through compute budgets on simulated environments.

The whiteboard is mightiest when it sketches architectures for more transformers.

“Two baselines is probably not going to be enough.”

A model’s inference time is inversely proportional to the urgency of the demo.

Don’t bring BERT into a logistic regression.

Any tensor labeled “output” is dangerous at both ends.

The CTO knows how to do it by knowing who Googled it.

An ounce of precision is worth a pound of recall.

After the merge, be the one with the main branch, not the one with the conflicts.

Necessity is the mother of synthetic data.

If you can’t explain it, cite the arXiv paper.

Deploying with confidence intervals doesn’t mean you shouldn’t also deploy with a kill switch.

Sometimes SOTA is a function of who had the biggest TPU pod.

Failure is not an option—it is mandatory. The option is whether to let failure be the last epoch or a learning rate adjustment.
