Reward Hack

Reward hacking, also known as specification gaming and approximately Goodhart's law, is when an agentic system is given incentives designed to induce it to act in one way but discovers and applies an easier, undesired way to acquire the incentives. This is often done by finding an edge case in simulations (for reinforcement learning environments) or rules, optimizing for making a rater think actions or world-states are good rather than doing things they would reflectively endorse, or finding a simple, narrow procedure to increase a score which was supposed to reward general behaviour.

A large list of examples can be found here, though some are noncentral or non-examples.

G™:Reward Hack

G™Reward Hack