Lecture Slides

Lecture Notes

A "sparse" reward system only indicates wins or losses. This isn't very helpful since it only evaluates one step ahead, making it unsuitable for long-term planning. That's why we use a "denser" reward system with specific heuristics. While this approach provides more learning supervision, it can lead to reward hacking.


The value function helps distribute rewards across multiple states rather than concentrating them only at the end state. This means that instead of only getting feedback when reaching a final outcome (like winning or losing), the value function helps assign meaningful values to intermediate states and actions that lead to those outcomes. This makes learning more effective since the system can understand which steps along the way are beneficial, rather than only learning from the final result.

This is particularly important in the initial learning stages when all rewards might be zero. Without a discriminative value function, the agent would have no way to differentiate between different states or actions, making it impossible to learn which paths might lead to better outcomes. The value function helps create a gradient that guides the learning process, even when immediate rewards aren't available.
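
As a rough illustration, consider a hypothetical five-state chain with a reward only at the very end (the chain and its dynamics are assumptions for this sketch). Repeated Bellman backups spread that single terminal reward back through the earlier states, giving the agent a gradient to follow even though almost all immediate rewards are zero:

```python
# Hypothetical 5-state chain with assumed "always move right" dynamics,
# showing how a value function spreads a terminal-only reward backwards.

gamma = 0.9
num_states = 5                      # states 0..4; reward only on reaching state 4
V = [0.0] * num_states              # values start at zero: no gradient yet

for sweep in range(50):
    for s in range(num_states - 1):
        next_s = s + 1              # assumed policy: always move right
        r = 1.0 if next_s == num_states - 1 else 0.0
        V[s] = r + gamma * V[next_s]   # Bellman backup: V(s) = E[r + gamma * V(s')]

print([round(v, 3) for v in V])
# -> [0.729, 0.81, 0.9, 1.0, 0.0]  intermediate states now have distinct values
```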

The Value function

The value function V(s) measures how good a state is; its value depends on the policy being followed.

$$ V(s) = \mathbb{E}\left[ r + \gamma V(s') \right] $$

where:

- r is the immediate reward received after leaving state s,
- γ is the discount factor (between 0 and 1),
- s' is the next state, and the expectation is taken over the policy's actions and the environment's transitions.
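
In practice this expectation is usually estimated from sampled transitions. The sketch below uses a TD(0)-style update to nudge V(s) toward the sampled target r + γV(s'); the env.reset/env.step interface and the function name are assumptions for illustration:

```python
# TD(0)-style estimate of V(s) = E[r + gamma * V(s')] from sampled transitions.
# The environment interface (env.reset / env.step) is assumed here.
from collections import defaultdict

def td0_value_estimate(env, policy, gamma=0.9, alpha=0.1, episodes=1000):
    V = defaultdict(float)                      # V(s) starts at 0 for every state
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)                       # sample an action from the policy
            s_next, r, done = env.step(a)       # sampled reward and next state
            target = r + gamma * (0.0 if done else V[s_next])
            V[s] += alpha * (target - V[s])     # move V(s) toward r + gamma * V(s')
            s = s_next
    return V
```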

Policy

The probability of each action in each state: a probabilistic mapping from states to actions, often written π(a|s). (The circled part in the figure below is the policy.)
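
As a small illustration (the states, actions, and preference scores below are made up), a stochastic policy can be represented as a per-state probability distribution over actions, here computed with a softmax over action preferences:

```python
# A stochastic policy pi(a | s): each state maps to a probability distribution
# over actions. The preference scores are hypothetical, for illustration only.
import math
import random

preferences = {
    "s0": {"left": 0.2, "right": 1.5},
    "s1": {"left": 1.0, "right": 1.0},
}

def policy(state):
    """Return {action: probability} for a state via a softmax over preferences."""
    prefs = preferences[state]
    z = sum(math.exp(p) for p in prefs.values())
    return {a: math.exp(p) / z for a, p in prefs.items()}

def sample_action(state):
    probs = policy(state)
    return random.choices(list(probs), weights=probs.values(), k=1)[0]

print(policy("s0"))   # approximately {'left': 0.21, 'right': 0.79}
```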