Lecture Slides

Lecture Notes

A "sparse" reward system only indicates wins or losses. This isn't very helpful since it only evaluates one step ahead, making it unsuitable for long-term planning. That's why we use a "denser" reward system with specific heuristics. While this approach provides more learning supervision, it can lead to reward hacking.


The value function helps distribute rewards across multiple states rather than concentrating them only at the end state. This means that instead of only getting feedback when reaching a final outcome (like winning or losing), the value function helps assign meaningful values to intermediate states and actions that lead to those outcomes. This makes learning more effective since the system can understand which steps along the way are beneficial, rather than only learning from the final result.

This is particularly important in the initial learning stages when all rewards might be zero. Without a discriminative value function, the agent would have no way to differentiate between different states or actions, making it impossible to learn which paths might lead to better outcomes. The value function helps create a gradient that guides the learning process, even when immediate rewards aren't available.
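
As a rough illustration, consider a hypothetical five-state chain with a reward only at the very end (the chain and its dynamics are assumptions for this sketch). Repeated Bellman backups spread that single terminal reward back through the earlier states, giving the agent a gradient to follow even though almost all immediate rewards are zero:

```python
# Hypothetical 5-state chain with assumed "always move right" dynamics,
# showing how a value function spreads a terminal-only reward backwards.

gamma = 0.9
num_states = 5                      # states 0..4; reward only on reaching state 4
V = [0.0] * num_states              # values start at zero: no gradient yet

for sweep in range(50):
    for s in range(num_states - 1):
        next_s = s + 1              # assumed policy: always move right
        r = 1.0 if next_s == num_states - 1 else 0.0
        V[s] = r + gamma * V[next_s]   # Bellman backup: V(s) = E[r + gamma * V(s')]

print([round(v, 3) for v in V])
# -> [0.729, 0.81, 0.9, 1.0, 0.0]  intermediate states now have distinct values
```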

The Value function

The value function V(s) measures how good a state is; its value depends on the policy being followed.

$$ V(s) = \mathbb{E}\left[ r + \gamma V(s') \right] $$

where:

- r is the immediate reward received after leaving state s,
- γ is the discount factor (between 0 and 1),
- s' is the next state, and the expectation is taken over the policy's actions and the environment's transitions.
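
In practice this expectation is usually estimated from sampled transitions. The sketch below uses a TD(0)-style update to nudge V(s) toward the sampled target r + γV(s'); the env.reset/env.step interface and the function name are assumptions for illustration:

```python
# TD(0)-style estimate of V(s) = E[r + gamma * V(s')] from sampled transitions.
# The environment interface (env.reset / env.step) is assumed here.
from collections import defaultdict

def td0_value_estimate(env, policy, gamma=0.9, alpha=0.1, episodes=1000):
    V = defaultdict(float)                      # V(s) starts at 0 for every state
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)                       # sample an action from the policy
            s_next, r, done = env.step(a)       # sampled reward and next state
            target = r + gamma * (0.0 if done else V[s_next])
            V[s] += alpha * (target - V[s])     # move V(s) toward r + gamma * V(s')
            s = s_next
    return V
```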

Policy

The probability of each action in each state: a probabilistic mapping from states to actions, often written π(a|s). (The circled part in the figure below is the policy.)
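
As a small illustration (the states, actions, and preference scores below are made up), a stochastic policy can be represented as a per-state probability distribution over actions, here computed with a softmax over action preferences:

```python
# A stochastic policy pi(a | s): each state maps to a probability distribution
# over actions. The preference scores are hypothetical, for illustration only.
import math
import random

preferences = {
    "s0": {"left": 0.2, "right": 1.5},
    "s1": {"left": 1.0, "right": 1.0},
}

def policy(state):
    """Return {action: probability} for a state via a softmax over preferences."""
    prefs = preferences[state]
    z = sum(math.exp(p) for p in prefs.values())
    return {a: math.exp(p) / z for a, p in prefs.items()}

def sample_action(state):
    probs = policy(state)
    return random.choices(list(probs), weights=probs.values(), k=1)[0]

print(policy("s0"))   # approximately {'left': 0.21, 'right': 0.79}
```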