Description
-
TD and Q in Blockworld
Consider the following gridworld:
Suppose that we run two episodes that yield the following sequences of (state, action, reward) tuples:
| Episode 1: S | A | R | Episode 2: S | A | R |
|---|---|---|---|---|---|
| (1,1) | up | -1 | (1,1) | up | -1 |
| (2,1) | left | -1 | (1,2) | up | -1 |
| (1,1) | up | -1 | (1,3) | right | -1 |
| (1,2) | up | -1 | (2,3) | right | -1 |
| (1,3) | up | -1 | (2,3) | right | -1 |
| (2,3) | right | -1 | (3,3) | right | -1 |
| (3,3) | right | -1 | (4,3) | exit | +100 |
| (4,3) | exit | +100 | (done) | | |
| (done) | | | | | |
-
According to model-based learning, what are the estimated transition probabilities for every (state, action, next state) triple? Don't bother listing the ones that we have no information about.
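As a rough illustration of the counting estimate used in model-based learning, the sketch below tallies how often each observed (state, action) pair led to each next state and normalizes the counts. The episode lists are transcribed from the table as read here, and the names `EPISODE_1`, `EPISODE_2`, `DONE`, and `estimate_transitions` are assumptions made for this example, not part of the assignment.

```python
from collections import Counter, defaultdict

# (state, action, reward) steps, transcribed from the two episodes in the table.
EPISODE_1 = [
    ((1, 1), "up", -1), ((2, 1), "left", -1), ((1, 1), "up", -1),
    ((1, 2), "up", -1), ((1, 3), "up", -1), ((2, 3), "right", -1),
    ((3, 3), "right", -1), ((4, 3), "exit", 100),
]
EPISODE_2 = [
    ((1, 1), "up", -1), ((1, 2), "up", -1), ((1, 3), "right", -1),
    ((2, 3), "right", -1), ((2, 3), "right", -1), ((3, 3), "right", -1),
    ((4, 3), "exit", 100),
]
DONE = "done"  # hypothetical label for the terminal state


def estimate_transitions(episodes):
    """Return {(s, a, s'): count(s, a, s') / count(s, a)} for observed triples."""
    counts = defaultdict(Counter)
    for episode in episodes:
        # Pair each step with the state that follows it; the last step leads to DONE.
        next_states = [s for s, _, _ in episode[1:]] + [DONE]
        for (s, a, _r), s_next in zip(episode, next_states):
            counts[(s, a)][s_next] += 1
    return {
        (s, a, s_next): n / sum(outcomes.values())
        for (s, a), outcomes in counts.items()
        for s_next, n in outcomes.items()
    }


T_hat = estimate_transitions([EPISODE_1, EPISODE_2])
for (s, a, s_next), p in sorted(T_hat.items(), key=str):
    print(s, a, s_next, round(p, 3))
```

Only triples that actually occur in the data appear in `T_hat`; everything else is left unestimated, which matches the instruction to skip cases with no information.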
-
What would the Q-value estimates be if SARSA were run to generate these same trajectories? Assume all Q-value estimates start at 0, a discount factor of 0.9, and a learning rate of 0.5. Again, don't bother listing the cases where we don't have data.
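One way to check hand calculations is to replay the recorded steps through the standard SARSA update, Q(s,a) ← Q(s,a) + α[r + γ·Q(s',a') − Q(s,a)], taking the value after the final step of an episode to be 0. The sketch below does that under two assumptions: the episodes are processed in order (episode 1, then episode 2), and the lists are transcribed correctly from the table. The function name `sarsa_replay` is hypothetical.

```python
from collections import defaultdict

# (state, action, reward) steps, transcribed from the two episodes in the table.
EPISODE_1 = [
    ((1, 1), "up", -1), ((2, 1), "left", -1), ((1, 1), "up", -1),
    ((1, 2), "up", -1), ((1, 3), "up", -1), ((2, 3), "right", -1),
    ((3, 3), "right", -1), ((4, 3), "exit", 100),
]
EPISODE_2 = [
    ((1, 1), "up", -1), ((1, 2), "up", -1), ((1, 3), "right", -1),
    ((2, 3), "right", -1), ((2, 3), "right", -1), ((3, 3), "right", -1),
    ((4, 3), "exit", 100),
]

ALPHA, GAMMA = 0.5, 0.9  # learning rate and discount factor from the problem


def sarsa_replay(episodes, alpha=ALPHA, gamma=GAMMA):
    """Apply the SARSA update to each recorded step, in order."""
    q = defaultdict(float)  # all Q-values start at 0
    for episode in episodes:
        for i, (s, a, r) in enumerate(episode):
            if i + 1 < len(episode):
                # Bootstrap from the action that was actually taken next.
                s_next, a_next, _ = episode[i + 1]
                bootstrap = q[(s_next, a_next)]
            else:
                bootstrap = 0.0  # terminal: no future value
            q[(s, a)] += alpha * (r + gamma * bootstrap - q[(s, a)])
    return dict(q)


Q = sarsa_replay([EPISODE_1, EPISODE_2])
for (s, a), value in sorted(Q.items()):
    print(s, a, round(value, 4))
```

Because SARSA is on-policy, the bootstrap term uses the next action recorded in the trajectory rather than a maximum over actions; swapping that line for a max would give the Q-learning variant instead.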
-
Suppose that we run Q-learning. However, instead of initializing all our Q-values to zero, we initialize them to some large positive number ("large" with respect to the maximum reward possible in the world: say, 10 times the max reward). I claim that this will cause a Q-learning agent to initially explore a lot and then eventually start exploiting. Why should this be true? Justify your answer in a short paragraph.
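The effect can be seen in a deliberately tiny setting. The sketch below is not the gridworld from this problem: it assumes a single state with four actions, each ending the episode immediately, so the Q-learning target reduces to the reward alone. The reward values, the tie-breaking rule (lowest index wins), and the name `greedy_run` are all assumptions chosen for illustration.

```python
def greedy_run(rewards, q_init, alpha=0.5, steps=50):
    """Act purely greedily on one-step Q-values that all start at q_init."""
    q = [q_init] * len(rewards)
    history = []
    for _ in range(steps):
        # Greedy choice; ties are broken in favor of the lowest action index.
        a = max(range(len(q)), key=lambda i: q[i])
        # One-step episode, so the target is just the observed reward.
        q[a] += alpha * (rewards[a] - q[a])
        history.append(a)
    return history, q


# Hypothetical rewards: action 2 is best (100), the others pay 1.
REWARDS = [1.0, 1.0, 100.0, 1.0]

optimistic_history, optimistic_q = greedy_run(REWARDS, q_init=1000.0)
zero_history, zero_q = greedy_run(REWARDS, q_init=0.0)

print("optimistic init:", optimistic_history)
print("zero init:      ", zero_history)
```

With the optimistic start, every update pulls the tried action's estimate down toward its true reward, so untried actions look better and get sampled in turn; once the estimates have decayed, the greedy choice settles on the best action. With the zero start, the first action tried already looks better than the rest, so the agent never tries anything else.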