Name: HW6: Reinforcement Learning Solution
SKU: 6553
Price: 30.00 USD
Availability: InStock

HW6: Reinforcement Learning Solution

~~$30.00~~ $24.00

Description

TD and Q in Blockworld

Consider the following gridworld:

Suppose that we run two episodes that yield the following sequences of (state, action, reward) tuples:

Episode 1:

S	A	R
(1,1)	up	-1
(2,1)	left	-1
(1,1)	up	-1
(1,2)	up	-1
(1,3)	up	-1
(2,3)	right	-1
(3,3)	right	-1
(4,3)	exit	+100
(done)

Episode 2:

S	A	R
(1,1)	up	-1
(1,2)	up	-1
(1,3)	right	-1
(2,3)	right	-1
(2,3)	right	-1
(3,3)	right	-1
(4,3)	exit	+100
(done)

According to model-based learning, what are the transition probabilities for every (state, action, state) triple? Don’t bother listing all the ones that we have no information about.
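As a hedged illustration of the counting involved, the sketch below estimates each transition probability as the fraction of times a (state, action) pair led to a given next state in the two episodes. The names `EPISODE_1`, `EPISODE_2`, `estimate_transitions`, and the `"done"` terminal marker are assumptions made for this example; they are not part of the assignment.

```python
from collections import Counter, defaultdict
from fractions import Fraction

# Each episode is a list of (state, action, reward) tuples copied from the
# tables above. The state reached after the last tuple is the terminal
# marker "done" (an assumed encoding).
EPISODE_1 = [
    ((1, 1), "up", -1), ((2, 1), "left", -1), ((1, 1), "up", -1),
    ((1, 2), "up", -1), ((1, 3), "up", -1), ((2, 3), "right", -1),
    ((3, 3), "right", -1), ((4, 3), "exit", 100),
]
EPISODE_2 = [
    ((1, 1), "up", -1), ((1, 2), "up", -1), ((1, 3), "right", -1),
    ((2, 3), "right", -1), ((2, 3), "right", -1), ((3, 3), "right", -1),
    ((4, 3), "exit", 100),
]


def estimate_transitions(episodes, terminal="done"):
    """Return {(s, a, s_next): empirical probability} from observed episodes."""
    counts = defaultdict(Counter)
    for episode in episodes:
        for i, (state, action, _reward) in enumerate(episode):
            # The next state is the state of the following tuple, or the
            # terminal marker after the final tuple.
            s_next = episode[i + 1][0] if i + 1 < len(episode) else terminal
            counts[(state, action)][s_next] += 1
    return {
        (state, action, s_next): Fraction(n, sum(next_counts.values()))
        for (state, action), next_counts in counts.items()
        for s_next, n in next_counts.items()
    }


T_HAT = estimate_transitions([EPISODE_1, EPISODE_2])
for (state, action, s_next), prob in sorted(T_HAT.items(), key=str):
    print(state, action, s_next, prob)
```

Using `Fraction` keeps the estimates exact (for example thirds), which makes them easier to compare with a hand calculation.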

What would the Q-value estimates be if SARSA were run to generate these same trajectories? Assume all Q-value estimates start at 0, a discount factor of 0.9, and a learning rate of 0.5. Again, don’t bother listing all of the cases where we don’t have data.
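One way to check a hand calculation is to replay the recorded tuples through the standard SARSA update, Q(s,a) ← Q(s,a) + α [r + γ Q(s',a') − Q(s,a)]. The sketch below is a minimal version under three assumptions that the problem does not state explicitly: updates are applied online in the order the tuples appear, episode 1 is processed before episode 2, and the value after the terminal `exit` step is 0. The names `EPISODES` and `sarsa_replay` are hypothetical.

```python
from collections import defaultdict

ALPHA = 0.5  # learning rate given in the problem
GAMMA = 0.9  # discount factor given in the problem

# (state, action, reward) tuples copied from the two episodes above.
EPISODES = [
    [
        ((1, 1), "up", -1), ((2, 1), "left", -1), ((1, 1), "up", -1),
        ((1, 2), "up", -1), ((1, 3), "up", -1), ((2, 3), "right", -1),
        ((3, 3), "right", -1), ((4, 3), "exit", 100),
    ],
    [
        ((1, 1), "up", -1), ((1, 2), "up", -1), ((1, 3), "right", -1),
        ((2, 3), "right", -1), ((2, 3), "right", -1), ((3, 3), "right", -1),
        ((4, 3), "exit", 100),
    ],
]


def sarsa_replay(episodes, alpha=ALPHA, gamma=GAMMA):
    """Apply SARSA updates along fixed trajectories, starting from Q = 0."""
    q = defaultdict(float)
    for episode in episodes:
        for i, (state, action, reward) in enumerate(episode):
            if i + 1 < len(episode):
                # Bootstrap from the action actually taken next (on-policy).
                s_next, a_next, _ = episode[i + 1]
                bootstrap = q[(s_next, a_next)]
            else:
                # Assumed: the terminal state has value 0.
                bootstrap = 0.0
            td_error = reward + gamma * bootstrap - q[(state, action)]
            q[(state, action)] += alpha * td_error
    return dict(q)


Q_SARSA = sarsa_replay(EPISODES)
for (state, action), value in sorted(Q_SARSA.items()):
    print(state, action, round(value, 4))
```

Only the (state, action) pairs that appear in the data receive a value. Every other pair stays at its initial estimate of 0.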

Suppose that we run Q-learning. However, instead of initializing all our Q-values to zero, we initialize them to some large positive number ("large" with respect to the maximum reward possible in the world: say, 10 times the max reward). I claim that this will cause a Q-learning agent to initially explore a lot and then eventually start exploiting. Why should this be true? Justify your answer in a short paragraph.
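The effect can be seen in a toy setting. The sketch below is not the gridworld from this problem: it assumes a hypothetical one-state task with three actions, where each action ends the episode immediately, so the Q-learning target r + γ max Q(s',a') reduces to r. The reward values, the function names `greedy` and `run_optimistic`, and the purely greedy action selection are all assumptions chosen for illustration.

```python
def greedy(q_values):
    """Index of the largest Q-value; ties go to the lowest index."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def run_optimistic(rewards, q_init, alpha=0.5, gamma=0.9, steps=50):
    """Greedy Q-learning on a one-step task with optimistic initial values."""
    q = [q_init] * len(rewards)
    chosen = []
    for _ in range(steps):
        action = greedy(q)
        # The next state is terminal, so max over Q(s', a') is taken as 0.
        target = rewards[action] + gamma * 0.0
        q[action] += alpha * (target - q[action])
        chosen.append(action)
    return chosen, q


# Hypothetical rewards; the initial value is 10 times the largest reward.
REWARDS = [10.0, 100.0, 50.0]
CHOSEN, Q_FINAL = run_optimistic(REWARDS, q_init=10 * max(REWARDS))

print("first actions:", CHOSEN[:8])
print("last actions: ", CHOSEN[-8:])
```

Every update pulls the tried action's estimate down toward its true, smaller return, so untried or less-tried actions look better and get selected next, even with no explicit exploration rule. Once all estimates have fallen close to their true values, the greedy choice settles on the best action.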
