Description
Question 1: Health Behaviours
Consider the following causal graphical model involving three Bernoulli random variables, which is a simple model of health status and behaviours: H (health status), C (cautious behaviour), D (disease).
People’s health status influences whether they adopt cautious behaviour, and their health status together with their behaviour influences their probability of disease; that is, the graph has edges H → C, H → D, and C → D.
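As a sketch of what this structure implies, the joint distribution over the three variables would factorise along the dependencies described (H influences C, and H and C together influence D). No conditional probability values are given in this excerpt, so the factorisation is stated symbolically only:

```latex
P(H, C, D) = P(H)\, P(C \mid H)\, P(D \mid H, C)
```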
Consider the following 6-armed bandit problem. The initial value estimates of the arms are given by Q = {1, 2, 2, 1, 0, 3}, and the actions are represented by A = {1, 2, 3, 4, 5, 6}. Suppose we observe that each lever is played in turn, from lever 1 to lever 6 and then starting again from lever 1:
$$A_t = ((t - 1) \bmod 6) + 1 \qquad (1)$$
We also observe that the rewards seem to fit the following function:
$$R_t = 2\cos\left[\frac{\pi}{6}(t - 1)\right] \qquad (2)$$
So, the first two action-reward pairs are $A_1 = 1$, $R_1 = 2$, and $A_2 = 2$, $R_2 = \sqrt{3}$.
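The schedule and reward pattern can be tabulated with a few lines of Python. This is a minimal sketch, assuming (1) is read as A_t = ((t - 1) mod 6) + 1 and (2) as R_t = 2 cos[π(t - 1)/6], which reproduces the two pairs quoted above; the function names `action` and `reward` are illustrative only.

```python
import math

def action(t: int) -> int:
    """Lever played at time step t (1-indexed), cycling through 1..6."""
    return ((t - 1) % 6) + 1

def reward(t: int) -> float:
    """Reward at time step t under the assumed cosine pattern."""
    return 2 * math.cos(math.pi / 6 * (t - 1))

# Print t, A_t and R_t for the first twelve time steps.
for t in range(1, 13):
    print(t, action(t), round(reward(t), 4))
```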
Show the estimated Q values from $t = 1$ to $t = 12$ of the trajectory, using the average of the observed rewards where available. Do not consider the initial estimates as samples.
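One way to check a hand calculation is a sample-average update in Python. The sketch below assumes the lever schedule A_t = ((t - 1) mod 6) + 1 and rewards R_t = 2 cos[π(t - 1)/6], and interprets "do not consider the initial estimates as samples" as meaning the first observed reward replaces an arm's initial estimate; `sample_average_trajectory` is a hypothetical helper name.

```python
import math

INITIAL_Q = [1.0, 2.0, 2.0, 1.0, 0.0, 3.0]  # initial estimates from the problem

def sample_average_trajectory(steps: int = 12) -> list[list[float]]:
    """Return the list of Q estimates held after each time step 1..steps."""
    q = INITIAL_Q.copy()
    counts = [0] * 6                 # observed samples per arm
    history = []
    for t in range(1, steps + 1):
        a = (t - 1) % 6              # 0-based index of lever A_t
        r = 2 * math.cos(math.pi / 6 * (t - 1))
        counts[a] += 1
        if counts[a] == 1:
            q[a] = r                 # first sample replaces the initial estimate
        else:
            q[a] += (r - q[a]) / counts[a]   # incremental form of the mean
        history.append(q.copy())
    return history

for t, q in enumerate(sample_average_trajectory(), start=1):
    print(t, [round(x, 3) for x in q])
```

Arms that have not yet been played keep their initial estimate in each printed row.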
It turns out the player was following an ε-greedy strategy, which just happened to coincide with the scheme described above in (1) for the first 12 time steps. For each time step $t$ from 1 to 12, report whether it can be concluded with certainty that a random action was selected.
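A possible mechanical check: a step must have been random whenever the lever played was not greedy with respect to the estimates held just before that step. The sketch below assumes greedy steps pick any arm with the maximal current estimate (ties broken arbitrarily), that unplayed arms keep their initial estimates, and the same schedule and reward pattern as (1) and (2). The tolerance `tol` is an assumption added to absorb floating-point error when estimates tie, and `certainly_random_steps` is a hypothetical name.

```python
import math

def certainly_random_steps(steps: int = 12, tol: float = 1e-9) -> list[int]:
    """Time steps whose lever was strictly worse than the best current estimate."""
    q = [1.0, 2.0, 2.0, 1.0, 0.0, 3.0]   # initial estimates
    counts = [0] * 6
    flagged = []
    for t in range(1, steps + 1):
        a = (t - 1) % 6                   # 0-based index of lever A_t
        # Not greedy if some other arm has a strictly larger estimate.
        if q[a] < max(q) - tol:
            flagged.append(t)
        r = 2 * math.cos(math.pi / 6 * (t - 1))
        counts[a] += 1
        # Sample average; the first sample replaces the initial estimate.
        q[a] = r if counts[a] == 1 else q[a] + (r - q[a]) / counts[a]
    return flagged

print(certainly_random_steps())
```

Steps that are not flagged are merely consistent with a greedy choice; an exploratory draw can still land on the greedy arm, so they cannot be classified with certainty either way.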
Suppose now we continue to visit the levers iteratively as in (1), and that the observed rewards continue to fit the pattern established by (2). Is there a limiting expected reward $q_*(a)$ for each action $a \in A$ as $t$ approaches infinity? Justify your answer.
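One way to start on this question, assuming the schedule A_t = ((t - 1) mod 6) + 1 and rewards R_t = 2 cos[π(t - 1)/6]: lever a is played exactly at times t = a + 6k for k = 0, 1, 2, ..., so the rewards seen at that lever can be written in closed form:

```latex
R_{a+6k} = 2\cos\!\left[\frac{\pi}{6}(a - 1) + \pi k\right]
         = (-1)^{k}\, 2\cos\!\left[\frac{\pi}{6}(a - 1)\right],
\qquad k = 0, 1, 2, \dots
```

Under this reading the rewards at any one lever alternate in sign with fixed magnitude, and the running average after n visits is 2 cos[π(a - 1)/6] / n for odd n and 0 for even n. Whether that behaviour amounts to a limiting expected reward is what the justification needs to address.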