HW7: Policy Search Solution


Description

  • Policy Gradient

In order to do policy gradient, we need to be able to compute the gradient of the value function $J$ with respect to a parameter vector $\theta$: $\nabla_\theta J(\theta)$. By our algebraic magic, we expressed this as:

$$\nabla_\theta J(\theta) = \sum_a \pi(s_0, a)\, R(a)\, \underbrace{\nabla_\theta \log\left(\pi(s_0, a)\right)}_{g(s_0, a)} \tag{1}$$
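As a concrete reading of Eq. (1), here is a minimal numpy sketch. The array names (`pi`, `R`, `grad_log_pi`) are our own hypothetical conventions, not anything from the assignment: `pi[a]` holds $\pi(s_0, a)$, `R[a]` the reward for action $a$, and `grad_log_pi[a]` the vector $g(s_0, a)$, whose closed form is derived below.

```python
import numpy as np

def policy_gradient(pi, R, grad_log_pi):
    """Eq. (1): grad_theta J = sum_a pi(s0,a) R(a) grad_theta log pi(s0,a).

    pi           : (n_actions,)            probabilities pi(s0, a)
    R            : (n_actions,)            reward R(a) per action
    grad_log_pi  : (n_actions, n_params)   g(s0, a) stored as rows
    """
    # Weight each action's score vector by its probability and reward,
    # then sum the weighted rows over actions.
    return (pi * R) @ grad_log_pi          # shape: (n_params,)
```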

If we use a linear function passed through a soft-max as our stochastic policy, we have:

$$\pi(s, a) = \frac{\exp\left(\sum_{i=1}^n \theta_i f_i(s, a)\right)}{\sum_{a'} \exp\left(\sum_{i=1}^n \theta_i f_i(s, a')\right)} \tag{2}$$
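A minimal sketch of Eq. (2), assuming (our choice, not the assignment's) that the features for state $s$ are stored as an `(n_actions, n_features)` array with `features[a, i]` $= f_i(s, a)$; subtracting the max score is only for numerical stability and does not change the result:

```python
import numpy as np

def softmax_policy(theta, features):
    """Eq. (2): pi(s, a) = exp(theta . f(s,a)) / sum_a' exp(theta . f(s,a'))."""
    scores = features @ theta          # sum_i theta_i f_i(s, a), one per action
    scores -= scores.max()             # stabilize the exponentials
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()   # normalize over actions a'
```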

Compute a closed-form solution for $g(s_0, a)$. Explain in a few sentences why this leads to a sensible update for gradient ascent (i.e., if we plug this in to Eq. (1) and do gradient ascent, why is the derived form reasonable)?

Substituting Eq. (2) into the definition of $g(s_0, a)$ and splitting the log of the quotient:

$$\nabla_\theta \log\left(\pi(s_0, a)\right) = \nabla_\theta \log\left(\frac{\exp\left(\sum_{i=1}^n \theta_i f_i(s_0, a)\right)}{\sum_{a'} \exp\left(\sum_{i=1}^n \theta_i f_i(s_0, a')\right)}\right)$$

$$= \nabla_\theta \left(\theta_1 f_1(s_0, a) + \dots + \theta_n f_n(s_0, a)\right) - \nabla_\theta \log\left(\sum_{a'} \exp\left(\theta_1 f_1(s_0, a') + \dots + \theta_n f_n(s_0, a')\right)\right)$$

The $j$-th component of the first term is just $f_j(s_0, a)$. For the second term, the chain rule gives

$$\frac{\partial}{\partial \theta_j} \log\left(\sum_{a'} \exp\left(\sum_{i=1}^n \theta_i f_i(s_0, a')\right)\right) = \frac{\sum_{a'} f_j(s_0, a')\, \exp\left(\sum_{i=1}^n \theta_i f_i(s_0, a')\right)}{\sum_{a'} \exp\left(\sum_{i=1}^n \theta_i f_i(s_0, a')\right)} = \sum_{a'} \pi(s_0, a')\, f_j(s_0, a'),$$

where the last step is just the definition of $\pi$ from Eq. (2). Stacking the $n$ components gives the closed form

$$g(s_0, a) = \begin{bmatrix} f_1(s_0, a) - \sum_{a'} \pi(s_0, a')\, f_1(s_0, a') \\ \vdots \\ f_n(s_0, a) - \sum_{a'} \pi(s_0, a')\, f_n(s_0, a') \end{bmatrix}$$

This gives us

$$\nabla_\theta J(\theta) = \sum_a \pi(s_0, a)\, R(a) \begin{bmatrix} f_1(s_0, a) - \sum_{a'} \pi(s_0, a')\, f_1(s_0, a') \\ \vdots \\ f_n(s_0, a) - \sum_{a'} \pi(s_0, a')\, f_n(s_0, a') \end{bmatrix}$$

That is, each weight $\theta_j$ is updated by the expected reward over all actions, with each action's contribution weighted by how much its $j$-th feature deviates from that feature's expected value under the current policy.
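Continuing the numpy sketch from above (reusing our hypothetical `softmax_policy` and the same `features` layout), the closed form for $g(s_0, a)$ is one broadcast subtraction:

```python
def grad_log_pi(theta, features):
    """g(s0, a) = f(s0, a) - sum_a' pi(s0, a') f(s0, a'), one row per action."""
    pi = softmax_policy(theta, features)
    expected_f = pi @ features      # E_{a' ~ pi}[f(s0, a')], shape (n_params,)
    return features - expected_f    # broadcasts over the action rows
```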


This is a sensible update because adding the resulting vector to the current parameters moves $J$ toward more positive rewards: for a rewarded action, every feature whose value exceeds its expectation under the current policy gets its weight increased, making that action more probable. Additionally, features with a larger gap between observed and expected value receive proportionally larger updates, so the most informative weights move fastest toward their convergence.
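To see the update behave as claimed, here is a toy run with made-up numbers (three actions, two features; the step size `alpha = 0.1` and iteration count are arbitrary choices), tying together the sketches above:

```python
theta = np.zeros(2)
features = np.array([[1.0, 0.0],     # f(s0, a=0)
                     [0.0, 1.0],     # f(s0, a=1)
                     [0.5, 0.5]])    # f(s0, a=2)
R = np.array([1.0, 0.0, 0.0])        # only action 0 is rewarded
alpha = 0.1

for _ in range(200):                 # plain gradient ascent on J
    pi = softmax_policy(theta, features)
    theta += alpha * policy_gradient(pi, R, grad_log_pi(theta, features))

print(softmax_policy(theta, features))   # pi(s0, a=0) has grown toward 1
```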
