Artificial Intelligence Homework #3 Solution

$24.99 $18.99




Description


Problem 1 (Markov Decision Processes) – 6 Points: Annie is a 5-year-old girl who loves eating candy and is ambivalent regarding vegetables. She can either choose to eat candy (Hershey’s, Skittles, Peanut Butter Cups) or eat vegetables during every meal. Eating candy gives her +10 happiness points, while eating vegetables only gives her +4 happiness points. But if she eats too much candy while sick, her teeth will all fall out (she won’t be able to eat any more). Annie will be in one of three states: healthy, sick, and toothless. Eating candy tends to make Annie sick, while eating vegetables tends to keep Annie healthy. If she eats too much candy, she’ll be toothless and won’t eat anything else. The transitions are shown in the table below.

Health condition    Candy or Vegetables?    Next condition    Probability
healthy             vegetables              healthy           1
healthy             candy                   healthy           1/4
healthy             candy                   sick              3/4
sick                vegetables              healthy           1/4
sick                vegetables              sick              3/4
sick                candy                   sick              7/8
sick                candy                   toothless         1/8
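For concreteness, the transition table can be written out as a small transition model. The following Python sketch uses state and action names chosen here for illustration (the problem does not fix a particular encoding):

```python
# Transition model for Annie's MDP, read straight from the table above.
# T maps (state, action) -> {next_state: probability}; R maps action -> reward.
# "toothless" is absorbing: Annie takes no further actions there.
T = {
    ("healthy", "vegetables"): {"healthy": 1.0},
    ("healthy", "candy"):      {"healthy": 1 / 4, "sick": 3 / 4},
    ("sick", "vegetables"):    {"healthy": 1 / 4, "sick": 3 / 4},
    ("sick", "candy"):         {"sick": 7 / 8, "toothless": 1 / 8},
}
R = {"candy": 10, "vegetables": 4}  # happiness points per meal

# Sanity check: each (state, action) row of the table is a distribution.
for (s, a), dist in T.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-9, (s, a)
```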

  1. (1 Point) Model this problem as a Markov Decision Process: formally specify each state, each action, the transition function T(s, a, s′), and the reward function R(a).

  2. (1 Point) Write down the value function V(s) for this problem in all possible states under the following policies: π1, in which Annie always eats candy, and π2, in which Annie always eats vegetables. The discount factor can be expressed as γ.

  3. (1 Point) Start with a policy in which Annie always eats candy no matter what her health condition is. Simulate the first two iterations of the policy iteration algorithm. Show how the policy evolves as you run the algorithm. What is the policy after the third iteration? Set γ = 0.9.

  4. (3 Points) Which of the following five statements are true for an MDP? Select all that apply and briefly explain why.

    1. If one is using value iteration and the values have converged, the policy must have converged as well.

    2. Expectimax will generally run in the same amount of time as value iteration on a given MDP.

    3. For an infinite-horizon MDP with a finite number of states and actions and with a discount factor that satisfies 0 < γ <= 1, policy iteration is guaranteed to converge.

    4. There may be more than one optimal value function.

    5. There may be more than one optimal policy.

  5. (1 Point) Initialize Q to 0 for all (state, action) pairs for Q-learning. With α = 0.5, compute Q after seeing: [(S1,A1,S2) -> (S2,A2,S4)] [(S1,A2,S3) -> (S3,A1,S5)] [(S1,A2,S2) -> (S2,A1,S4)]
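As a sanity check on the policy iteration question, here is a minimal sketch in Python with γ = 0.9, using the transition table and rewards from the problem. It treats toothless as a terminal state with value 0 (an assumption, since the problem leaves the terminal value implicit), and the function names are illustrative:

```python
# Policy iteration sketch for the always-candy starting policy (gamma = 0.9).
# Assumption: "toothless" is terminal and worth 0.
GAMMA = 0.9
STATES = ["healthy", "sick"]            # non-terminal states
ACTIONS = ["candy", "vegetables"]
T = {
    ("healthy", "vegetables"): {"healthy": 1.0},
    ("healthy", "candy"):      {"healthy": 1 / 4, "sick": 3 / 4},
    ("sick", "vegetables"):    {"healthy": 1 / 4, "sick": 3 / 4},
    ("sick", "candy"):         {"sick": 7 / 8, "toothless": 1 / 8},
}
R = {"candy": 10, "vegetables": 4}

def q_value(s, a, V):
    # One-step lookahead: R(a) + gamma * sum over s' of T(s, a, s') * V(s').
    return R[a] + GAMMA * sum(p * V.get(s2, 0.0) for s2, p in T[(s, a)].items())

def policy_iteration(pi):
    while True:
        # Evaluate pi by repeatedly applying its Bellman equation.
        V = {s: 0.0 for s in STATES}
        for _ in range(1000):
            V = {s: q_value(s, pi[s], V) for s in STATES}
        # Greedy improvement step.
        new_pi = {s: max(ACTIONS, key=lambda a: q_value(s, a, V)) for s in STATES}
        if new_pi == pi:
            return pi, V
        pi = new_pi

# Start, as the problem asks, from "always eat candy".
pi, V = policy_iteration({"healthy": "candy", "sick": "candy"})
```

Under these assumptions the sketch settles on candy when healthy and vegetables when sick, which matches the intuition that eating candy while sick risks Annie's teeth.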

Problem 3 (Beam Search) – 2 Points: Ant colony optimization is an optimization technique inspired by the foraging behavior of real ant colonies. When searching for food, ants initially explore the area surrounding their nest in a random manner. Ants communicate indirectly by means of chemical pheromone trails: while moving, they leave a trail of pheromone behind them, and once an ant finds food, it varies the amount of pheromone it deposits depending on the quality and quantity of the food. This indirect communication via pheromone trails enables the colony to find the shortest paths between the nest and the food sources.

Describe (in words) how the ant colony optimization problem can be modeled through a local beam search algorithm. Indicate what k stands for and how the ants find the k best successors in each iteration.
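One way to make the analogy concrete is a generic local beam search sketch, where k plays the role of the number of ants; `successors` and `score` are placeholder hooks (in the ant analogy, `score` would be the pheromone-weighted quality of a trail). All names here are illustrative, not from the problem statement:

```python
import heapq

def local_beam_search(initial_states, successors, score, iters=100):
    """Local beam search: keep only the k best candidates each round.

    k is the beam width (the number of "ants"); `successors` generates the
    states reachable from a state, and `score` rates a state.
    """
    beam = list(initial_states)
    k = len(beam)                      # k = number of ants kept each round
    for _ in range(iters):
        # Pool the current beam with every successor, then keep the k best.
        pool = beam + [s2 for s in beam for s2 in successors(s)]
        beam = heapq.nlargest(k, pool, key=score)
    return max(beam, key=score)

# Toy usage: two "ants" climb toward the maximum of -(x - 7)^2.
best = local_beam_search(
    initial_states=[0, 20],
    successors=lambda x: [x - 1, x + 1],
    score=lambda x: -(x - 7) ** 2,
)
```

Pooling the current beam with its successors keeps the best state found so far from being discarded, so the beam converges instead of oscillating around the optimum.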

Solution 3:

