Reinforcement Learning (Azhar)
- 1. Reinforcement Learning. Azhar Aulia Saputra, October 20th, 2018. Adapted from Lecture 14 of Stanford CS231n by Fei-Fei Li, Justin Johnson & Serena Yeung.
- 2. So far… Supervised Learning. Data: (x, y), where x is data and y is the label. Goal: learn a function to map x -> y. Examples: classification, regression, object detection, semantic segmentation, image captioning, etc. (Figure: cat classification; image is CC0 public domain.)
- 3. So far… Unsupervised Learning. Data: x; just data, no labels! Goal: learn some underlying hidden structure of the data. Examples: clustering, dimensionality reduction, feature learning, density estimation, etc. (Figures: 1-d and 2-d density estimation; images are CC0 public domain.)
- 4. Today: Reinforcement Learning. Problems involving an agent interacting with an environment, which provides numeric reward signals. Goal: learn how to take actions in order to maximize reward.
- 5. Overview: What is Reinforcement Learning?; Markov Decision Processes; Q-Learning; Policy Gradients.
- 6. Reinforcement Learning: Agent and Environment.
- 7. Reinforcement Learning: the Environment provides a state st to the Agent.
- 8. Reinforcement Learning: the Agent responds with an action at.
- 9. Reinforcement Learning: the Environment returns a reward rt.
- 10. Reinforcement Learning: the Environment returns the reward rt and the next state st+1.
- 11. Cart-Pole Problem. Objective: balance a pole on top of a movable cart. State: angle, angular speed, position, horizontal velocity. Action: horizontal force applied to the cart. Reward: 1 at each time step if the pole is upright. (Image is CC0 public domain.)
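The cart-pole task is a standard RL benchmark. As a rough sketch of the loop described above (assuming the third-party Gymnasium package and its CartPole-v1 environment, neither of which is mentioned in the slides), a single random-action episode looks like this:

```python
import gymnasium as gym

# CartPole-v1: observation = [cart position, cart velocity, pole angle, pole angular velocity],
# action = push the cart left (0) or right (1), reward = +1 for every step the pole stays up.
env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)

total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()   # random agent; an RL agent would choose the action here
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    done = terminated or truncated

print("Episode return:", total_reward)   # typically a few dozen steps for a random policy
env.close()
```

A learned policy would replace the `sample()` call with an action chosen from the current observation.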
- 12. Robot Locomotion. Objective: make the robot move forward. State: angle and position of the joints. Action: torques applied to the joints. Reward: 1 at each time step for being upright + forward movement.
- 13. Atari Games. Objective: complete the game with the highest score. State: raw pixel inputs of the game state. Action: game controls, e.g. left, right, up, down. Reward: score increase/decrease at each time step.
- 14. Go. Objective: win the game! State: positions of all pieces. Action: where to put the next piece down. Reward: 1 if win at the end of the game, 0 otherwise. (Image is CC0 public domain.)
- 15. How can we mathematically formalize the RL problem? (Diagram: agent-environment loop with state st, action at, reward rt, next state st+1.)
- 16. Markov Decision Process. A mathematical formulation of the RL problem. Markov property: the current state completely characterises the state of the world. Defined by the tuple (S, A, R, P, γ): S, set of possible states; A, set of possible actions; R, distribution of reward given a (state, action) pair; P, transition probability, i.e. distribution over the next state given a (state, action) pair; γ, discount factor.
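To make the (S, A, R, P, γ) tuple concrete, a finite MDP can be stored directly as arrays. The sketch below is purely illustrative (the class name `MDP` and its fields are assumptions, not anything from the lecture):

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class MDP:
    """A finite Markov Decision Process defined by (S, A, R, P, gamma)."""
    n_states: int        # |S|: number of states
    n_actions: int       # |A|: number of actions
    R: np.ndarray        # R[s, a]: expected reward for taking action a in state s
    P: np.ndarray        # P[s, a, s']: probability of moving to state s'
    gamma: float = 0.99  # discount factor

    def step(self, s: int, a: int, rng: np.random.Generator):
        """Sample (reward, next state) for a (state, action) pair."""
        s_next = rng.choice(self.n_states, p=self.P[s, a])
        return self.R[s, a], int(s_next)
```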
- 17. Markov Decision Process. At time step t=0, the environment samples an initial state s0 ~ p(s0). Then, for t=0 until done: the agent selects an action at; the environment samples a reward rt ~ R(· | st, at); the environment samples the next state st+1 ~ P(· | st, at); the agent receives reward rt and next state st+1. A policy π is a function from S to A that specifies what action to take in each state. Objective: find the policy π* that maximizes the cumulative discounted reward Σt≥0 γ^t rt.
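Reusing the hypothetical `MDP` sketch above, the interaction loop and the cumulative discounted reward Σt≥0 γ^t rt can be written as a short rollout (again an illustrative sketch, not the lecture's code):

```python
def rollout(mdp: MDP, policy, s0: int, horizon: int = 100, seed: int = 0) -> float:
    """Run one episode from s0 following `policy` (a function state -> action)
    and return the cumulative discounted reward sum_t gamma^t * r_t."""
    rng = np.random.default_rng(seed)
    s, ret, discount = s0, 0.0, 1.0
    for t in range(horizon):              # truncate at a fixed horizon for simplicity
        a = policy(s)                     # agent selects action a_t
        r, s = mdp.step(s, a, rng)        # environment samples r_t and s_{t+1}
        ret += discount * r               # accumulate gamma^t * r_t
        discount *= mdp.gamma
    return ret
```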
- 18. A simple MDP: Grid World. Actions = {right, left, up, down}; states are the cells of the grid, with ★ marking the terminal states. Set a negative "reward" for each transition (e.g. r = -1). Objective: reach one of the terminal states (greyed out) in the least number of actions.
- 19. A simple MDP: Grid World. (Figure: a random policy vs. the optimal policy on the grid, terminal states marked ★.)
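A minimal, hypothetical implementation of this Grid World (names such as `GridWorld` and `episode_return` are illustrative) makes the comparison concrete: every move costs -1, so a policy that heads straight for a terminal cell earns a higher (less negative) return than a random one.

```python
import random

class GridWorld:
    """n x n grid; terminal states in two opposite corners; reward -1 per move."""
    def __init__(self, n: int = 4):
        self.n = n
        self.terminals = {(0, 0), (n - 1, n - 1)}

    def step(self, state, action):
        moves = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
        dr, dc = moves[action]
        r = min(max(state[0] + dr, 0), self.n - 1)   # clip at the walls
        c = min(max(state[1] + dc, 0), self.n - 1)
        nxt = (r, c)
        return nxt, -1, nxt in self.terminals        # (next state, reward, done)

def episode_return(env, policy, start=(2, 1), max_steps=50):
    """Total (undiscounted) reward of one episode; fewer moves = higher return."""
    state, total = start, 0
    for _ in range(max_steps):
        state, reward, done = env.step(state, policy(state))
        total += reward
        if done:
            break
    return total

env = GridWorld()

def random_policy(state):
    return random.choice(["up", "down", "left", "right"])

def greedy_policy(state):
    # Head toward the nearest terminal corner (a hand-written "optimal" policy).
    if state[0] + state[1] <= env.n - 1:
        return "up" if state[0] > 0 else "left"
    return "down" if state[0] < env.n - 1 else "right"

print(episode_return(env, random_policy), episode_return(env, greedy_policy))
```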
- 20. The optimal policy π*. We want to find the optimal policy π* that maximizes the sum of rewards. How do we handle the randomness (initial state, transition probability, …)?
- 21. The optimal policy π*. We want to find the optimal policy π* that maximizes the sum of rewards. How do we handle the randomness (initial state, transition probability, …)? Maximize the expected sum of rewards! Formally: π* = argmaxπ E[Σt≥0 γ^t rt | π], with s0 ~ p(s0), at ~ π(· | st), st+1 ~ p(· | st, at).
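In code, that expectation can be approximated by averaging the returns of many sampled episodes. A Monte Carlo sketch, reusing the hypothetical `MDP` and `rollout` helpers from above:

```python
def expected_return(mdp: MDP, policy, s0_dist: np.ndarray, n_episodes: int = 1000) -> float:
    """Monte Carlo estimate of E[sum_t gamma^t r_t] under `policy`:
    sample s0 ~ p(s0), roll out one trajectory, and average the returns."""
    rng = np.random.default_rng(0)
    total = 0.0
    for i in range(n_episodes):
        s0 = int(rng.choice(mdp.n_states, p=s0_dist))   # sample the initial state
        total += rollout(mdp, policy, s0, seed=i)       # one sampled trajectory
    return total / n_episodes
```

Searching for π* then amounts to searching over policies for the one with the largest estimated expected return.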
- 22. Definitions: value function and Q-value function. Following a policy produces sample trajectories (or paths) s0, a0, r0, s1, a1, r1, …
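For reference, the two quantities named in the slide title are defined (in the standard formulation this lecture follows) as expected cumulative discounted rewards along such trajectories:

Vπ(s) = E[ Σt≥0 γ^t rt | s0 = s, π ]  (value function: how good is a state?)
Qπ(s, a) = E[ Σt≥0 γ^t rt | s0 = s, a0 = a, π ]  (Q-value function: how good is a state-action pair?)

Q-Learning, listed in the overview above, works directly with the Q-value function.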
- 23. Definitions: