Q-Learning


R. Sutton and A. Barto explained reinforcement learning in very simple terms [1]:

“Reinforcement learning is learning what to do - how to map situations to actions - so as to maximize a numerical reward signal. The learner is not told which actions to take, as in most forms of machine learning, but instead must discover which actions yield the most reward by trying them. In the most interesting and challenging cases, actions may affect not only the immediate reward but also the next situation and, through that, all subsequent rewards. These two characteristics - trial-and-error search and delayed reward - are the two most important distinguishing features of reinforcement learning.”

Q-learning, a reinforcement learning technique, learns an action-value function. This function gives the expected utility of taking an action in a given state and following a fixed policy afterward. In simpler terms, an agent using Q-learning learns which action it should take in each state of the environment. This mapping can be viewed as a table, called a Q-table, with one row per state and one column per action the agent can perform in its environment. The value in each cell of the Q-table signifies how favorable an action is given that the agent is in a particular state. The agent therefore selects the best known action for its current state: $\arg\max_a Q(s, a)$.
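As a minimal sketch of this idea in Python (the state and action counts below are invented for illustration), the Q-table can be stored as a 2-D array and the greedy action read off with arg max:

```python
import numpy as np

n_states, n_actions = 6, 4            # e.g. 6 grid cells; actions: up, down, left, right
Q = np.zeros((n_states, n_actions))  # Q-table: rows are states, columns are actions

def greedy_action(state):
    """Best known action for this state: arg max_a Q(s, a)."""
    return int(np.argmax(Q[state]))
```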

Every action taken by the agent affects the environment, which may result in a change of the agent's current state. Based on its action, the agent receives a reward (a real number) or a punishment (a negative reward). The agent uses these rewards to learn. The goal of the agent is to maximize the total reward it accumulates, by learning which actions are optimal in each state. Hence, the function which computes the quality of a state-action combination is given by:

$Q : S \times A \to \mathbb{R}$

Initially, random values are set in the Q-table. Thereafter, each time the agent takes an action, it receives a reward, which in turn is used to update the values in the Q-table. The formula for updating the Q-table is:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]$$

where $\alpha \in (0, 1]$ is the learning rate and $\gamma \in [0, 1]$ is the discount factor that weights future rewards.
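Continuing the Python sketch above, a minimal version of this update might look as follows (the values of alpha, gamma, and epsilon are hypothetical hyperparameters chosen for illustration):

```python
import random

alpha   = 0.1   # learning rate: how far each update moves Q(s, a)
gamma   = 0.9   # discount factor: weight given to future rewards
epsilon = 0.1   # exploration rate for epsilon-greedy action selection

def q_update(state, action, reward, next_state):
    """One Q-learning step: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    target = reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (target - Q[state, action])

def choose_action(state):
    """Epsilon-greedy: usually exploit the table, occasionally explore."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return greedy_action(state)
```

In a full training loop the agent would repeatedly call choose_action, apply the action to the environment, observe the reward and next state, and call q_update; the exploration in choose_action is what lets the agent discover actions better than its current best guess.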

The major advantages of Q-learning are that it is simple and that it supports online learning: the Q-table can be updated after every single interaction with the environment.

References

[1] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. The MIT Press, Cambridge, Massachusetts.
