Geniferology: 2015

Sunday, May 17, 2015

What is reinforcement learning?

[This is the English version translated from Chinese.]

Reinforcement learning is a branch of machine learning that is particularly suitable for controlling an autonomous agent who interacts with an environment. It uses sensory perception and rewards to continually modify its behavior.

When you hear reinforcement learning, it should invoke an imagery in your mind of a little critter such as this cockroach:

As for the idea of "environment", you should think of a maze such as in this classic game:

It includes some monsters chasing after you, some food that you could eat to increase your score (these represent negative and positive rewards). Of course, in real applications the "environment" and "rewards" can be abstract; This game is just a concrete example.

What goes in and out?

Keep in mind, the inputs to the reinforcement learning algorithm are the following:

States = the environment
For example: each square in the maze is a state
Actions = under each state, which actions are allowed?
Rewards = upon visiting each new state, there is an associated positive or negative utility

And the output of the algorithm is:

Policy = under each state, which action would you choose?

So this 4-tuple (S, A, R, P) constitutes a reinforcement learning system. In abstract algebra, we often use such tuples to define systems or structures.

A more concrete example is:

states S = each square of the maze, which can be represented by its coordinates, eg (1,3)
actions A = on each square of the maze, you can go either { up, down, left, right }
rewards R = under the current state, the square of the maze may contain food (+1) or a monster (-100)
policy P = a function from states $\rightarrow$ actions, ie, given any state, it returns with an action.

(S, A, R) are defined by the user, P would be automatically calculated by the algorithm.

Between humans and insects

The first question that comes to mind: why not just use this technique to build strong AI? But the current state of the art of reinforcement learning can only handle relatively small and simple environments, it is still rather helpless when faced with larger and more complex worlds, such as the $10^\mbox{xxx}$ state space of chess.

The key lies in the fact that, higher intelligent beings build up knowledge or world models in their brains, whereas naive reinforcement learning is only concerned with state-action pairs.

The leading researcher in reinforcement learning, Richard Sutton, thinks that RL is the only learning technique that takes into consideration elements such as autonomous agents, environments, and rewards, therefore it should be the top-level architecture in an AI system, where other modules such as logic and pattern recognition should subsume under its control. I think this makes a lot of sense.

Thus to build a strong AI, a possible scheme is to combine reinforcement learning with a certain ability to handle complex world models:

"You have made your way from worm to man, and much in you is still worm." -- Nietzsche, Thus spoke Zarathustra

"If men cease to believe that they will one day become gods then they will surely become worms." -- Henry Miller

Code

Learning AI is boring without programs. Here is a simple demo that I found on the web, the author is Travis DeWolf:

https://studywolf.wordpress.com/2012/11/25/reinforcement-learning-q-learning-and-exploration/

It only requires Python to run, but you may need to install PyGame first.

Cat, mouse, cheese:

The cat's behavior is simply to move towards the mouse (this is without intelligence), the mouse's behavior is learned by the algorithm.

Note that, in the main program and in cellular.py, the behavior of the maze world is defined, but this is purely a game program that has no intelligence in it. You can use { up, down, left, right } to control the agents' movements, and that's all of it.

The reinforcement learning algorithm is in qlearn.py, which is very short, and the actual learning code is essentially just one line, that is:

def learnQ(self, state, action, reward, value):

oldv = self.q.get((state, action), None)

if oldv is None:

self.q[(state, action)] = reward

else:

self.q[(state, action)] = oldv + self.alpha * (value - oldv)

Just this one line of code, is able to make the mouse avoid the cat, and eat the cheese. It will be explained below...

How does RL work?

Chapter 21 of《AI: a modern approach》has a very good introduction to RL. 《AIMA》is of course the classic AI textbook, many people say that they love AI after studying this book. The strength of this book is that it uses plenty of words to explain all concepts and principles patiently, so the reader would not feel disorganized. For example, Chapter 21 first explains passive reinforcement learning, which means to hold the policy fixed, and then simply calculate the agent's expected utility (ie the total rewards). Having established this foundation then we compare different policies. This way of thinking is very common in mathematics: First consider a case so simple that any idiot can solve, and then gradually introduce more complexity. For example in mathematical induction, to move from case N=1 to N→∞.

To avoid repeating the book, I will just explain the minimum knowledge required to understand Q learning.

Utility

U is the sum of rewards after a sequence of actions. For example, the utility of playing one chess move is not just the immediate reward of that move, but also includes the consequences after that move. For example, a player may greedily eat a pawn, but then got checkmated 10 moves later. Or, faced with delicious food, some people may choose not to eat it, for fear of getting fat.

The utility of a state is: assuming the policy is fixed, and considering all possible future transitions starting from this state, the average expectation of the total reward:

$$ U(S_0) = \mathbb{E}[ \; \sum_{t=0}^{\infty} \; \gamma^t \; R(S_t) \; ] $$

where $\mathbb{E[\;]}$ is expectation value, $\gamma$ is the discount factor, such as 0.9 or something.

Real example: consider a maze like this one:

The arrows represent one policy out of many possible policies.

According to this policy, starting from (1,1) we may get this outcome:

The orange number below is the reward for each state. In this example, every square that is not the goal, will have 0.04 deducted from the score.

However, even starting from the same state, using the exact same policy, can give different outcomes. For example, when the creature tries to crawl left from (1,3), the actual outcome may be falling one square downwards. Such state transitions are dictated by probabilities from the external world. (For example, someone gets a university degree, but the economy falls into recession, his actual salary may not be as high as he has expected).

Another outcome of the same policy could be:

Or this:

Bellman condition

The Bellman optimality condition is the central idea in dynamic programming.

The artificial intelligence community calls it reinforcement learning, but in control theory it is called dynamic programming; the two are synonymous. Richard Bellman in 1953 proposed this equation, while he was working at RAND corporation, dealing with operations research problems. He also coined the phrase "curse of dimensionality" to describe the major obstacle in dynamic programming.

The problem we try to solve is: to make a series of sequential decisions.

The Bellman equation says: "if we cut off a tiny bit from the endpoint of the optimal path, the remaining path is still an optimal path between the new endpoints."

In other words, if we have made a sequence of optimal decisions A B C D E…… then this sequence, with the endpoint A removed, = B C D E …… is still an optimal decision sequence when applied to the rest of the states.

(For example, you plan to drive from Los Angeles to New York:
$$ \mbox{Los Angeles} \rightarrow \mbox{Las Vegas} \rightarrow ... ... \rightarrow \mbox{New York} $$
and you have chosen the cheapest route, which passes through 6 stops, the 2nd stop being Las Vegas. Now if you remove the starting point Los Angeles, then the remaining route:
$$ \mbox{Las Vegas} \rightarrow ... ... \rightarrow \mbox{New York} $$
would still be the cheapest route for the remaining stops.)

Mathematically:
$$ U^*(S) = \max_a \{ R(a) + U^*(S') \} $$

$$ U^*(\mbox{whole path}) = \max_a \{ R(\mbox{choose action $a$ in current state}) + U^*(\mbox{rest of path}) \} $$

$*$ denotes "optimal". This seemingly simple formula is the entire content of dynamic programming; What it means is that: When seeking the path with the best value, we cut off a bit from the path, thus reducing the problem to a smaller problem; In other words, it is a recursive relation over time.

Delta rule

This is just a simple trick, appearing often in machine learning. Suppose we have an ideal value, we want to gradually adjust the current value so that it asymptotically approaches this ideal. What we do is:

$$ \mbox{current value} := \mbox{current value} + \alpha ( \mbox{ideal value} - \mbox{current value}) $$

where $\alpha$ is called the "learning rate". "Delta" ($\Delta$) referes to the difference between the ideal and current values.

Obviously, as we repeatedly perform the above step, the current value will get closer and closer to the ideal value.

(The differential form of the Delta rule is the familiar gradient descent: $x \mbox{ += } \eta \cdot \frac{dy}{dx}$）

Temporal difference (TD) learning

Now we apply the Delta rule on the Bellman equation to find the optimal path; This is called temporal difference learning.

Again we start with the simple case: assume policy is fixed, the aim is to learn the utility of all states.

Ideally, the value of U(S) can be obtained by: starting from state S, trying all possible state transitions, summing the total rewards along these paths, and finding the average total reward among all paths.

But in reality, our agent can only experience one state transition after each action.

So we need to apply the Bellman equation: The U value of a state S is equal to: its own reward, added with the U values of all possible subsequent states, taking the probabilistic average (ie expectation value), and multiplying with the discount factor $\gamma$:
$$ U(S) = R(S) + \gamma \sum_{S'} P(S \rightarrow S') \; U(S') $$
where P is the transition probability, S' is the subsequent state, $\sum$ is the summation over all subsequent states. In other words, this is a relation between the ideal U(S) and U(subsequents of S), as a recursive relation.

For example, assume the agent's estimations of state (1,3) and state (2,3) are respectively 0.84 and 0.92. Also the agent observes that under the current policy, the transition from (1,3) to (2,3) always occurs. Then the U values of these 2 states should satisfy this constraint:
$$ U(1,3) = -0.04 + U(2,3) \;.$$
In other words, this is the local constraint on the U values between 2 states.

The idea of TD learning: assume other U(S') estimates are correct, use the Bellman optimality constraint to adjust the U(S) of the current state. As the number of trials grows large, all the U estimates will approach ideal. The agent only needs to use this update rule:
$$ U(S) \mbox{ += } \alpha ( \; R(S) + \gamma U(S') - U(S) \; ) $$
$\alpha$ is the learning rate, it determines the speed of learning (but it cannot be too large, lest overshooting may occur). The stuff after $\alpha$ is the difference between the current estimated U(S) and the U(S) estimated using ideal constraint. For the ideal U(S) and U(S'), this difference would be 0. Whereas for each time step, we just use $\alpha$ to partially adjust this difference.

Lastly I should mention that, in the above equation of the ideal constraint, there is a summation over all probabilities P, but the P disappeared in the update formula. That is because the agent is acting in the environment, and this implicitly performs a sampling of the state transition probabilities. In other words, the summation is performed by the agent itself.

P is the probabilities of state transitions, in other words a model of the world. TD learning does not need to learn P, that is why it is called model-free learning. But just as we have said in the beginning, model-free is not necessarily a good thing. Human intelligence lies in the fact that we do have some pretty good models of the external world.

Q value

Q value is just a variation of U value; there is a U value for each state, and Q is the decomposition of U by all the actions in that state. In other words, Q is the utility of doing action A in state S.

What is Q good for? Below we will introduce active learning, and Q values, when combined with TD learning, can eliminate P in active learning, thus achieving the model-free effect.

The relation between Q and U is:

$$ U(S) = \max_A Q(A, S) $$

The above update rule needs only be revised with this relation:
$$ U(S) \mbox{ += } \alpha ( \; R(S) + \gamma \max_{A'} Q(A', S') - Q(A, S) \; ) $$

Active learning

In passive learning, with the policy fixed, we can already calculate the utility U(S) for each state S, or the utility Q(S,A) of doing action A under each state S.

If the policy can change, we just need to calculate the Q values for different policies, then choose the action A corresponding to the maximum Q value in each state, that would be the best policy, right?

When we actually execute the above, we find that the agent's policies are really bad! The reason being that, during the learning process, the Q values are estimates rather than ideals, and if we act according to such Q's, the agent becomes very short-sighted, and fails to find the optimal policy. (For example, a person always goes to the same restaurant, but if he walks a different path, he could find a better restaurant.)

The agent needs to try some unknown states or actions in order to learn the optimal policy; This is the so-called exploration vs exploitation trade-off.

The way to do this, is to artificially increase the utilities of unknown states:
$$ U(S) = R(S) + \gamma \max_A \mathcal{F}[ \; \sum_{S'} P(S \rightarrow S') U(S'), \; N(A, S) \; ] $$
where N(A, S) is the number of times the combination of state S and action A has occurred (= has been experienced), $\mathcal{F}$ is the exploration function, it normally returns the estimate of U, but when N is small (that means we have little experience of S, A), it will return a relatively larger estimate, which represents the utility of "curiosity".

Conclusion

I started trying to write an introduction to RL that everyone can understand, but found out that I could barely explain all the things after such a great length. Hope my girl friend can read and understand this :)

Thursday, April 16, 2015

什么是 kNN 算法？（What is k-nearest-neighbor?）

学习 machine learning 的最低要求是什么？我发觉要求可以很低，甚至初中程度已经可以。首先要学习一点 Python 编程，譬如这两本小孩子用的书：【1】【2】便可。数学方面，只需要知道「两点间距离」的公式（中学的座标几何会读到）。

这本书第二章介绍 kNN 算法，包括 Python 程序：

其他章节的数学要求可能不同，但我目的是想说明，很多实用的人工智能的原理，其实也很简单的。

kNN 是什么？ For example:

开始时，所有 data points 的 labels （颜色）已经是知道的。

kNN 要解决的问题是：图中那「？」点的 label 应该是什么颜色？

用肉眼直观来看，那「？」的位置，是在蓝色点密集的区域，所以最「贴切」的标签应该是蓝色。

kNN 的算法就是：

在已知的 data points 中，逐一点检视（把這每一點叫作 P）：

首先计算「？」和 P 之间的距离

所有距离计算之后，将他们由小至大 sort 好
从 sort 好的序列，取最前的 k 个（即距离最接近「？」的 k 个点子）
对这 k 个点，读出他们的 label（颜色）是什么，这是问题中已经知道的
所有这些 labels（颜色），哪个出现最多？（亦即是说，最接近「？」的 k 个点子，它们最普遍是什么颜色？）
这出现次数最多的颜色，就是答案

例如，使用者要求 k = 5 时：

最接近「？」的 5 个点，就是以「？」为中心的虚线圆圈内的 5 个点。它们的颜色顺序依次为：[ ⬤，⬤，⬤，⬤，⬤ ] 。出现的次数是 4⬤ 1⬤ ，所以最高次数的颜色是 ⬤ 。

如何用 Python 写 kNN ?

1. 计算点与「？」之间距离

首先，回忆中学的坐标几何里，两点间距离的公式。假设那两点是 A 和 B，它们的座标分别是 $(x_A, y_A)$ 和 $(x_B, y_B)$，则：
$$ \mbox{distance D} = \sqrt{ (x_A - x_B)^2 + (y_A - y_B)^2 } $$
注意，在这公式里，用 $x_A - x_B$ 还是 $x_B - x_A$ 是没有分别的，因为那是正数和负数的分别，而在 $()^2$ 之后便没有分别。（换句话说，计算 A，B 之间的距离，和计算 B，A 之间的距离，是一样的。）

每个点子有 3 个坐标，分别是： x = 「坐飞机旅程」、y = 「吃雪糕的量」、z = 「玩电玩时间」，而 labels 是：「很喜欢」、「普通喜欢」、「不喜欢」。

因为坐标有 3 个，所以处理的空间是 3 度空间，但为了初学者方便，我们只考虑其中两个座标，所以局限在 2 度空间（平面上）。如果要 generalize 到 N-度空间，那需要用到 N-度空间中两点间距离的公式，留给读者作为练习 ☺

写程式时，首先要知道那些点子是如何储存於某个 variable 中。在书中的 "dating" 例子里，那些点子的坐标是 datingDataMat，而 labels 是储存在 datingLabels。

（妳可以在 Python 命令行，呼叫 file2matrix 函数去准备那些点子，然后试试印出 datingDataMat 和 datingLabel 这两个 variables 的内容。例如 datingDataMat[ : , 1 ] 可以印出所有点子的第二个坐标（即吃雪糕量）。那「：」的意思是，不指定指标的 begin 和 end，所以是对那个指标「全取」。）

我们用比较浅显的方法重写书中的 Python 函数（書裡的程式用了 vector 和 matrix 的表示法，比較簡潔，但較難懂）：

    def classify1(inP, dataSet, labels, k):
        N = len(dataSet)
        Ds = array([0] * N)
        for i in range(N):
             x2 = (inP[0] - dataSet[i][0])**2
             y2 = (inP[1] - dataSet[i][1])**2
             D = sqrt(x2 + y2)
             Ds[i] = D

第 1 句：定义我们的 function。 inP 是 input point 的意思。

第 2 句： N 是我们 dataSet 的 size，即总共有多少点子。

第 3 句：我们要计算距离 D，而且有 N 个这样的距离，所以要将结果储存在 array 里。但使用 array 之前，要先定义它，并填上 0（这叫初始化，initialize）。 Ds 这名字的意思是「很多D」（如英语中的 dogs = dog 的众数）。

第 4 句是 loop：对於每一点，我们用 i 这个 index「指着」它。 Index 是处理 array 的惯用做法，因为 array 容许妳读取任意位置的元素。

第 5、6 句：计算 $\Delta x^2$ 和 $\Delta y^2$ 的值，注意，因为我们储存 x, y 的方法是 [x, y] 这样的 list，所以 x 用指标 [0] 读取，y 用指标 [1] 读取。

第 7 句：计算 $D = \sqrt{ \Delta x^2 + \Delta y^2 }$。

第 8 句：将计算好的 D 放进 array Ds 里。

2. 将距离排序

很简单，一句：

    Dsorted = Ds.sort()

注意，這些程式段落要適當地 indent 好，不然 Python 會出錯的。这句仍屬於 classify1 函数。

3. 读出点的颜色

刚才说错了，因为 Ds 排序之后，再弄不清哪个点子对应於哪个 label，所以我们要用的是 "arg sort" (argument sort，即是用指标来排序)。

例如，设 a = array( [17, 38, 10, 49] )，
a.sort() 会给出 [10, 17, 38, 49]，
但 a.argsort() 会给出 [2, 0, 1, 3]，这些是 indices（指标）。
换句话说： argsort 的结果是 旧的 indices 的新的排法。

Python 程式：

    D_sorted = Ds.argsort()
    first_k = D_sorted[0:k]        # 提取前 k 个元素（但這句其實不需用到）

现在要找出这 k 个点子的 labels，我们可以创造一个新的 array 储存它们：

    first_k_labels = array([0]*k)  # 准备空的 array
    for i in range(0,k):
        first_k_labels[i] = labels[ D_sorted[ i ]]

最后那句要说明一下：如果写 labels[i] 那就是第 i 个元素的 label。但我们要的是排序好之后的元素的 label，所以我们先「look up」这个 D_sorted array，找出排序后的元素的指标，再用那指标「look up」那 labels array。（就像查字典，我们想查某个中文字的日语翻译，但我们只有汉英字典和英日字典，所以要两次 look up，出现了 array1[ array2[ i ] ] 这样的语法。这是很常见的。）

4. 哪个颜色出现最多？

用这个 loop 计算每个 label 出现的次数：

    like1 = 0              ＃ 标签是「不喜欢这男孩」 的点的个数
    like2 = 0              ＃ 标签是「普通喜欢这男孩」的点的个数
    like3 = 0              ＃ 标签是「非常喜欢这男孩」的点的个数
    for i in range(0, k):
        label = first_k_labels[i]
        if label == 1:
            like1 += 1
        elif label == 2:
            like2 += 1
        elif label == 3:
            like3 += 1

然后，找出出现次数最多那个 label：

    if like1 > like2 and like1 > like3:      # 如果 like1 出现最多
        best_label = 1                        # 答案是「不喜欢这男孩」
    elif like2 > like1 and like2 > like3:    ＃ 如果 like2 出现最多
        best_label = 2                        # 答案是「普通喜欢这男孩」
    else:                                    ＃ 如果 like3 出现最多
        best_label = 3                        # 答案是「非常喜欢这男孩」

5. 完工！

    return best_label

Testing

实际的 Python program，还要加上这几行 "header"：

# -*- coding: utf-8 -*-          # 加了這句可以用中文 comments

from numpy import array
from math import sqrt

运行结果：

这例子里，「？」的坐标是 [4.54e+04, 4.98e+00]，那是我看了数据后作出来的。　4.54e+04 是 scientific notation，即是 $4.54 \times 10^4$。

我把 k 增加到 600，才开始看到 like2 不是 0。

Wednesday, April 8, 2015

什么是强化学习？ (What is reinforcement learning?)

Reinforcement learning 是机器学习里面的一个分支，特别善於控制一只能够在某个环境下 自主行动 的个体 (autonomous agent)，透过和环境之间的互动，例如 sensory perception 和 rewards，而不断改进它的行为。

听到强化学习，你脑里应该浮现一只曱甴那样的小昆虫，那就是 autonomous agent 的形象：

对「环境」(environment) 这概念，你应该想到像以下这经典游戏的迷宫：

包括有追捕你的怪物、和吃了会加分的食物（这些代表负值和正值的 rewards）。当然，实际应用的「环境」和「奖励」可以是抽象的，这游戏是一个很具体的例子。

输入／输出

记住，reinforcement learning 的输入是：

状态 (States) = 环境，例如迷宫的每一格是一个 state
动作 (Actions) = 在每个状态下，有什么行动是容许的
奖励 (Rewards) = 进入每个状态时，能带来正面或负面的价值 (utility)

而输出就是：

方案 (Policy) = 在每个状态下，你会选择哪个行动？

於是这 4 个元素的 tuple （S，A，R，P）就构成了一个强化学习的系统。在抽象代数中我们常常用这 tuple 的方法去定义系统或结构。

再详细一点的例子就是：

states S = 迷宫中每一格的位置，可以用一对座标表示，例如(1,3)
actions A = 在迷宫中每一格，你可以行走的方向，例如：｛上，下，左，右｝
rewards R = 当前的状态 (current state) 之下，迷宫中的那格可能有食物 (+1) 、也可能有怪兽 (-100)
policy P = 一个由状态 $\rightarrow$ 行动的函数，意即：这函数对给定的每一个状态，都会给出一个行动。

（S, A, R）是使用者设定的， P 是算法自动计算出来的。

人与虫之间

第一个想到的问题是：为什么不用这个方法打造人工智能？但现时的强化学习算法，只对比较细小和简单的环境适用，对於大的复杂的世界，例如象棋的 $10^\mbox{xxx}$ 状态空间，仍是 intractable 的。

关键就是，高等智慧生物会在脑中建立世界的模型 (world model) 或知识 (knowledge)，而强化学习只是关心简单的「状态－行动」配对。

强化学习的领导研究者 Richard Sutton 认为，只有这种学习法才考虑到 自主个体、环境、奖励等因素，所以它是人工智能中最 top-level 的 architecture，而其他人工智能的子系统，例如 logic 或 pattern recognition，都应该在它的控制之下，我觉得颇合理。

所以要制造 strong AI，一个可能的方案就是结合强化学习和某种处理复杂 world model 的能力：

「你们已经由虫进化成人，但在你们之内大部份仍是虫。」 -- 尼采, Thus spoke Zarathustra

「如果人类不相信他们有一天会变成神，他们就肯定会变成虫。」 -- Henry Miller

程式

学 AI 最紧要有 program，不然就会很枯燥。这是我在网上找到的一个特别简单的 demo，作者是 Travis DeWolf：

https://studywolf.wordpress.com/2012/11/25/reinforcement-learning-q-learning-and-exploration/

只要 Python 便可运行，但你可能要 install PyGame。

猫、老鼠、芝士：

猫的行动是简单地朝着老鼠追（没有智能），老鼠的行动是学习出来的。

注意，在 main program 和 cellular.py 这两部分，纯粹是定义了迷宫世界如何运作，基本上是一个 game，里面完全没有智能，你可以用｛上、下、左、右｝控制各 agent 的活动，如此而已。

强化学习的程式在 qlearn.py，很短，而真正学习的程式基本上只有一句，就是：

def learnQ(self, state, action, reward, value):

oldv = self.q.get((state, action), None)

if oldv is None:

self.q[(state, action)] = reward

else:

self.q[(state, action)] = oldv + self.alpha * (value - oldv)

单是这一句程式，就能令老鼠学到避开猫、吃芝士。以下再解释……

強化學習的原理

《AI－a modern approach》这本书第 21 章有很好的简介。《AIMA》自然是经典，很多人说他们是读这本书而爱上 AI 的。这本书好处是，用文字很耐性地解释所有概念和原理，思路很清晰，使读者不致有杂乱无章的感觉。例如 21 章首先讲 passive reinforcement learning，意思是当 policy 是固定时，纯粹计算一下 agent 期望的价值（utility，即 rewards 的总和）会是多少。有了这基础后再比较不同 policies 的好坏。这种思路在数学中很常见：首先考虑简单到连白痴也可以解决的 case，然后逐步引入更多的复杂性。例如数学归纳法，由 N=1 的 case 推到 N→∞ 。

为免重复，我只解释到明白 Q learning 的最少知识。

Utility (价值，或效)

U 是一连串行动的 rewards 的总和。例如说，行一步棋的效用，不单是那步棋当前的利益，还包括走那步棋之后带来的后果。例如，当下贪吃一只卒，但 10 步后可能被将死。又或者，眼前有美味的食物，但有些人选择不吃，因为怕吃了会变肥。

一个 state 的效用 U 就是：假设方案固定，考虑到未来所有可能的 transitions，从这个 state 开始的平均期望的 total reward 是多少 :

$$ U(S_0) = \mathbb{E}[ \; \sum_{t=0}^{\infty} \; \gamma^t \; R(S_t) \; ] $$

其中 $\mathbb{E[\;]}$ 代表期望值，$\gamma$ 是 discount factor，例如 0.9 或什么。

实例：考虑这简单的迷宫：

那些箭咀表示的是众多可能方案中的其中一个。

根据这个方案，由（1，1）开始的运行可能是这个结果：

下面橙色的值是每个 state 的 reward。在这例子中，每个不是终点的格，也会扣 0.04 分。

但从同一起点，同一方案，也可以出现不同结果，例如在 (1,3) 企图向右爬，但实际结果是向下跌一格；这些 state transitions 是由外在世界的机率决定的。（例如某人读了大学文凭，但遇上经济不景，他的薪水未必能达到行动的预期效果。）

同一方案的运行结果可以是：

或者：

Bellman condition

这是 dynamic programming（动态规划）的中心思想，又叫 Bellman optimality condition。

在人工智能里我们叫 reinforcement learning，但在控制论的术语里叫 dynamic programming，两者其实是一样的。 Richard Bellman 在 1953 年提出这个方程，当时他在 RAND 公司工作，处理的是运筹学的问题。他也首先使用了 "curse of dimensionality" 这个术语，形容动态规划的主要障碍。

考虑的问题是：要做一连串的 sequential decisions。

Bellman equation 说的是：「如果从最佳选择的路径的末端截除一小部分，馀下的路径仍然是最佳路径。」

换句话说，如果一系列的选择 A B C D E…… 是最优的，那么这系列除去开始的 A，那 B C D E …… 系列应用在后继的状态上也是最优的。

（例如，你从香港乘车到北京，选择了最便宜的路线，此路线经过 10 个车站，第二站是深圳：
$$ \mbox{香港} \rightarrow \mbox{深圳} \rightarrow ... ... \rightarrow \mbox{北京} $$
但如果除去出发点香港站，那么由第二站深圳到最后的北京站：
$$ \mbox{深圳} \rightarrow ... ... \rightarrow \mbox{北京} $$
这路线仍然是馀下 9 个站之间最便宜的。）

用数学表示：
$$ U^*(S) = \max_a \{ R(a) + U^*(S') \} $$

$$ U^*(\mbox{全路径}) = \max_a \{ R(\mbox{在当前状态下选取 a}) + U^*(馀下路径) \} $$

$*$ 表示「最优 (optimal)」。这条看似简单的式子是动态规划的全部内容；它的意义是：我们想获得最佳效益的路径，所以将路径切短一些，於是问题化解成一个较小的问题；换句话说它是一个 recursive relation。

Delta rule

这只是一个简单的 trick，在机器学习中经常出现。假设我们有一个理想，我们要逐步调教当前的状态，令它慢慢趋近这个理想。方法是：

$$ \mbox{当前状态} := \mbox{当前状态} + \alpha ( \mbox{理想} - \mbox{当前状态}) $$

其中 $\alpha$ 叫「学习速度 (learning rate)」。 "Delta" ($\Delta$) 指的是理想和现状之间的差异。

很明显，只要反覆执行上式，状态就会逐渐逼近理想值。

（Delta rule 的微分形式就是我们熟悉的「梯度下降」： $x \mbox{ += } \eta \cdot \frac{dy}{dx}$）

Temporal difference (TD) learning

将 delta rule 应用到 Bellman condition 上，去寻找最优路径，这就是 temporal difference learning。

我们还是从简单情况开始：假设方案固定，目标是学习每个 state 的 utility。

理想的 U(S) 值，是要从 state S 开始，试验所有可能的 transitions，再计算这些路径的 total rewards 的平均值。

但实际上，agent 只能够每次体验一个行动之后的 state transition。

所以要应用 Bellman condition：一个 state S 的 U 值，是它自身的 reward，加上所有可能的后继 states 的 U 值，取其机率平均，再乘以 discount factor $\gamma$：
$$ U(S) = R(S) + \gamma \sum_{S'} P(S \rightarrow S') \; U(S') $$
其中 P 是 transition 的机率，S' 是后继 state，$\sum$ 是对所有后继 states 求和。换句话说，这是理想的 U(S) 和 U(S 的后继) 之间的关系，是一个 recursive relation。

例如，假设 agent 现时对 state (1,3) 和 state (2,3) 的估值，分别为 0.84 和 0.92。又假设 agent 察觉到，根据现有方案，在 (1,3) 时总是会发生跳到 (2,3) 这个 transition。那么这两个 states 的 U 值，应该符合这条约束：
$$ U(1,3) = -0.04 + U(2,3) $$
换句话说，这是两个 states 之间，U 值的 local (局部的)约束。

TD learning 的思想是：假设其他 U(S') 的值正确，利用 Bellman optimality 来调整当下 state 的 U(S)。当尝试的次数多了，所有 U 值都会趋向理想。 Agent 只需要用这条 update rule：
$$ U(S) \mbox{ += } \alpha ( \; R(S) + \gamma U(S') - U(S) \; ) $$

$\alpha$ 是 learning rate，它决定学习的速度（但它不能太大，避免 overshooting）。后面那东西是 U(S) 和 U(S) 的估值 (estimation) 之间的差别。对於理想的 U(S) 和 U(S')，那差别会是 0。而在每个 time step，我们只是用 $\alpha$ 部分地 调整这个差别。

最后一提，在上面理想约束的公式里，有对於机率 P 的求和，但在 update formula 中 P 不见了。那是因为 agent 在环境中的行动，暗含了对於 state transition 机率的 sampling (随机地取样本)。换句话说，那机率求和是由 agent 本身体现的。

P 是 state transitions 的机率，换句话是关於世界的一个 model。 TD learning 不需要学习 P，所以叫 model-free learning。但正如开篇时说过，model-free 并不一定是好事，人的智慧就是基於我们对外在世界有一些很好的 models。

Q value

Q 值只是 U 值的一个变种 ; U 是对每个 state 而言的，Q 把 U 值分拆成每个 state 中的每个 action 的份量。换句话说，Q 就是在 state S 做 action A 的 utility。

Q 和 U 之间的关系是：

$$ U(S) = \max_A Q(A, S) $$

Q 的好处是什么？下面将会介绍 active learning，而 Q value 配合 TD learning，可以在 active learning 中也消除 P，达到 model-free 的效果。

上面的 update rule 只要用这个关系改写就行：
$$ U(S) \mbox{ += } \alpha ( \; R(S) + \gamma \max_{A'} Q(A', S') - Q(A, S) \; ) $$

Active learning

在 passive learning 中，方案不变，我们已经能够计算每个 state S 的效用 U(S)，或者每个 state S 之下行动 A 的效用 Q(S, A)。

如果方案是可以改变的，我们只需计算不同方案的 Q 值，然后在每个 state S 选取相应於最大 Q 值的行动 A，那就是最佳方案，不是吗？

实际上执行的结果，却发现这些 agent 的方案很差！原因是，学习过程中的 Q 值是 estimate，不是理想的 Q 值，而如果根据这样的 Q 行动，agent 变得很短视，不会找到 optimal policy。（例如，某人经常吃同一间餐馆，但循另一路径走，可以发现更好的餐馆。）

Agent 需要尝试一些未知的状态／行动，才会学到 optimal policy；这就是所谓的 exploration vs exploitation （好奇心 vs 短暂贪婪）之间的平衡。

方法是，人工地将未知状态的价值增加一点：
$$ U(S) = R(S) + \gamma \max_A \mathcal{F}[ \; \sum_{S'} P(S \rightarrow S') U(S'), \; N(A, S) \; ] $$
其中 N(A, S) 是状态 S 和行动 A 这对组合出现过（被经历过）的次数，$\mathcal{F}$ 是 exploration 函数，它平时回覆正常的 U 的估计值，但当 N 很小时（亦即我们对 S，A 的经验少），它会回覆一个比较大的估值，那代表「好奇心」的效用。

结语

本来想写一篇人人能读懂的 RL 简介，但发觉写到长篇大论才勉强解释完。希望女朋友能读懂 :)

Wednesday, March 25, 2015

什么是张量？(What are tensors?)

向量 = vector
向量空间 = vector space

两个向量空间 U, V 之间的线性变换，可以用 matrix 表示。

如果 U 的维数是 m，V 的维数是 n，则 $T: U \rightarrow V$ 的矩阵形式是 $m \times n$。

矩阵应该正确地理解为：两个向量空间之间的线性变换。矩阵只是线性方程组的简写；是「线性」这性质逼使这些变换成为矩阵。

张量

张量是 双线性变换 (bi-linear form) 的 universal form (万有形式)。首先要了解什么是 bi-linear form，然后了解什么是 universal form。

Bi-linear form

Linear map (or transformation) 的意思是，一个由向量空间 U 到 V 的变换，服从：

$T: U \rightarrow V$

$T(v_1 + v_2) = T(v_1) + T(v_2)$

$T(k v) = k T(v) $

Bi-linear map 是 linear map 的推广，由向量空间 U，V 到 W 的变换，使其对 U 和对 V 也是线性变换：

$T: U \times V \rightarrow W$

$T(a v_1 + b v_2, u) = a T(v_1, u) + b T(v_2, u)$

$T(v, a u_1 + b u_2) = a T(v, u_1) + b T(v, u_2)$

Universal form

Universal form 用来表达某种结构的「最一般形式」，这个 idea 来自 category theory。

In our case，首先考虑两个向量空间之间的 tensor product，⨂：

那 $\times$ 代表 Cartesian product，即是 U，V 里面各取一个向量的任意组合。

By definition，那 tensor product 当然是 bi-linear 的。

但这时，如果有另一个 bi-linear map 射向 X：

注意，这 bi-linear map 未必是 tensor product，它只需要是任何 bi-linear map。

而我们要求存在第三个唯一的 map，通常用！号标记：

它使得「从 U x V 到 W 再到 X」，和「从 U x V 直接到 X」是一样的。换句话说，就是这个由箭头表示的 diagram 是「可交换的」，即 "commutes" 。

实际的意思，是因为那 tensor product 保存了所有关於 bi-linear forms 的「资讯」，所以由 W 可以「重构」任何 U，V 之间的 bi-linear maps。而且，这个万有形式是最 minimal 的，没有多馀的重复，所以只能有一个 unique map 由 W 射向 X。

所以说： tensor product 是所有 bi-linear forms 的 universal form。

Bi-linear maps 可以有很多例子，但它们未必是 ⨂ 的模型 (models)。例如：两个向量的 inner product、cross product、两个矩阵的 matrix product、两个复数的乘积、都是 bi-linear 的，但它们都不是 tensor product 的模型，理由是因为它们把双线性变换后的资讯萎缩了 (collapsed)。但两个矩阵的 Kronecker product 是 tensor product，还有另一个模型是最常见的，利用 dual space V* 来描述。

关於张量的代数，还有很多特性，但其抽象意义就是这个。我暂时还没有学懂张量的详细性质，这里只能简介。

Remark:

最后解释一下：

两个向量空间的 direct sum 就是它们的笛卡儿乘积 U x V。如果 U 是三度空间，V 是两度空间，则 U x V 有 3 + 2 = 5 度，它的 vectors 是像： $(v_1, v_2, v_3, u_1, u_2)$ 那样的元素。

但在张量积的空间 (tensor product space)，$U \otimes V$ 的度数是 3 x 2 = 6。一般来说它的度数可以大很多。为什么张量积的空间可以大过两个向量空间的笛卡儿积？那是因为我们在谈论两个空间之间的变换的空间。如果我们考虑非线性变换（例如多项式变换）、或是可微分变换，那些空间甚至是无穷维的！

Reference:

Guo, Hongyu 2014: Modern mathematics and applications in computer graphics and vision 这本书很易懂，我從這書中抄了一些內容 :

Monday, March 16, 2015

什么是遗传／进化算法？（What are evolutionary algorithms?)

这是基於 进化论 而启发出来的一种很特别的机器学习技巧。我最近渐渐明白到它可能是破解 strong AI 的关键。 Ben Goertzel 和我见面的谈话中也特别注重这一方向。原来 Alan Turing 很早就看到进化和 machine learning 之间有明显关系，而现时机器学习的一个很有名的研究者 Leslie Valiant 也在新书 (English version) 中谈论这课题。

进化论

首先，要理解什么是进化论。举例来说，假设某只动物的基因变异令牠比同伴有更锋利的爪和牙齿，那么那牠就能吃到同伴吃不到的食物，也能抢到更多异性去交配。久之，牠的同伴会较少后裔、甚至绝种，而牠的后裔则会越来越强盛。这就是「物竞天择适者生存」。

人类的情况有些不同，因为人类有文化 (意即靠语言传授下来的那些知识)，所以有些优点不是纯粹由基因可以看出来。说到这个问题很具争议性，因为涉及优生学等道德议题。人们喜欢举反例，例如说『这个男生长得不好看却有很多女朋友』等等，我不想扯得太远。想想看，人类最强烈的感情，莫过於找不到伴侣传宗接代，而我们得不到异性的青睐就会很痛苦，或者看到别人有更多「资源 resource」会妒忌，这些都是进化很有力的证据。人类的感情通常可以用进化解释，这科目叫 evolutionary psychology (进化心理学)。

我个人认为，竞争是生物界以至人类社会的基本状况，我们必须接受并习惯它。人类在商业、战争、体育、文艺、甚至下棋也是在竞争。「天行健君子以自强不息」，那是很好的勉励。

进化学习

以前说过，机器学习的目的就是要在浩瀚的 假设空间 (hypothesis space) 中寻找那些能正确地解释现实的语句。例如婴孩可能用「有胡子就是爸爸」来解释身边出现的人。

问题是假设空间太大，我们如同「在禾秆推里找一只针」，单靠 brute force 不是办法。

进化算法就是在很大的空间里寻找某个 最佳答案 (optimal solution) 的一个很强的技巧。它把需要寻找的 candidates 模拟成一个「人口」population，任由这些 candidates 在某个人工的环境下竞争，然后，每次选取得分最高的一小撮样本，让它们「交配」，即 基因重组 (genetic recombination)，那样产生新一批的 candidates，如此逐步推向越来越优秀的 solutions。

举个例子，研究者用进化算法进化出一些电子电路，例如用於音响的滤波器 (filter) ，其 performance 甚至比人手设计的还要优越。而且，那电路很不规则而且难理解，人们不知道它是如何运作的，有些部分甚至有多馀的元件存在。那是因为进化的过程中，有时一些表面上没用的「器官」在组合之后或许变有用，所以进化的结果也常保留很多「废物 junk」。

我就是想用这方法去写 Genifer 的学习功能（我们要进化的 candidates 就是那些逻辑句子 logic formulas）。

Algorithm

具体的算法怎样？

初始时，预备一个 population，可以是随机产生的。
对每个 candidate 进行估值 (evaluation)，即是让这个 candidate 在人工的世界里生存。例如我们测试它在所需的功能的表现如何。那计分的方法叫 objective function 「目的函数」。
选取表现最好的 N 个 candidates，进行 变种 mutation 或 重组 recombination。变种是作用在单个 candidate 上的，重组则需要一对 candidates。那就涉及到我们怎样将设计空间的元素表示为「基因」，於是有所谓表述 (representation) 的问题。这编码是要由研究者设计的。
在评分的时候，可以容许那些 candidates 在环境中互相合作亦同时竞争，那叫 cooperative evolution。我觉得 Genifer 有需要用这方法，因为在知识库中的每个知识片段，是要和其他知识互相作用，那逻辑系统才能推导出有意思的结果。

这本书（作者的网站免费提供在线阅读）深入浅出地介绍了几个「nature-inspired」算法：

例如这个基因算法的程式，用 Ruby 写成，只需短短一页。

实习

那个例子叫 "OneMax"，每个个体是一串 0 和 1 的字串，目标是全部变成 "1"。

这个例子特别简单，因为「基因」就是个体的显形 (phenotype)，省略了 基因表述 (gene expression) 这步骤，便於学习。

实际运行时，发生了有趣的现象。我先试 Ruby，然后把它翻译成 Scala，但奇怪地 Scala 版本的运行没有收敛到完美的结果，而 Ruby 版本很快就达到了。

我把 Ruby 和 Scala 进化的两个 populations 用图像显示出来，每一横列是一个个体。第一片段是 Ruby「正常」的进化（到结尾时已经进化出第一个完美的纯 "1" 个体）：

第二个是我写的 Scala 「不正常」进化：

可见在病态版本，那些孩子逐渐变成一模一样，而且每个也有同样的缺陷（因而显示成垂直线）。

为什么呢？ Debug 了整整两天，才发现一个把 0 写作 1 的错误，令到那个 point mutation 的函数形同虚设，亦即是说病态版本根本缺乏 point mutations。

那是一个非常有启发性的 bug，使我学到了，单是 sexual reproduction 还不能做到高效率的进化，原来个体的 point mutations 也有关键的重要性，而我一直误以为 有性生殖 是产生 多样性 (diversity) 的唯一途径。我依稀记得文献里曾提及这点，但不记得了。有趣吗？ :)

Blog Archive

Latex Maths

Sunday, May 17, 2015

What is reinforcement learning?

What goes in and out?

Between humans and insects

Code

How does RL work?

Utility

Bellman condition

Delta rule

Temporal difference (TD) learning

Q value

Active learning

Conclusion

Thursday, April 16, 2015

什么是 kNN 算法？（What is k-nearest-neighbor?）

如何用 Python 写 kNN ?

1. 计算点与「？」之间距离

2. 将距离排序

3. 读出点的颜色

4. 哪个颜色出现最多？

5. 完工！

Testing

Wednesday, April 8, 2015

什么是强化学习？ (What is reinforcement learning?)

输入／输出

人与虫之间

程式

強化學習的原理

Utility (价值，或效)

Bellman condition

Delta rule

Temporal difference (TD) learning

Q value

Active learning

结语

Wednesday, March 25, 2015

什么是张量？(What are tensors?)

张量

Bi-linear form

Universal form

Remark:

Reference:

Monday, March 16, 2015

什么是遗传／进化算法？ （What are evolutionary algorithms?)

进化论

进化学习

Algorithm

实习

什么是遗传／进化算法？（What are evolutionary algorithms?)