Saturday, December 20, 2014

Cognitive micro-actions

Consider this definition of "grandfather" in predicate logic:

$$\mbox{father}(X,Y) \wedge \mbox{father}(Y,Z) \rightarrow \mbox{grandfather}(X,Z).$$

The use of variables makes this formula rather long and unnatural.  Long formulas are difficult to learn via inductive learning.

My latest idea is to introduce procedural elements into logic so that certain operations intuitive to the human mind can be carried out via very short instructions.

Working memory structure

Cognitive science studies have shown that the human brain can hold approximately "7 plus or minus 2" items in working memory.

Imagine that working memory is like a data structure (e.g. a linked list, tree, stack, queue, graph, etc.) that can be programmed with micro-instructions.



In logic, every problem must be fashioned as deduction.  This makes certain simple operations complicated to define.  For example, this is the definition of "append" in Prolog:

$$\mbox{append [Head|Tail] to List gives [Head|List2]} \leftarrow \mbox{append Tail to List gives List2}$$

which is 13 symbols long.

Notice that the variables are inter-dependent among the conjuncts of the formula.


This creates complications in pattern matching (unification) and substitution.

Suppose that Genifer's working memory has a linked-list tool kit, with micro-instructions such as:

  • focus on first list item
  • move to next list item
  • replace focused list item with X
  • insert X before current focus
  • etc...
Such instructions are very short and are usually atomic or have only 1 argument.  This is a significant advantage during machine learning.
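
For instance, a working memory of this kind might look like the following sketch.  This is a speculative illustration of my own, not Genifer's actual design; the class and method names are made up.

    # A speculative sketch (not Genifer's code): a working memory modelled
    # as a list with a "focus" pointer, driven by micro-instructions that
    # are atomic or take a single argument.

    class WorkingMemory:
        def __init__(self, items):
            self.items = list(items)   # the list of working-memory items
            self.focus = 0             # index of the currently focused item

        # micro-instructions
        def focus_first(self):                 # focus on first list item
            self.focus = 0

        def move_next(self):                   # move to next list item
            if self.focus < len(self.items) - 1:
                self.focus += 1

        def replace(self, x):                  # replace focused item with x
            self.items[self.focus] = x

        def insert_before(self, x):            # insert x before current focus
            self.items.insert(self.focus, x)
            self.focus += 1

    # e.g. a tiny "program" made of micro-instructions:
    wm = WorkingMemory(['a', 'b', 'c'])
    wm.move_next()
    wm.replace('B')
    wm.insert_before('x')
    print(wm.items)   # ['a', 'x', 'B', 'c']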



Eliminating variables in logic

Consider again the grandfather formula:



$$\mbox{father}(X,Y) \wedge \mbox{father}(Y,Z) \rightarrow \mbox{grandfather}(X,Z).$$


Relation algebra (not to be confused with "relational algebra" in database theory) offers a way to eliminate variables.  It can be regarded as a form of combinatory logic (the major theory of eliminating variables), focused on relations.

For example, the grandfather example is formulated as:

$$\mbox{father} \circ \mbox{father} = \mbox{grandfather}$$

which is similar to natural-language "father's father is grandfather".  Notice that the above is a complete statement about the "operators" father(·) and grandfather(·).

Look at the following inference carried out in relation algebra, using equations only and substituting equals for equals:

$\mbox{john = father paul}$

$\mbox{paul = father pete}$

$\mbox{john = father father pete = grandfather pete}$

Also notice how similar the above derivation is to natural language.
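
As a small illustration (my own, not from the post), the same composition can be carried out on finite relations represented as dictionaries, with no variables anywhere:

    # Treat "father" as an operator mapping a person to their father,
    # so that composing it with itself gives "grandfather".
    father = {'paul': 'john', 'pete': 'paul'}   # child -> father

    def compose(f, g):
        """(f . g)(x) = f(g(x)), defined wherever both steps are defined."""
        return {x: f[g[x]] for x in g if g[x] in f}

    grandfather = compose(father, father)
    print(grandfather)          # {'pete': 'john'}  i.e. john = grandfather pete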


Genifer logic

I suggest using an eclectic mix of logical and procedural elements; such a set will almost certainly be Turing universal, if it is not too impoverished:
  • equations
  • probabilistic conditioning (Bayesian arrow)
  • subsumption


Learning

A hybrid formula of the form:

$$\mbox{conditions} \rightarrow \mbox{action}$$

is neither logical nor procedural, but it can be learned via logical inductive learning or reinforcement learning.

  • In logical learning, the above conditional statement is construed as "if conditions are satisfied, it would be appropriate to carry out the action".  Such a statement becomes declarative and thus can be learned via logical inductive learning.
  • In the procedural setting, when a reward is received, the sequence of actions leading to it is recorded; if that sequence involved the above hybrid formula, the formula is up-scored as usual.
These 2 learning processes may occur independently.
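
A toy sketch of the procedural side (my own illustration, not Genifer's learner; the rule names and the additive scoring rule are made up):

    # Reinforcement-style up-scoring of hybrid "conditions -> action" rules:
    # every rule that fired during an episode gets its score bumped when a
    # reward arrives.
    rules = {
        'hungry -> eat':  {'score': 0.0},
        'tired -> sleep': {'score': 0.0},
    }

    def run_episode(fired_rules, reward, learning_rate=0.1):
        """Credit every rule that fired in this episode with the reward."""
        for name in fired_rules:
            rules[name]['score'] += learning_rate * reward

    run_episode(['hungry -> eat'], reward=1.0)
    print(rules)   # 'hungry -> eat' now has score 0.1, 'tired -> sleep' still 0.0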


Monday, December 15, 2014

Searching in a lattice (在格中搜寻)

[ From now on I will write in English; my English is less fluent than my Chinese, but I find it more important to address international partners. ]

Last time I briefly introduced inductive learning and how its search can be conducted in 2 directions: generalization (一般化) and specialization (特殊化).

Let me first introduce the simplest searching algorithm:  binary search (二分法寻找).


Binary search

Remember the game "20 Questions", in which one player thinks of a "thing" and the other player tries to guess what it is by asking only YES / NO questions.

It may be a bit surprising that an arbitrary thing in the mind can be pinned down by only 20 yes/no questions, but that is an illustration of the power of binary search.

The number of distinct things that can be specified by 20 questions is $2^{20} = 1048576$.  That is because, at each level, the number of nodes is doubled.  In other words, the number of nodes grows exponentially as powers of 2 (以2的幂级数作指数上升).


If I play the 20 Questions game, (given that I don't have any clue as to the player's preference), I would first ask:  "Is the thing physical or abstract?"  and if physical I would ask "Is it natural or artificial?"  or if abstract I would ask "Is it (good or bad) or neutral?"

The point is that I would try to divide the classes into roughly half-and-half proportions.  That is going to maximize the information I can gain.  If, however, I ask the first question "Is it spaghetti with meatballs?" it would be too specific and a waste of chance to narrow down the domain.

The closer I can come to splitting the domain in half, the more efficient the search; the more the split deviates from 1/2, the less efficient it becomes.

The simplest case is binary search over a list of numbers (ie, we search for a given number in a list of unknown numbers;  all we know is that the list is sorted from small to big).

There are many videos and web pages explaining the binary search algorithm, so I will skip the details here; a short sketch is given below.
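
This is the standard textbook algorithm (not code from this post): halve the search interval at each step, so a sorted list of $N$ numbers needs only about $\log_2 N$ comparisons.

    def binary_search(sorted_list, target):
        """Return the index of `target` in `sorted_list`, or None."""
        lo, hi = 0, len(sorted_list) - 1
        while lo <= hi:
            mid = (lo + hi) // 2
            if sorted_list[mid] == target:
                return mid              # found: return its index
            elif sorted_list[mid] < target:
                lo = mid + 1            # target is in the upper half
            else:
                hi = mid - 1            # target is in the lower half
        return None                     # not in the list

    print(binary_search([1, 3, 5, 7, 9, 11], 7))   # 3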

Notice that in mathematics, we have what are called order relations (大小关系).  For example the natural numbers can be ordered as follows:

1 < 2 < 3 < 4 < 5 < 6 .... $\infty$

and the integers can be ordered similarly:

$-\infty$ .... -5 < -4 < -3 < -2 < -1 < 0 < 1 < 2 < 3 < 4 < 5 .... $\infty$

The rational numbers can also be ordered as a chain, but the chain cannot be listed out, because between every 2 rational numbers there is always another one.  Still, they form a chain (链) in the sense that any 2 numbers can be compared and ordered (sorted).

The computational complexity (计算复杂度) of binary search is $O(\log N)$, meaning that if we have a list of $N$ numbers, it takes about $\log_2 N$ steps of the algorithm to find the position of a target.  Mathematically, $a^x$ and $\log_a x$ are inverse operations (互逆运算) of each other.

In other words, when the domain (ie, the search space) is of size $N$, the number of search steps is the logarithm of $N$.  Remember that when we take the exponent (power, $n$ 次方) of something, the result gets very large.  Conversely, when we take the logarithm of something, the result shrinks to something very small.  For example, taking the logarithm in base 10 is like counting the number of digits in a number.  So the wealth of Bill Gates (a multi-billionaire) is an 11-digit number, whereas my wealth is a 5-digit number.

The idea of binary search (or in general, splitting the domain successively into equal partitions) is very important because it allows us to "take the logarithm" of the search space, shrinking the complexity to a manageable size.


Lattices (格)

Now we turn to our main concern.  For logic-based machine learning, we conduct a search in the hypothesis space (假设空间).  That space is organized along a general-to-specific axis (由一般到特殊的方向轴).  What makes it harder is that the elements are not ordered as a chain: they form a lattice, a mathematical generalization of a chain order.

We can denote $F_1 > F_2$ to mean "Formula 1 is more general than Formula 2".  However, not all formulas can be compared by generality.  This is why our search space is not a simple chain but a lattice.

For example, we may say that a woman is someone who has long hair, breasts, and wears skirts:
    $F_1:  \mbox{woman} \leftarrow \mbox{long hair}, \mbox{breasts}, \mbox{wears skirts}$

But we can be slightly more specific, adding that she has finer facial features, painted fingernails, speaks in a softer voice, etc:
    $F_2:  \mbox{woman} \leftarrow \mbox{long hair}, \mbox{breasts}, \mbox{wears skirts}, \mbox{fine facial features}, \mbox{painted nails}, \mbox{softer voice}$

So $F_1 > F_2$ (Formula 1 is more general than Formula 2).

We can also be more general by saying:  "a woman is anyone who has long hair":

    $F_3:  \mbox{woman} \leftarrow \mbox{long hair}$

So $F_3 > F_1$.

Generally, the more terms we add to the conditional side of a formula, the more specific it becomes (because the conditions are harder to satisfy); and if we remove terms from the conditions, the formula becomes more general (easier to satisfy).

But how about this alternative definition:  "a woman is someone with XX chromosome and female genitalia":

    $F_4:  \mbox{woman} \leftarrow \mbox{XX chromosome}, \mbox{female genitalia}$

The formula $F_4$ cannot be compared with the above formulas along the general-specific axis, because it removes some conditions and adds new ones at the same time.  But $F_4$ is also located along some chains of other formulas.

The general situation is this: not all formulas can be compared with each other, but for any 2 formulas we can always find a third one that is more general (or more specific) than both.
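
A toy sketch of this ordering (my own illustration, not Genifer's representation), treating each hypothesis as the set of conditions in its body:

    # "F is more general than G" corresponds to F's condition set being a
    # proper subset of G's; intersecting two bodies always yields a common
    # generalization.
    F1 = {'long hair', 'breasts', 'wears skirts'}
    F2 = {'long hair', 'breasts', 'wears skirts', 'fine facial features',
          'painted nails', 'softer voice'}
    F3 = {'long hair'}
    F4 = {'XX chromosome', 'female genitalia'}

    def more_general(f, g):
        """f > g in the generality order iff f's conditions are a proper subset of g's."""
        return f < g

    print(more_general(F1, F2))   # True:  F1 > F2
    print(more_general(F3, F1))   # True:  F3 > F1
    print(more_general(F4, F1), more_general(F1, F4))   # False False: incomparable

    # A common generalization of any two clauses: intersect their bodies.
    print(F1 & F4)   # set() -> the empty body, i.e. the most general clause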

As an example, this is a lattice (a line going from high to low means ">"):


This is also an example of a power set (幂集).  Given a set, say, $\{a, b, c\}$, the power set is the set of all its subsets.  For example, $\{a, b\}$ is one subset, $\{c\}$ is another subset.  The empty set $\emptyset$ is also a subset.

Here, ">" means the superset relation.

Notice that the power set in the above example has 8 elements, and $8 = 2^3$.  The 3 is because the original set $\{a, b, c\}$ has 3 elements.  In general, the power set of a set $A$ has $2^{|A|}$ elements, where $|A|$ denotes the number of elements in $A$, ie, the size of $A$.

In mathematics, everything has an explanation.  There is not a single thing in mathematics that has no explanation.  Why does the power set have $2^{|A|}$ elements?  Let's try to list all the elements of the power set of $\{a, b, c\}$.

From the bottom of the lattice diagram:  First we list the empty set (the set with 0 elements).  Then we list the sets with 1 element:  $\{a\}$, $\{b\}$, and $\{c\}$.  Then we list the sets with 2 elements:  $\{a, b\}$, $\{a, c\}$, and $\{b, c\}$.  Then we list the set with all 3 elements:  $\{a, b, c\}$.  Then we're done.

Another way to do this is to ask, for each element of $\{a, b, c\}$, whether to include it in a set or not.  So for each element we are asking a YES / NO question, and each question has 2 possible answers.  So there would be a total of $2 \times 2 \times 2$ different answers, because for each element we have 2 choices: to include it or not.

Even more generally, the total number of different mappings from a set $A$ to a set $B$ is equal to $|B|^{|A|}$.  For example, if we are trying to match 5 boys to 3 girls, there would be $3^5 = 243$ different mappings (multiple boys can map to the same girl, but each boy cannot choose more than 1 girl).  That is different from $5^3 = 125$, where multiple girls can choose the same boy, but each girl can only choose 1 boy.

That is the meaning of exponentiation, from a source to a set of choices, $\mbox{choices}^\mbox{source}$.  The size of $B^A$ is the number of possible functions (mappings) from $A \rightarrow B$.

Even if the sets are uncountable, such as the points on the line interval [0,1], the exponentiation can be understood as the functions from one set to the other.
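
A quick check in code (plain Python, nothing specific to this post): enumerate the power set by making a yes/no choice for each element, which is exactly why it has $2^{|A|}$ members.

    from itertools import combinations

    A = ['a', 'b', 'c']
    power_set = [set(c) for r in range(len(A) + 1) for c in combinations(A, r)]
    print(power_set)                     # the 8 subsets, from set() up to {'a','b','c'}
    print(len(power_set), 2 ** len(A))   # 8 8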

Notice that in the lattice above, $\{a\}$ cannot be compared to $\{b\}$, but $\{a, b\} > \{a\}$ and $\{a, b\} > \{b\}$, so we can find an element that is "bigger" than both $\{a\}$ and $\{b\}$.

Also notice that the least common element that is bigger than both $x$ and $y$ (their join) is denoted $x \vee y$.  In a lattice such an element must always exist and be unique.  This means that the following diagram is forbidden (for lattice orders):


because the red lines and blue lines both indicate a possible $x \vee y$.

This is also not allowed:

because both $y \vee z = z$ and $y \vee z = w$ are possible.
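
A small checker (my own sketch, with a made-up representation of the order as a "greater than" dictionary) that computes the unique join of two elements when it exists:

    def upper_bounds(order, x, y):
        """All elements that are >= both x and y."""
        elems = set(order) | {e for s in order.values() for e in s}
        return {u for u in elems
                if (u == x or x in order.get(u, set()))
                and (u == y or y in order.get(u, set()))}

    def unique_join(order, x, y):
        """The least upper bound of x and y, or None if it is not unique."""
        ubs = upper_bounds(order, x, y)
        least = [u for u in ubs
                 if all(v == u or u in order.get(v, set()) for v in ubs)]
        return least[0] if len(least) == 1 else None

    # power-set lattice on {a, b}: 'ab' > 'a' > '' and 'ab' > 'b' > ''
    order = {'ab': {'a', 'b', ''}, 'a': {''}, 'b': {''}, '': set()}
    print(unique_join(order, 'a', 'b'))   # 'ab'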

These are some more complicated, but acceptable, lattices:






Conclusion

In Genifer, our learning algorithm needs to search within the lattice of formulas (hypotheses).  Whenever Genifer experiences some new examples, it searches for the most general generalization (mgg) of the given examples, as long as the hypothesis does not contradict her existing knowledge.

It is still a somewhat open question as to how to conduct this search efficiently.  We wish to find a fast algorithm analogous to binary search along a chain, but within a lattice.  Maybe the reader can help me solve this problem?

Sunday, December 7, 2014

What are Bayesian networks? (什么是贝叶斯网络?)

Propositions and the relations between them


Last time I introduced propositional logic: a proposition is anything that can be assigned a truth value, for example "it rained yesterday" or "humans are animals".

Propositions stand in relations to one another, for example: "ate unclean food" → "stomach ache", written P → Q, read as "P implies Q".

This arrow is the most important thing in all of logic; it is what makes deduction possible.  Without it, knowledge would just be a heap of disconnected, unrelated points.

The universe is infinite but the human brain is finite.  How can a finite brain understand an infinite world?  Through general rules.  Those rules contain variables, so logical formulas can be applied, like templates (模版), to different instances and yield different conclusions; in this way the "world of experience" can be compressed into a finite amount of information.

In short, variables are the basis of compression, and the arrow is the basis of deduction.


Probability


In classical logic, however, every proposition is either true or false (binary).  That is obviously limiting, so researchers of our generation no longer use classical logic as-is.  My Genifer, Ben Goertzel's OpenCog, and Pei Wang's OpenNARS each adopt some kind of "uncertainty logic".



My approach is classical probability theory plus fuzziness.  For example, if we say "Mary is tall", the probability distribution can look like the red curve in the figure below:


That is, Mary's height follows an uncertain probability distribution, and the most likely height is at the peak.  Note that I do not mark her actual height; instead I use the interval [0,1] to represent the degree of "tallness", so 0.7 is taller than average and 0.5 is average.

If the uncertainty increases, the curve becomes flatter, for example:

Also, a basic property of probability is that the total area under the distribution curve is always 1: however Mary's height is distributed, the sum over all possibilities is a "certain event", and the probability of a certain event is 1.


Bayesian network

Below is an example of a Bayesian network:

Each arrow, pointing from X to Y, represents "Y given X", that is, "given that we know X has happened, what is the chance of Y?"

For example, in the graph we can see that smoking influences lung cancer, lung cancer influences fatigue, and lung cancer also influences the X-ray result.

Each arrow in the graph above comes with numbers of the form P(X|Y), as listed in the table below
(no need to study these in detail):

P(X1=no)=0.8
P(X1 = yes)=0.2
P(X2=absent | X1=no)=0.95
P(X2=absent | X1=yes)=0.75
P(X2=present | X1=no)=0.05
P(X2=present | X1=yes)=0.25
P(X3=absent | X1=no)=0.99995
P(X3=absent | X1=yes)=0.997
P(X3=present | X1=no)=0.00005
P(X3=present | X1=yes)=0.003
P(X4=absent | X2=absent, X3=absent)=0.95
P(X4=absent | X2=absent, X3=present)=0.5
P(X4=absent | X2=present, X3=absent)=0.9
P(X4=absent | X2=present, X3=present)=0.25
P(X4=present | X2=absent, X3=absent)=0.05
P(X4=present | X2=absent, X3=present)=0.5
P(X4=present | X2=present, X3=absent)=0.1
P(X4=present | X2=present, X3=present)=0.75
P(X5=absent | X3=absent)=0.98
P(X5=absent | X3=present)=0.4
P(X5=present | X3=absent)=0.02
P(X5=present | X3=present)=0.6

The name Bayesian network comes from Bayes, a mathematician of roughly Newton's era:


He discovered a formula for computing probabilities of the form P(X|Y):

$$ P(A|B) = \frac{P(B|A) \, P(A)}{P(B)} $$

which can also be written more symmetrically as:
$ P(A|B) P(B) = P(A,B) = P(B|A) P(A) $

where P(A,B) is the probability that both A and B occur, which is different from "A given B".

This formula is taught in middle or high school and is not hard to understand.

The difficulty with Bayesian networks is that the network links many probabilities together, so Bayes' rule must be applied over and over, one application nested inside another, producing complicated sums of products.  But with patience and care the algorithm is not really hard, because apart from Bayes' rule and the basic operations of probability, there is nothing else involved.
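
To make that concrete, here is a minimal sketch (my own illustration, not code from this post) that answers queries on the 5-variable network above by brute-force enumeration of the joint distribution built from the CPTs listed earlier.  The network structure (X1 → X2, X1 → X3, {X2, X3} → X4, X3 → X5) is my reading of the figure's description.

    from itertools import product

    # CPTs copied from the table above
    p_x1 = {'no': 0.8, 'yes': 0.2}
    p_x2 = {'no':  {'absent': 0.95, 'present': 0.05},
            'yes': {'absent': 0.75, 'present': 0.25}}
    p_x3 = {'no':  {'absent': 0.99995, 'present': 0.00005},
            'yes': {'absent': 0.997,   'present': 0.003}}
    p_x4 = {('absent', 'absent'):   {'absent': 0.95, 'present': 0.05},
            ('absent', 'present'):  {'absent': 0.5,  'present': 0.5},
            ('present', 'absent'):  {'absent': 0.9,  'present': 0.1},
            ('present', 'present'): {'absent': 0.25, 'present': 0.75}}
    p_x5 = {'absent':  {'absent': 0.98, 'present': 0.02},
            'present': {'absent': 0.4,  'present': 0.6}}

    def joint(x1, x2, x3, x4, x5):
        """P(X1..X5) as the product of the local conditional probabilities."""
        return (p_x1[x1] * p_x2[x1][x2] * p_x3[x1][x3]
                * p_x4[(x2, x3)][x4] * p_x5[x3][x5])

    def query(target, value, evidence):
        """P(target = value | evidence), by summing the joint distribution."""
        names = ['x1', 'x2', 'x3', 'x4', 'x5']
        domains = [['no', 'yes']] + [['absent', 'present']] * 4
        num = den = 0.0
        for combo in product(*domains):
            a = dict(zip(names, combo))
            if any(a[k] != v for k, v in evidence.items()):
                continue
            p = joint(*combo)
            den += p
            if a[target] == value:
                num += p
        return num / den

    # e.g. the probability that X3 is present, given that X5 is present:
    print(query('x3', 'present', {'x5': 'present'}))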


References


Classical AI used not to use probability, until in the 1980s Judea Pearl wrote a book analyzing Bayesian networks in great detail, which made them mainstream:



Pearl's son was an American journalist working in Pakistan; the family is Jewish, and he was captured by terrorists and beheaded, the act being filmed.  I cannot tell whether he was killed for being extremely pro-American, or murdered innocently as a sympathizer.

This is Daphne Koller, one of the authors of this newer massive book (2009, 1280 pages!):



But that book is too technical and detailed, and is not easy reading for beginners.

Daphne, at Stanford U, is one of the founders of Coursera; her (free) course https://www.coursera.org/course/pgm explains the algorithms and learning methods for Bayesian networks in detail.

Incidentally, this is Peter Norvig; "AIMA", which he co-authored with Stuart Russell, is currently the best artificial intelligence textbook.



Monday, November 17, 2014

What is machine learning? (什么是机器学习?)

Machine learning is the most important part of AI, but when I wrote Genifer in Clojure last time I never finished this part, so that prototype ended up with no applications at all, which was a big mistake.

The essence of "learning"

The human brain very often learns by generalization (一般化).

For example, a Cantonese saying goes 『有须就认老窦』 ("anyone with a beard must be your father", used to mock over-generalization), yet perhaps that is exactly how infants recognize their parents.

Another example: when I was small, my grandfather taught me to write the Chinese numerals 一二三四五.  By the time I learned 三 (three strokes), I smugly announced, "I know, the character for four must be written with 4 strokes!"  Wrong, of course, but it shows how a child's mind works.

Yet another example: some people, after being deceived by a woman once, never trust women again.

These are all examples of generalization.

The opposite direction is called specialization (特殊化), or refinement (精细化).

For example, we revise "all women lie" to "some women are honest", or "the woman who deceived me actually has an honest side too."

Generalization and specialization are the basic operations of logic-based learning; there is nothing more to it.

Besides this, there is another category of machine learning, based on recognizing patterns among points in a space:


For example, suppose we already know there are two classes of things (labeled red and blue).  We measure some of their features quantitatively and plot them as points in a coordinate space, and we find that the red and blue points can be roughly separated by a line.  From then on we only need to measure those features to tell whether something belongs to the red group or the blue group, without having to know its color in advance.

A prerequisite for this kind of statistical learning "in space" is that we have some numerical measurements; otherwise there is no geometric space to speak of.  Neural networks are an example of such "spatial learning".

On the other hand, logic-based learning does not need a geometric space; it only needs symbolic manipulations.

Below we focus on logic-based learning; how to unify logical learning and spatial learning is, I think, an important research topic.


Logic-based learning


For example, we observe that "many computer science students wear glasses"; but how do we jump to this inductive conclusion?

In the logic engine, the ground facts already present are:
studies-CS(小明), wears-glasses(小明),
studies-CS(小娟), wears-glasses(小娟),
studies-CS(小强), wears-glasses(小强),
studies-CS(美玲), wears-glasses(美玲),
studies-music(小芬), not wears-glasses(小芬),
boy(小明), boy(小强),
girl(小娟), girl(美玲), girl(小芬),
. . . . . . . etc.

The general rule we want to obtain is:
studies-CS(X) $\rightarrow$ wears-glasses(X)

The algorithm is really just a search over all possible formulas.

In other words, starting from the simplest formula, we keep generating more and more complex formulas, and then test them one by one to see which one best explains the facts.

The simplest formula is the "empty" proposition (nothing at all, meaning everything is true).

Then we gradually add logical terms.

For example:
$\rightarrow$ wears-glasses(X)
says that everyone wears glasses, which does not fit the facts.

Another example:
girl(X) $\rightarrow$ wears-glasses(X)
"all girls wear glasses", which also does not fit the facts.

Finally we test:
studies-CS(X) $\rightarrow$ wears-glasses(X)
and find that it holds with high probability (perhaps with a few exceptions), so we accept this hypothesis.

In other words, this is a search in the hypothesis space.

The structure of this search space is shaped like a tree that keeps branching into ever finer subdivisions:


We search from the most "general" proposition (everything is true) toward more and more specific ones, for example "people who are 22 years old, born in May, weigh over 70 kg, are in year 4 of university, failed mathematics, and have the surname 张 $\rightarrow$ all wear glasses"; somewhere along the way the formula we are hoping for is bound to appear.  This is called general-to-specific search; a toy sketch follows.
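
Here is a minimal sketch of such a general-to-specific search (my own toy example, not Genifer's learner), using the glasses facts above; the predicate names are translated and the 95% acceptance threshold is an arbitrary choice.

    from itertools import combinations

    people = ['小明', '小娟', '小强', '美玲', '小芬']
    facts = {
        'studies_cs':    {'小明', '小娟', '小强', '美玲'},
        'studies_music': {'小芬'},
        'boy':           {'小明', '小强'},
        'girl':          {'小娟', '美玲', '小芬'},
        'wears_glasses': {'小明', '小娟', '小强', '美玲'},
    }
    target = 'wears_glasses'
    candidates = [p for p in facts if p != target]

    def accuracy(body):
        """Of the people satisfying every condition in `body`, the fraction
        that also satisfies the target predicate."""
        covered = [x for x in people if all(x in facts[p] for p in body)]
        return sum(x in facts[target] for x in covered) / len(covered) if covered else 0.0

    found = None
    for size in range(len(candidates) + 1):          # general -> specific
        for body in combinations(candidates, size):
            if accuracy(body) >= 0.95:               # arbitrary threshold
                found = body
                break
        if found is not None:
            break

    print(found, '->', target)   # expected: ('studies_cs',) -> wears_glasses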

But because the hypothesis space contains combinatorially many possibilities, this kind of learning is very slow.

The only way out seems to be to classify all concepts and rules, e.g. "this is physics knowledge", "this is knowledge about my girlfriend", and so on, and during the search to attend only to the sets most likely to be relevant: "when buying a gift for my girlfriend, consider class A knowledge", "when sitting a physics exam, consider class B knowledge", and so on.

It is hard, but this algorithm must be written, otherwise there is no hope for Genifer!

PS:  On the Genifer web page I have a more detailed explanation of inductive learning (PDF, in English).

Sunday, November 16, 2014

How to write a logic engine? (如何写一个逻辑引擎?)

Here I will try to explain, in the simplest possible way, how to write a logic engine.

Even the most advanced AI systems today, such as OpenCog, are still very simple inside.

Logical deduction needs only two algorithms:

  1. propositional inference
  2. unification (配对法)
Each of these can be expressed in just a few lines of code!


What is logic?


Logic can be broken down into two parts:
  1. propositional logic
  2. the internal structure of propositions

Propositional logic

Propositional logic means that we care only about whether propositions are true or false, without looking inside them.  So "小明 loves 小娟" and "it rained in Beijing yesterday" are the same kind of sentence, as long as we know their truth values.

Relations between propositions are written with arrows, for example:
  1. ate unclean food $\rightarrow$ stomach ache
  2. heavy rain, no umbrella $\rightarrow$ soaked to the skin
But propositions are not necessarily black-and-white, and the relations can carry uncertainty too, so strictly speaking they should be described with fuzzy probability (模糊机率).

The principle of the inference algorithm is very simple:

The relations among all the propositions form a graph, for example the one below:


We know the truth values of some nodes but not others; we "propagate" the truth values from the known nodes to the unknown ones.  This propagation follows rules: probabilities, for instance, have to be computed by the laws of probability theory; you can even invent your own rules, or have a machine learn them, so that we do not have to rack our brains designing them!

So as long as you know how to build and manipulate graph structures on a computer, the first problem is solved.
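
As a toy illustration (my own sketch, not Genifer's engine), here is truth-value propagation over a small rule graph by forward chaining; combining premise values by multiplication is just one arbitrary choice, as noted above.

    rules = [
        (['ate unclean food'], 'stomach ache'),
        (['heavy rain', 'no umbrella'], 'soaked'),
    ]
    truth = {'ate unclean food': 0.8, 'heavy rain': 0.9, 'no umbrella': 0.7}

    changed = True
    while changed:                      # keep propagating until nothing changes
        changed = False
        for premises, conclusion in rules:
            if all(p in truth for p in premises):
                value = 1.0
                for p in premises:
                    value *= truth[p]   # naive combination of premise values
                if truth.get(conclusion, 0.0) < value:
                    truth[conclusion] = value
                    changed = True

    print(truth)   # adds 'stomach ache': 0.8 and 'soaked': 0.9 * 0.7 = 0.63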

Predicate logic

Now let us look at the internal structure of sentences.

The most classical approach is predicate logic, invented by Gottlob Frege in 1879.  Bertrand Russell used it to rewrite all of mathematics, which made it famous, and then John McCarthy, the father of AI, applied it to AI, giving rise to classical AI.  Nowadays those of us working on strong AI do not seem to like it much and use various variations, but the basic idea is the same.

The predicate-logic approach is to split a sentence into a predicate and the objects inside it.
For example:
"小明 loves 小娟"  -->  loves(小明, 小娟)
"it rains in Beijing"  -->  rains(Beijing)
But for more complex sentence structures, such as "小明 loves 小娟 stubbornly and without hope", it depends on the individual researcher; there is no unified standard.  We will not go into that for now.

In fact, any method that breaks a sentence apart will do; the inference method for propositions does not change at all.

But once sentences are broken apart, we can use variables inside them, which is very useful.

For example, we can form:
loves(小明, X)
where the variable X can be 小娟, 美玲, and so on.

So now we need another algorithm, called unification (配对).

Unification

What unify does is this: given two sentences A and B, which may be internally complex and may contain variables, we try to find substitutions (代入) that make the two sentences identical.

For example:
loves(X, Y)

loves(小强, 美玲)
These two propositions become the same sentence under the substitution {X = 小强, Y = 美玲}.

Of course, some propositions cannot be unified, for example:
loves(小明, 小娟)
hates(小明, X)
in which case the algorithm returns "fail".
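
Here is a minimal unification sketch (my own toy version, not Genifer's code), just to show how short the algorithm can be; it omits the occurs check.

    # Terms: a string starting with an uppercase letter is a variable, any
    # other string is a constant, and a compound term is a tuple such as
    # ('loves', '小强', '美玲').

    def is_var(t):
        return isinstance(t, str) and t[:1].isupper()

    def walk(t, subst):
        """Follow substitution chains until we reach an unbound term."""
        while is_var(t) and t in subst:
            t = subst[t]
        return t

    def unify(a, b, subst=None):
        """Return a substitution dict making a and b equal, or None (fail)."""
        if subst is None:
            subst = {}
        a, b = walk(a, subst), walk(b, subst)
        if a == b:
            return subst
        if is_var(a):
            return {**subst, a: b}
        if is_var(b):
            return {**subst, b: a}
        if isinstance(a, tuple) and isinstance(b, tuple) and len(a) == len(b):
            for x, y in zip(a, b):
                subst = unify(x, y, subst)
                if subst is None:
                    return None
            return subst
        return None   # clash of constants or different predicates

    print(unify(('loves', 'X', 'Y'), ('loves', '小强', '美玲')))
    # {'X': '小强', 'Y': '美玲'}
    print(unify(('loves', '小明', '小娟'), ('hates', '小明', 'X')))
    # None  ("fail")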

Variables are a very important invention; only with variables can we express general rules (一般法则), for example:
"all women lie" becomes, in logic:
woman(X) $\rightarrow$ lies(X)

Some people, after being deceived by a woman once, never trust women again; so the human mind also learns through generalization (一般化).  That belongs to machine learning, which we will discuss later.

What unify does is really pattern matching.
For example:
    wears-glasses(X) or loves(小明, Y)
are patterns, which can be matched against instances such as:
    wears-glasses(美玲) or loves(小明, 小娟).

An AI is just one very large database containing many general rules written as patterns; matching against them produces new inferences.  For example, 小娟 is a woman, so applying the general rule above we conclude that 小娟 lies.

Tuesday, June 3, 2014

Cayley graphs of free groups and monoids

This is a Cayley graph:



Its nodes represent elements of the free group generated by $\{a, b\}$.  It goes out in 4 directions from the origin:  $\{a, b, a^{-1}, b^{-1}\}$.

This is an animated GIF of the same graph:

But it repeats recursively in only 3 directions, because $a \cdot a^{-1} = 1$ ("going backwards") is cancelled out.

The graph itself is of fractal dimension, but it is embedded in the 2D plane; that is because the group is generated by 2 elements.

If we ignore the non-commutativity of the group, in other words $a \cdot b = b \cdot a$, then we don't have to shrink the edges, and the Cayley graph becomes a regular grid like this:


This can be regarded simply as the direct product of the generator groups {a} and {b}.  And it clearly shows why 2 dimensions are needed.

But interestingly, we can also plot the Cayley graph of a free group with 3 generators on the 2D plane.

My first attempt was to divide the full circle of $360^\circ$ into 6 points, with 3 points representing $\{a, b, c\}$ and the 3 others their inverses.  But unfortunately there is a "clash" between some elements:


So the solution is to shrink the edges not by 1/2 but by 1/3.  This is the result:


This trick works for any number of generators, by cutting the full circle into 2n divisions.  For example, this is the case n=4:


Case n=10:
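
Here is a rough sketch of this drawing procedure (my own Python re-implementation; the author's actual code is the HTML5 Canvas / JavaScript linked below): branch from each node in 2n directions, never going straight back, shrinking the edge length by 1/3 at each step.

    import math

    def cayley_points(n_generators, depth, shrink=1/3):
        """Return (x, y) positions of the free-group Cayley graph nodes."""
        directions = 2 * n_generators
        points = [(0.0, 0.0)]

        def grow(x, y, length, incoming, level):
            if level == 0:
                return
            for d in range(directions):
                # skip the direction that goes straight back where we came from
                if incoming is not None and d == (incoming + n_generators) % directions:
                    continue
                angle = 2 * math.pi * d / directions
                nx, ny = x + length * math.cos(angle), y + length * math.sin(angle)
                points.append((nx, ny))
                grow(nx, ny, length * shrink, d, level - 1)

        grow(0.0, 0.0, 1.0, None, depth)
        return points

    pts = cayley_points(n_generators=2, depth=4)
    print(len(pts))   # number of group elements drawn, up to word length 4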



Free monoids

What if we ignore the inverse elements, so that we get free monoids?  We can have 4 generators {a, b, c, d} and the graph is as follows:


It is clear that the nodes are "space-filling" as every point in the square can be expressed as a binary fraction.

Thus every point uniquely identifies a word in the free monoid generated by 4 elements.  Beware that some edges are overlapping in this graph.

The case of the n=3 monoid is a bit interesting, as it turns out to be the Sierpinski triangle!


When n gets big, the monoid graphs look exactly the same as the group graphs for n/2.  For example, this is the monoid for n=20:


And this is monoid, n=100:



Code

The code is very short, written in HTML5 Canvas and Javascript.  You can download the .zip here.

Why the interest?

I am interested in this because the free groups / monoids can represent logical formulas (when the logic is in algebraic form).  If we add a "gray scale" or "colors" to the nodes, that gives them extra dimensions that can be regarded as truth values.  Thus, an entire knowledge base can be represented by such a continuously-colored graph.

In other words, we have a geometric representation of knowledge elements that can be multiplied and added together.  Multiplication is the monoid action, whereas addition is the scalar addition of truth (color) values.

I will write about this last idea in more detail.... later :)

Friday, May 23, 2014

P = NP, my first stab at the problem

[ This idea is inspired by Karmarkar's polynomial-time interior point method for linear programming. But it turned out to be a dead end, at least from my current perspective.  Anyway, I leave the record here for future's sake and it might illuminate other people.  ]

In mathematics, there is a long tradition of using "smooth" things to calculate "discrete, combinatorial" things. A famous early example is Euler's use of Newton/Leibniz's integral to approximate infinite series of the form $ \sum^n_{i=1} f(i) $, also known as Euler-Maclaurin summation [cf the book "Mathematical Masterpieces", Knoebel et al 2007]. A more recent example is Alexander Grothendieck's invention of schemes that builds a bridge between the (smooth) algebraic geometry and (discrete) Diophantine geometry.

So my hunch is to "move the SAT problem to a smooth setting and then apply smooth techniques". In fact, that was exactly what happened for the linear programming problem -- the simplex algorithm is discrete in the sense that it traverses the vertex points on the boundary of the feasible solution space, such a traversal tends to be exponential because the number of vertices grows exponentially as the number of equations. Whereas, the "interior point" methods avoid the boundary and vertices and instead manipulate "ellipsoids" that bound the feasible space.

Using an analogy, imagine a hotel with an exponential number of rooms, where some of the rooms contain the prize, but we have only a polynomial amount of time to check the rooms. On the surface it seems impossible; that is why P=NP is hard. But it is only the size of the ambient space that is exponential; the targets are finitely generated from a description of length N (= the input size); so there is hope.

Now add to the mix the inspiration from "interior point" methods: our algorithm should avoid testing the hotel rooms -- because as soon as we test the first room we may get trapped into testing an exponential number of rooms, just as in linear programming we have to avoid "touching" the boundary. Perhaps we should iterate through the input constraints, updating a certain "data structure" (which could be any mathematical structure in a broad sense), and such a structure would asymptotically approach the feasible solution(s).

The final shape of this data structure cannot contain the entire solution set, since it has an exponential number of vertices, and there isn't enough time to update that many vertices before the answer should be out. It cannot even contain a subset of those vertices, because then some of the vertices have to be eliminated in the process, thus making our function discontinuous. (Topologically, continuous means that the pre-image of every open set remains open).

Our function cannot output [0,1] because the answer may jump between 0 and 1, again discontinuous.

Notice that SAT is equivalent to deciding whether a set of polynomials has solutions in the reals ("existential theory of the reals").  Currently the best algorithm for this is singly exponential [cf: "Algorithms in real algebraic geometry", Basu et al 2006, ch.13].

But our function cannot output the volume of the set of real solutions, as the volume depends on the entirety of the boundary, which we want to avoid.  The Oleinik-Petrovsky-Thom-Milnor theorem [cf Basu et al 2006, p.4] sets a bound on the sum of Betti numbers of an algebraic set, which is polynomial in the degree and exponential in the number of variables.  The zeroth Betti number is the number of connected components of the set.  This seems to confirm my intuition that the boundary is of "exponential" complexity and thus should be avoided.

After eliminating these options, there may still be a viable path, which is to find the "minimal" real solution in the algebraic set.  That means we take only the real component of $z$ and require that its Euclidean norm, $||\mathrm{Re}\, z||_2$, be smallest.  A slight glitch is that the real part of a complex number is given by $\mathrm{Re}\, z = (z + z^*)/2$, where $z^*$ denotes complex conjugation, which cannot be expressed in a polynomial formula.

But I realized that a fatal problem of this approach is that we still have not avoided testing an exponential number of points.  This can be illustrated by this picture:



The regions are defined by a set of $k$ polynomials of maximum degree $m$, in $n$ variables.  We need to find the common intersections (if any exists) of all $k$ polynomials.  Figuratively speaking (as this diagram is not realistic), the potential number of intersections is related to the number of "peaks", ie the degree of the polynomials.

If we must test each "peak" against each other, for all $k$ polynomials, the maximal number of tests would be $m^k = O(m^n)$, ie, exponential in the number of variables.

To put it succinctly, the problem is:  When we test a peak at a certain location, our effort spent at that location does not contribute to testing the peaks at other locations, and we have an exponential number of peak-to-peak combinations to be tested.

This is the general feeling one has when tackling an NP-hard problem.

So, the critical insight that may ultimately solve NP is that we should not waste time testing local cases, but we should exploit the fact that the localities are somehow correlated to each other.

It is well known that 2-SAT is in P and 3-SAT is NP-complete.  In other words, when the degree of the polynomials gets to 3, we get NP-hardness.  Any degree-higher-than-3 problems can be reduced to degree-3 instances, so we only need to consider cubic surfaces.  That would be my next goal.... :)

Tuesday, February 25, 2014

P = NP can be proven?

I have a hunch that P=NP can actually be proven.

It is known that in this chain of inclusions:
$$ P \subseteq NP \subseteq PSPACE \subseteq EXPTIME $$ at least one inclusion is proper (ie, equality does not hold), by the "Time Hierarchy theorem". Other than that, none of the inclusions have been proven to be proper so far. Note that this theorem is proved via diagonalization, ie, Cantor's theorem that the cardinality of a power set must be bigger than the cardinality of the set itself. In other words, the proof is based on cardinality. But (with my limited knowledge and saying this vaguely) I have not known of similar cardinality proofs that are "smaller" than Cantor's. Does it mean that there is only 1 inequality within that chain?