Friday, December 7, 2012

Matrix trick and the ontology ball

Genifer's logic has the structure of a semi-ring obeying:


$\mbox{associative addition:} \quad (A + B) + C = A + (B + C) $
$\mbox{associative multiplication:} \quad (A \times B) \times C = A \times (B \times C) $
$\mbox{non-commutative multiplication:} \quad A \times B \neq B \times A $
$\mbox{distribution on the right}\quad (A + B) \times C = A \times C + B \times C $
$\mbox{commutative addition:} \quad A + B = B + A $
$\mbox{idempotent addition:} \quad A + A = A $

In short, multiplication is similar to "concatenation of letters" and addition is like "set union".  In programming (data-structure) terms, the algebra is like a set of strings composed of letters, and the letters are what I call concepts.

Matrix trick

The matrix trick is simply to represent the elements of the algebra by matrices.  (This may be related to representation theory in abstract algebra).

What is a matrix?  A matrix is a linear transformation.  For example, this is an original image:

This is the image after a matrix multiplication:

In short, matrix multiplication is equivalent to a combination of rotation, reflection, and/or shear.  You can play with this web app to understand it intuitively.

So my trick is to represent each concept, say "love", by a matrix.  Then a sentence such as "John loves Mary" would be represented by the matrix product "john $\otimes$ loves $\otimes$ mary".


Each arrow represents a matrix multiplication.  The transformations need not occur on the same plane.

By embedding the matrices in a vector space (simply writing the matrix entries as a flat vector), I can "locate" any sentence in the vector space, or cognitive space.

Keep in mind that the rationale for using matrices is solely because $AB \neq BA$ in matrix multiplication, or:
John loves Mary $\neq$ Mary loves John.
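
As a minimal sketch of the trick (the $3 \times 3$ random matrices below are only placeholders for learned concept matrices):

import numpy as np

rng = np.random.default_rng(0)

# Placeholder concept matrices; in a real system they would be learned, not random.
john, loves, mary = (rng.standard_normal((3, 3)) for _ in range(3))

s1 = john @ loves @ mary   # "John loves Mary"
s2 = mary @ loves @ john   # "Mary loves John"

print(np.allclose(s1, s2))   # False: the two sentences are different elements
print(s1.flatten())          # the sentence as a point in a 9-dimensional "cognitive space"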

Ontology ball

This is another trick I invented:


It is just a representation of an ontology in space (as a ball).  To determine if $A \subset B$, we see if $A$ is located within the "light cone" of $B$.  Note that fuzzy $\in$ and $\subset$ can be represented by using the deviation from the central axis as a fuzzy measure.

I think this trick is important since it replaces the operation of unification (as in Prolog or first-order logic) with something spatial.  Traditionally, unification is done by comparing 2 logic terms syntactically, via a recursive-descent algorithm on their syntax trees.  This algorithm is discrete.  With this trick we can now perform the same operation using spatial techniques, which is potentially amenable to statistical learning.
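
As a minimal sketch of the ball/cone test (the cone half-angle and the 2-D placements of the concepts are arbitrary choices here):

import numpy as np

def fuzzy_subset(a, b, half_angle=np.pi / 6):
    # Fuzzy degree to which concept a lies inside the cone of concept b.
    # a, b are direction vectors from the origin; the half-angle is a free parameter.
    cos_angle = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    angle = np.arccos(np.clip(cos_angle, -1.0, 1.0))
    return max(0.0, 1.0 - angle / half_angle)   # 1 on b's axis, 0 at the cone boundary

animal = np.array([1.0, 0.1])   # hypothetical placements
cat    = np.array([1.0, 0.3])
chair  = np.array([0.1, 1.0])

print(fuzzy_subset(cat, animal))    # positive: cat falls inside animal's cone
print(fuzzy_subset(chair, animal))  # 0.0: chair lies outside the cone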

Update:  Unfortunately I realized that the light-cone trick is not compatible with the matrix trick, as matrix multiplication rotates the space, which may destroy the $\in$ or $\subset$ relations (they rely on straight rays projecting from the origin to infinity).  In fact, a matrix multiplication can be seen as a rotation by an angle $\theta$ which can be expressed as a rational multiple of $\pi$ (or arbitrarily closely approximated by one if irrational).  Then we can always find a $k$ such that $A^k = Id$ where $Id$ is the identity.  In other words, by composing any word with itself enough times, say "love", we get:
love $\circ$ love $\circ$ love ... = $Id$
which doesn't make sense.

I am still thinking of how to combine the 2 tricks.  If successful, it may enable us to perform logic inference in a spatial setting.  

Thursday, November 29, 2012

Support vector machines

The 3 parts of an SVM

The SVM consists of 3 "tricks" combined to give one of the most influential and arguably state-of-the-art machine learning algorithms.  The 3rd, the "kernel trick", is what gives the SVM its power.

During the talk I will refer to Cortes and Vapnik's 1995 paper.
  1. The linear classification problem, ie, separating a set of data points by a straight line / plane.  This is very simple.
  2. Transforming the primal optimization problem to its dual via Lagrange multipliers.  This is a standard trick that does not result in any gain / loss in problem complexity.  But the dual form has the advantage of depending only on the inner product of input values.
  3. The kernel trick, which transforms the input space to a high-dimensional feature space.  The transform function can be non-linear and highly complex, but it is defined implicitly by the inner product.

1. The maximum-margin classifier



The margin is the separation between these 2 hyper-planes:
 $ \bar{w}^* \cdot \bar{x} = b^* + k $
 $ \bar{w}^* \cdot \bar{x} = b^* - k $
so the maximum margin is given by:
 $ m^* = \frac{2k}{|| \bar{w}^* ||} $

The optimization problem is to maximize $m^*$, which is equivalent to a minimization:
$$ \max (m^*) \Rightarrow \max_{\bar{w}} (\frac{2k}{ || \bar{w} || })
\Rightarrow \min_{\bar{w}} (\frac{ || \bar{w} || }{2k})  \Rightarrow \min_{\bar{w}} ( \frac{1}{2} \bar{w} \cdot \bar{w} ) $$subject to the constraint that the hyper-plane separates the + and - data points:
$$ \bar{w} \cdot (y_i \bar{x_i} ) \ge 1 + y_i b $$

This is a convex optimization problem because the objective function has a nice quadratic form $\bar{w} \cdot \bar{w} = \sum w_i^2$ which has a convex shape:



2. The Lagrangian dual

The Lagrangian dual of the maximum-margin classifier is to maximize:
$$ \max_{\bar{\alpha}} ( \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j \bar{x_i} \cdot \bar{x_j} ) $$subject to the constraints:
$$ \sum_{i=1}^l \alpha_i y_i = 0, \quad  \alpha_i \ge 0 $$
Lagrange multipliers are a standard technique, so I'll not explain them here.

Question:  why does the inner product appear in the dual?  The important point is that the dual form depends only on this inner product $(\bar{x_i} \cdot \bar{x_j})$, whereas the primal form depends on $(\bar{w} \cdot \bar{w})$.  This is what allows us to apply the kernel trick to the dual form.

3. The kernel trick

A kernel is just a symmetric inner product defined as:
$$ k(\mathbf{x}, \mathbf{y}) = \Phi(\mathbf{x}) \cdot \Phi(\mathbf{y}) $$By leaving $\Phi$ free we can define the kernel as we like, for example, $ k(\mathbf{x}, \mathbf{y}) = \Phi(\mathbf{x}) \cdot \Phi(\mathbf{y}) = (\mathbf{x} \cdot \mathbf{y} )^2 $.

In the Lagrangian dual, the dot product $(\bar{x_i} \cdot \bar{x_j})$ appears, which can be replaced by the kernel $k(\bar{x_i}, \bar{x_j}) = (\Phi(\bar{x_i}) \cdot \Phi(\bar{x_j}))$.
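
A minimal sketch of that substitution (the data points, multipliers, and kernel below are placeholders, chosen only so that $\sum_i \alpha_i y_i = 0$ holds):

import numpy as np

def dual_objective(alpha, X, y, kernel):
    # sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j k(x_i, x_j)
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    return alpha.sum() - 0.5 * (alpha * y) @ K @ (alpha * y)

X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, 1.0, -1.0])
alpha = np.array([0.1, 0.2, 0.3])
quad_kernel = lambda u, v: (u @ v) ** 2     # the kernel from the example above

print(dual_objective(alpha, X, y, quad_kernel))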

Doing so has the effect of transforming the input space $\mathbf{x}$ into a feature space $\Phi(\mathbf{x})$, as shown in the diagram:



But notice that $\Phi$ is not explicitly defined;  it is defined via the kernel, as an inner product.  In fact, $\Phi$ is not uniquely determined by a kernel function.

This is a possible transformation function $\Phi$ for our example:
$$ \Phi(\mathbf{x}) = (x_1^2, x_2^2, \sqrt{2} x_1 x_2) $$It just makes $ \Phi(\mathbf{x}) \cdot \Phi(\mathbf{y}) $ equal to $ (\mathbf{x} \cdot \mathbf{y} )^2 $.
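
A quick numerical check of this claim, for 2-dimensional inputs:

import numpy as np

def phi(v):
    # The explicit feature map given above (2-D inputs only).
    return np.array([v[0]**2, v[1]**2, np.sqrt(2) * v[0] * v[1]])

x, y = np.array([3.0, 1.0]), np.array([2.0, 5.0])
print(phi(x) @ phi(y))   # 121.0
print((x @ y) ** 2)      # 121.0 -- the same value, as claimed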

Recall that the inner product induces a norm via:
$ ||\mathbf{x}|| = \sqrt{\mathbf{x} \cdot \mathbf{x}} $
and the distance between 2 points x and y can be expressed as:
$ d(\mathbf{x},\mathbf{y})^2 = || \mathbf{x - y} ||^2 = \mathbf{x} \cdot \mathbf{x} + \mathbf{y} \cdot \mathbf{y} - 2 \mathbf{x} \cdot \mathbf{y}$
Thus, defining the inner product is akin to re-defining the distances between data points, effectively "warping" the input space into a feature space with a different geometry.
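
This identity can be written directly as code; a small sketch that computes the feature-space distance without ever constructing $\Phi$:

import numpy as np

def kernel_distance(x, y, k):
    # d(x, y)^2 = k(x, x) + k(y, y) - 2 k(x, y), computed entirely through the kernel
    return np.sqrt(k(x, x) + k(y, y) - 2.0 * k(x, y))

quad_kernel = lambda u, v: (u @ v) ** 2
x, y = np.array([3.0, 1.0]), np.array([2.0, 5.0])
print(kernel_distance(x, y, quad_kernel))   # the distance after the implicit "warping"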

Some questions:  What determines the dimension of the feature space?  How is the decision surface determined by the parameters $\alpha_i$?

Hilbert Space

A Hilbert space is just a vector space endowed with an inner product or dot product, notation $\langle \mathbf{x},\mathbf{y}\rangle$ or $(\mathbf{x} \cdot \mathbf{y})$, and complete under the norm that this inner product induces.

The Hilbert-Schmidt theorem (used in equation (35) in the Cortes-Vapnik paper) says that the eigenvectors of a compact self-adjoint operator on a Hilbert space H (which can be infinite-dimensional) form an orthonormal basis of H.  In other words, every vector in the space H has a unique representation of the form:
$$ \mathbf{x} = \sum_{i=1}^\infty \alpha_i \mathbf{u}_i $$where the $\mathbf{u}_i$ are the eigenvectors.  This is called a spectral decomposition.


*  *  *

As a digression, Hilbert spaces can be used to define function spaces.  For example, this is the graph of a function with an integer domain:
with values of $f(1), f(2), f(3), ... $ on to infinity.  So, each graph corresponds to one function.

Imagine an infinite dimensional space.  Each point in such a space has dimensions or coordinates $(x_1, x_2, ... )$ up to infinity.  So we can say that each point defines a function via
$$f(1) = x_1, \quad f(2) = x_2, \quad f(3) = x_3, ... $$
In other words, each point corresponds to one function.  That is the idea of a function space.  We can generalize from the integer domain (of cardinality $\aleph_0$) to the continuum ($2^{\aleph_0}$), giving a space of functions over the real line.

End of digression :)

Sunday, November 25, 2012

Introducing Genifer, 2012

I started working on AGI in 2004, and my greatest "achievement" up to 2012 (8 years) is the invention of a new logic.  All my work thus far is focused on the thinking aspect of AGI, deferring the aspects of planning and acting.

Concept composition logic


Consider a complex sentence:
     "John walks to church slowly, thinking about Turing"

It can be decomposed into:

  1. "John walks to church"
  2. "John walks slowly"
  3. "John walks thinking about Turing"
Using the Genifer GUI (github code is here) we can manually create a graph that represents the sentence structure as follows:



The GUI automatically generates a "factorization" formula of the graph.

The key idea of the graph is that if a word A modifies word B, we create an arrow $A \multimap B$.  More than 1 word can modify a target, but each word can only modify 1 target.  This ensures that the result is a tree, and can be expressed as a sum of products, using $+$ and $\times$.

We can use the distributive law of multiplication to expand the factorized formula.
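
A minimal sketch of this expansion, using sympy with non-commutative symbols (the particular factorization below is my own reading of which words modify which in the example):

import sympy as sp

# Non-commutative symbols: word order matters, just as in the logic.
john, walks, to, church, slowly, thinking, about, turing = sp.symbols(
    "john walks to church slowly thinking about turing", commutative=False)

# Guessed factorization of "John walks to church slowly, thinking about Turing":
# the three modifiers of "walks" are summed, then multiplied on the left by "john walks".
factored = john * walks * (to * church + slowly + thinking * about * turing)

# The distributive law recovers the three component sentences (term order may vary).
print(sp.expand(factored))
# john*walks*to*church + john*walks*slowly + john*walks*thinking*about*turing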

A problematic case:  If we translate these 2 sentences into logic:

  1. "John gives the ball to Mary"  and
  2. "John gives $2 to Kevin"
After factorizing and expanding we get:

  1. john gives to mary
  2. john gives the ball
  3. john gives to kevin
  4. john gives $2
so the system may get confused and think that John gives $2 to Mary.  But I reckon that the distributive law is not the real cause of the error.  The error arises because the 2 "johns" are the same John, but the 2 "gives" are different instances of giving.

Variable-free logic

This is an example of the logical definition of "downstairs" from the SUMO ontology:


On the left side is the logic formula, on the right the English translation.  As you can see from the translation, it uses a lot of "$n^{th}$ objects" which are logic variables.  But notice that all these variables invariably belong to some classes (such as "building_level").  So we can just do without variables and directly use ontology classes for matching.

For example:
    "All girls like to dress pretty"
    "Jane is a girl"
implies "Jane likes to dress pretty".

In logic the steps would be:
    girls like to dress pretty
    jane $\in$ girls
    $\Rightarrow$ jane likes to dress pretty
because the "girl" atom in both formulas match, so "jane" is substituted into the conclusion.

Parsing of natural language

This is an example parse tree of an English sentence:



A simple grammar rule is:
   "A noun phrase (NP) followed by a verb phrase (VP) is a sentence (S)".

In classical AI, which uses first-order predicate logic, we have to parse the English sentence to obtain the parse tree (syntax parsing), and then translate the parse tree into logic (semantic parsing).  The final logic formula would be like:  loves(john, mary).  The semantic parsing process is complicated and involves using $\mathbf{\lambda}$-calculus to represent sentence fragments.

In our new logic we can just write:
    NP $\circ$ VP $\in$ S

And we have these as data:
    loves $\circ$ mary $\in$ VP                           "loves mary is a verb phrase"
    john $\in$ NP                                          "john is a noun phrase"
    $\Rightarrow$ john $\circ$ loves $\circ$ mary $\in$ S             "john loves mary is a sentence"

Thus, we have achieved both syntax and semantic parsing using just one logic rule (for each grammar rule)!  This is a tremendous improvement over classical AI.  (But I would add that my new logic is just a kind of syntactic sugar for classical logic.  Nevertheless, the syntactic sugar in this case results in very significant simplification of implementation, whereas the classical approach may be so complicated as to be unusable.)
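
Here is a minimal sketch of that single-rule parse, treating both the grammar rule and the membership facts as plain data (the tuple encoding of composition is my own):

# Membership facts: a composition of words is written as a tuple.
facts = {
    ("john",): "NP",
    ("loves", "mary"): "VP",
}

# The grammar rule "NP composed with VP is an S", also just a piece of data.
rules = [(("NP", "VP"), "S")]

def derive(facts, rules):
    # Try to join any two known constituents according to a rule.
    new = {}
    for cats, result in rules:
        for a, cat_a in facts.items():
            for b, cat_b in facts.items():
                if (cat_a, cat_b) == cats:
                    new[a + b] = result
    return new

print(derive(facts, rules))
# {('john', 'loves', 'mary'): 'S'}   i.e. "john loves mary" is a sentence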

In summary, my innovation provides a new algebraic format of logic that can change our perspectives and open up new algorithmic possibilities.

Friday, November 23, 2012

Introduction to machine learning

[ I'll be giving a talk to introduce Genifer to some Hong Kong people interested in machine learning.  This is the presentation material. ]

Computer vision

Some of you have told me that you're interested in computer vision, so I'll start with visual perception.

The first thing to know is that the computer vision problem is theoretically already solved.  There is no mystery in machine vision:

When we look at a scene, what we have is an image, which is a projection of the scene's objects onto a retina.  The mathematics of such a projection is completely specified by projective geometry, with some changes due to occlusion (blocking of light rays).  The machine vision problem is to compute the inverse of the projection, ie, $f^{-1}$.



Of course you also need a way to represent physical objects in the scene.  There are many data structures designed for that purpose, with Clifford algebra (or geometric algebra) being a more mathematical technique.

So why is the vision problem still unsolved?  I guess it's mainly because researchers try to solve the problem using relatively simple or "clever" mathematical tricks instead of really solving every step along the chain of $f^{-1}$.  To do this would require a large scale project, probably beyond the reach of a single professor with a few grad students.

And if someone takes the trouble to solve vision completely, they might just as well tackle the problem of general intelligence (AGI)!

2 types of machine learning

All of machine learning is a search for hypotheses in a hypothesis space.  The resulting hypothesis may not necessarily be correct, but it is the best given the data, based on some criteria ("scores").

There are 2 major paradigms of machine learning:
  1. logic-based learning
    -- hypotheses are logic formulas
    -- search in the space of logic formulas
    -- uses classical AI search techniques
  2. statistical learning (including neural networks)
    -- emerged in the 1990s
    -- data points exist in high dimensional space
    -- use geometric structures (planes or surfaces) to cut the space into regions
    -- learning is achieved through parameter learning
    -- includes neural networks, support vector machines, ...

Neural networks

This is a neuron:
A neuron receives input signals on its dendrites from other neurons.  When the sum of the inputs exceeds a certain threshold, an action potential is fired and propagates along the axon, which reaches other neurons.

Equation of a single neuron

The equation of a neuron is thus:
$$ \sum w_i x_i \ge \theta $$
where the $x_i$'s are inputs, the $w_i$'s are called the weights, $\theta$ is the threshold.




Graphically, each neuron is a hyper-plane that cuts the hyperspace into 2 regions:

Thus imagine multiple neurons chopping up hyperspace into desired regions.
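
A single neuron as a minimal sketch with hand-picked weights (the AND example is just an illustration):

import numpy as np

def neuron(x, w, theta):
    # Fires (returns 1) when the weighted sum of inputs reaches the threshold.
    return 1 if np.dot(w, x) >= theta else 0

w, theta = np.array([1.0, 1.0]), 1.5   # hand-picked: this neuron computes AND
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, neuron(np.array(x, dtype=float), w, theta))
# Only (1, 1) lies on the firing side of the hyper-plane x1 + x2 = 1.5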

Back-propagation

Back-propagation is a technique invented in the 1970s to train a multi-layer neural network to recognize complex, non-linear patterns.

This is the step function of the inequality $\sum w_i x_i \ge \theta$:

This is replaced by the smooth sigmoid function:

So the equation now becomes:  $\Phi( \sum w_i x_i ) \ge \theta$.

Recall that in calculus, we have the chain rule for differentiation:
$$ D \; f(g(x)) = f ' (g(x)) \cdot g'(x) $$
(The derivative does not exist for the step function because it is not smooth at the threshold point).

Imagine each layer of the neural network as a function $f_i$.  So the chain rule allows us to propagate the error ($\Delta x$) from the lowest layer ($f_1$) to the highest layer ($f_n$):
$$ f_n(f_{n-1}( ... ... f_2(f_1(\Delta x)) ... ) $$
(The formula is just a figurative illustration).  So we can gradually tune the parameters of the neurons in each layer to minimize the error.
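
A minimal back-propagation sketch on the XOR problem (layer sizes, learning rate, and iteration count are arbitrary choices here; note that the error is in fact propagated from the output layer back toward the input layer):

import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Toy data: XOR, which a single hyper-plane cannot separate.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)   # hidden layer
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)   # output layer

for step in range(20000):
    # Forward pass: each layer is a function f_i applied to the previous one.
    h = sigmoid(X @ W1 + b1)
    y = sigmoid(h @ W2 + b2)

    # Backward pass: the chain rule propagates the error back layer by layer.
    err = y - t                            # d(loss)/dy for the squared error
    d2 = err * y * (1 - y)                 # through the output sigmoid
    d1 = (d2 @ W2.T) * h * (1 - h)         # through the hidden sigmoid

    W2 -= 0.5 * h.T @ d2;  b2 -= 0.5 * d2.sum(axis=0)
    W1 -= 0.5 * X.T @ d1;  b1 -= 0.5 * d1.sum(axis=0)

print(np.round(y, 2).ravel())   # should approach [0, 1, 1, 0]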

Support vector machines

[ See next blog post. ]

Logical / relational learning

This is an example (correct) hypothesis:
$ uncle(X, Y) \leftarrow parent(Z, Y) \wedge brother(Z, X) $

This hypothesis, with the parent condition dropped, would be too general:
$ uncle(X, Y) \leftarrow brother(Z, X) $

Whereas this one is too specific:
$ uncle(X, Y) \leftarrow parent(Z, Y) \wedge brother(Z, X) \wedge has\_beard(X) $

Logical inductive learning searches in the space of logic formulas, structured by this general-specific order (also known as subsumption order).

An order ($<$) creates a structure called a lattice.  This is an example of a lattice (Young's lattice):

This is another example of a lattice:

(the lattice of partitions of a 4-element set)

Sometimes we use $\top$ to denote the top element (ie, the most general statement, meaning "everything is correct"), and $\bot$ to denote the bottom element (ie, the most specific statement, meaning "everything is false").

Friday, November 9, 2012

About logic, and how to do it fast

In this article I will talk about 2 things:

  1. The difference between propositional and predicate logic
  2. How to make logic inference fast

What is the difference between propositional logic and predicate logic?

For example, "rainy", "sunny", and "windy" are propositions.  Each is a complete sentence (consider, eg "it is sunny today") that describe a truth or fact, and we assume the sentence has no internal structure.

In contrast, predicate logic allows internal structure of propositions in the form of predicates and functions acting on objects.  For example:  loves(john, mary)  where "loves" is a predicate and "john", "mary" are objects.  They are called first-order objects because we can use the quantifiers $\forall X, \exists X$ to quantify over them.  Second-order logic can quantify over predicates and functions, and so on.

An example of a propositional machine learning algorithm is decision tree learning.  This is the famous example:


All the entries are given as data; the goal is to learn to predict whether to play tennis or not, given previously unseen combinations of weather conditions.

This would be the learned result:


The same data can also be learned by neural networks and support vector machines.  Why?  Because we can encode the same data spatially like this:


where the decision boundary can be learned by a neural network or SVM.  This is why I say that decision trees, ANNs, SVMs, are all spatial classifiers.

The situation is very different in the predicate logic / relational setting.  For example, look at this graph where an arrow $X \rightarrow Y$ indicates "$X$ loves $Y$":


That is a sad world.  This one is happier:


As a digression, sometimes I wonder if a multi-person cycle can exist in real life (eg, $John \rightarrow Mary \rightarrow Paul \rightarrow Anne \rightarrow John$), interpreting the relation as purely sexual attraction.  Such an example would be interesting to know, but I guess it's unlikely to occur.  </digression>

So the problem of relational learning is that data points cannot be represented in space with fixed labels.  For example, we cannot label people with + and -, where "+ loves -", because different people love, and are loved by, different people.  At least, the naive way of representing individuals as points in space does not work.

My "matrix trick" may be able to map propositions to a high-dimensional vector space, but my assumption that concepts are matrices (that are also rotations in space) may be unjustified.  I need to find a set of criteria for matrices to represent concepts faithfully.  This direction is still hopeful.

How to make logic inference fast?


DPLL


The current fastest propositional SAT solvers are variants of DPLL (Davis, Putnam, Logemann, Loveland).  Davis and Putnam's 1958-60 algorithm was a very old one (but not the earliest, which dates back to Claude Shannon's 1940 master's thesis).

Resolution + unification


Robinson then invented resolution in 1963-65, taking ideas from Davis-Putnam and elsewhere.  At that time first-order predicate logic, rather than propositional SAT, was seen as the central focus of research.  Resolution works together with unification (also formulated by Robinson, but already present in Herbrand's work), which enables it to handle first-order logic.

Resolution + unification provers (such as Vampire) are currently the fastest algorithms for theorem proving in first-order logic.  I find this interesting because resolution can also be applied to propositional SAT, but there it is not the fastest; that title goes to DPLL.  It seems that unification is the obstacle that makes first-order (or higher) logic more difficult to solve.

What is SAT?


For example, you have a proposition $P = (a \vee \neg b) \wedge (\neg a \vee c \vee d) \wedge (b \vee d)$ which is in CNF (conjunctive normal form) with 4 variables and 3 clauses.  A truth assignment is a map from the set of variables to {$\top, \bot$}.  The satisfiability problem is to find any assignment such that $P$ evaluates to $\top$, or report failure.

Due to the extreme simplicity of the structure of SAT, it has found connections to many other areas, such as integer programming, matrix algebra, and the Ising model of spin-glasses in physics, to name just a few.

The DPLL algorithm is a top-down search that selects a variable $x$ and recursively searches for satisfaction of $P$ with $x$ assigned to $\top$ and $\bot$ respectively.  Such a step is called a branching decision.  The recursion gradually builds up the partial assignment until it is complete.  Various heuristics are employed to select variables, cache intermediate results, etc.
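
A minimal DPLL sketch over the example CNF $P$ from above (literals are plain strings, negation is a leading "-", and there are no heuristics):

def negate(lit):
    return lit[1:] if lit.startswith("-") else "-" + lit

def dpll(clauses, assignment=()):
    # Simplify the clauses under the current partial assignment.
    simplified = []
    for clause in clauses:
        if any(lit in assignment for lit in clause):
            continue                      # clause already satisfied
        rest = [l for l in clause if negate(l) not in assignment]
        if not rest:
            return None                   # empty clause: backtrack
        simplified.append(rest)
    if not simplified:
        return assignment                 # all clauses satisfied
    # Branching decision: pick a variable and try both truth values.
    lit = simplified[0][0]
    for choice in (lit, negate(lit)):
        result = dpll(simplified, assignment + (choice,))
        if result is not None:
            return result
    return None

# P = (a or not b) and (not a or c or d) and (b or d)
P = [["a", "-b"], ["-a", "c", "d"], ["b", "d"]]
print(dpll(P))   # a satisfying partial assignment, e.g. ('a', 'c', 'b')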

Edit:  DPLL searches in the space of assignments (or "interpretations", or "models").  Whereas [propositional] resolution searches in the space of resolution proofs.  I have explained resolution many times elsewhere, so I'll skip it here.  Basically a resolution proof consists of steps of this form:
           $$ \frac{P \vee Q, \quad \neg P \vee R} {Q \vee R} $$
As you can see, it is qualitatively different from DPLL's search.

How is first-order logic different?


It seems that unification is the obstacle that prevents propositional techniques from applying to first-order logic.

So what is the essence of unification?  For example, if I know
      "all girls like to dress pretty" and
      "Jane is a girl"
I can conclude that
      "Jane likes to dress pretty".
This very simple inference already contains an implicit unification step.  Whenever we have a logic statement that is general, and we want to apply it to yield a more specific conclusion, unification is the step that matches the specific statement to the general one.  Thus, it is unavoidable in any logic that has generalizations.  Propositional logic does not allow quantifications (no $\forall$ and $\exists$), and therefore is not compressive.  The compressive power of first-order logic (or above) comes from quantification and thus unification.
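
For concreteness, a minimal Robinson-style unification sketch (terms are nested tuples, capitalised strings are variables by convention, and the occurs check is omitted; the rendering of the example as likes(X, dress_pretty) is my own):

def is_var(t):
    return isinstance(t, str) and t[:1].isupper()

def substitute(t, theta):
    # Apply the substitution theta to term t, chasing variable bindings.
    if is_var(t):
        return substitute(theta[t], theta) if t in theta else t
    if isinstance(t, tuple):
        return tuple(substitute(x, theta) for x in t)
    return t

def unify(s, t, theta=None):
    # Return a most general unifier of s and t, or None if they don't unify.
    theta = {} if theta is None else theta
    s, t = substitute(s, theta), substitute(t, theta)
    if s == t:
        return theta
    if is_var(s):
        return {**theta, s: t}
    if is_var(t):
        return {**theta, t: s}
    if isinstance(s, tuple) and isinstance(t, tuple) and len(s) == len(t):
        for a, b in zip(s, t):
            theta = unify(a, b, theta)
            if theta is None:
                return None
        return theta
    return None

# "girls like to dress pretty" applied to Jane:
print(unify(("likes", "X", "dress_pretty"), ("likes", "jane", "dress_pretty")))
# {'X': 'jane'}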

Transform logic inference to an algebraic problem?


I have found a nice algebraic form of logic that has the expressive power of higher-order logic and yet is even simpler than first-order logic in syntax.  My hope is to transform the problem of logic inference to an algebraic problem that can be solved efficiently.  So the first step is to recast logic inference as algebra.  Category theory may help in this.


Unification in category theory


In category theory, unification can be construed as a co-equalizer (see Rydeheard and Burstall's book Computational Category Theory).  Let T be the set of terms in our logic.  A substitution $\theta$ is said to unify 2 terms $(s,t)$  if  $\theta s = \theta t$.  If we define 2 functions $f, g$ that point from 1 to T, so that $f(1) = s$ and $g(1) = t$, then the most general unifier (mgu) is exactly the co-equalizer of $f$ and $g$ in the category T.

The book also describes how to model equational logic in categories, but the problem is that the formulas in an AGI are mainly simple propositions that are not equational.  How can deduction with such propositions be recast as solving algebraic equations....?

Unifying thinking and acting

I've made some new friends who may be interested in AGI, so this is a summary of the theory of Genifer so far.

Central dogma

First I assume you understand the central dogma of logic-based AI:  "Knowledge is represented as a set of logic formulas in a knowledge base (KB), and the 3 modes of reasoning are deduction, abduction, and induction."  These basics are explained in all standard AI textbooks, such as AI: a modern approach.

Genifer also has some innovations that enhance classical logic, such as fuzzy-probabilistic truth values and concept composition (similar to combinatory logic and combinatory categorial  grammar), but they are not the subject of this article.

AGI = thinking + acting

If you buy my theory, I am now confident that Genifer's logic is sufficient to cover all the "thinking" aspects, and am also very satisfied with its elegance and simplicity.  What remains is to extend this system to cover planning / acting aspects.

Recently I realized that reinforcement learning can be a suitable over-arching mechanism to unify thinking and acting.

Reinforcement learning (RL)

In a nutshell, RL takes as input a set of actions, a set of states, a reward function for each state, and outputs a policy of actions that maximizes rewards in the long run.  One can also say that RL searches for plans in the plan space.  Richard Sutton and Andrew Barto's 1998 book contains a superb exposition of RL.  What is special about RL is that the problem setting is of an agent acting in an environment, in contrast to other machine learning settings that are just pattern recognition.  Thus, RL is uniquely suitable as the top-level architecture of an AGI.

Recall BF Skinner's behaviorism that tries to reduce human psychology to a set of stimulus-response associations, for example food $\rightarrow$ saliva.  We know intuitively that this is inadequate because higher animals also build cognitive models of the world in their brains.

In abstract terms, the maintenance of such a cognitive model can be viewed as actions that manipulate (brain) memory states.  But due to the astronomical number of possible cognitive states (and thus the number of "cognitive actions"), reinforcement learning cannot effectively deal with cognition -- in other words, it is impractical for RL to automatically figure out how to perform "thinking".  (A critical limitation of RL is that the set of possible actions must not be too large.)  This is where the need for logic comes in.  A logic system is a set of inductive biases that tells us how to maintain truths and perform sensible reasoning.

The need to include a cognitive model of the world in an intelligent agent is already recognized in Sutton and Barto's book (§9.2, "Integrating planning, acting, and learning"), where the resulting design is called the Dyna architecture.

For your convenience, this is their diagram for the Dyna architecture:



The thinking-acting (logic-RL) interface

From the central dogma we have the 3 modes of inference: deduction, abduction, and induction.  But these are not algorithms per se, rather the forms of search problems.  It is still up to us to design the algorithms that do the searching.  For example, a deduction algorithm may reach a point in the search space where it would be advantageous to ask the user for a key assumption, or to look in the environment to confirm that assumption, which allows it to produce a nicer proof.  This is an example of "inference control" that involves user or environmental interaction.  The complex strategies of inference control seem most suitable for RL to learn.

So, we should break down the inference algorithms (deduction, etc) into micro-steps and then allow RL to learn policies of these actions.  For example, RL may tell deduction to perform search for a few layers, then check if the current best solution is good enough under current time constraints, or continue to search deeper.

Meta-reasoning

So far so good, but there is one more problem.  We are using RL to learn "how to think".  Sometimes the decisions on how to think are also dependent on a cognitive model of the thinking process itself.  For example, if the current query is related to quantum mechanics and the AGI knows that it is not good at that, it should be more inclined to consult external references or experts.  Another example is when an internal contradiction arises among beliefs: the AGI should introspect on how each of the contradicting beliefs was derived, and then decide which one(s) to retract.

Interestingly, the cognitive model of the thinking process itself can be handled by the very logic engine that handles the cognitive modeling of the external world.  The only difference is that the meta-cognitive model would produce "rewards" that guide the RL to perform better thinking.

So, this is my current design for a complete AGI architecture.  My next task is to spell out the details.  Thanks for reading

Friday, October 26, 2012

Playing with WordNet (the code)

This is how I did it:

Getting the word lists

From this Wikipedia page, I got the lists of 100, 200, and 1000 basic English words.

How to access WordNet from Python

Python is particularly good for accessing WordNet, thanks to this book and the NLTK library written by the authors.

After installing NLTK (and the WordNet corpus), in Python, you can just:

>>> from nltk.corpus import wordnet as wn
>>> wn.synsets("cats")

and WordNet will return a bunch of "synonym sets" of the word cat.  In this case they are:

[Synset('cat.n.01'), Synset('guy.n.01'), Synset('cat.n.03'), Synset('kat.n.01'), Synset("cat-o'-nine-tails.n.01"), Synset('caterpillar.n.02'), Synset('big_cat.n.01'), Synset('computerized_tomography.n.01'), Synset('cat.v.01'), Synset('vomit.v.01')]

Synsets are the basic unit of meanings in WordNet.

To get the similarity distance between 2 synsets, do:

    synset1.path_similarity(synset2)
There are a bunch of XYZ_similarity measures, but their quality appears similar to each other.  I did not try them all, as I found that the overall quality of WordNet as an ontology is problematic.

For more details, read the book's section on WordNet.

Basics of clustering

Suppose you have 1000 items, and you know the distance (ie, a dissimilarity measure) between every pair of items.  Then the clustering algorithm will be able to calculate the clusters for you.  This is all you need as a user of clustering.  The algorithm is also pretty simple (there are a number of variants), which you can find in machine learning textbooks.  The idea is something like iterating through the set of items, and gradually creating and adding items to clusters.

The input is in the form of a similarity matrix, for example (I made this up, which is different from WordNet's):

         cat    dog    man    worm
cat      1.0
dog      0.6    1.0
man      0.5    0.7    1.0
worm     0.3    0.3    0.1    1.0

The entries above the diagonal are not needed, because they're just the mirror image of the lower ones.  This is called a lower triangular matrix.  Sometimes the clustering library wants the upper one; then you need to do the matrix transpose, which is just exchanging the i, j indexes.

This is a helpful illustration of transpose:

Sometimes the library function requires you to flatten the triangular matrix into a linear array.  Doing it row by row to the lower triangle, we'll get:  [1.0, 0.6, 1.0, 0.5, 0.7, 1.0, 0.3, 0.3, 0.1, 1.0].  The mathematical formula for this is:
    cell $(i,j)$ in matrix $ \mapsto $ element $j + i(i-1)/2$ in array
and the whole array is of size $n(n+1)/2$ with the diagonal included, or $n(n-1)/2$ without it.  Working out this formula may help you debug things if need be.  One common mistake is confusing lower and upper.

Another glitch is that the library may require dissimilarity instead of similarity values.  So I have to do $(1.0 - x)$.
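
A small sketch that turns the similarity table above into the flattened dissimilarity array (this version skips the diagonal; whether your library wants the diagonal included is worth checking):

import numpy as np

words = ["cat", "dog", "man", "worm"]
sim = np.array([[1.0, 0.0, 0.0, 0.0],
                [0.6, 1.0, 0.0, 0.0],
                [0.5, 0.7, 1.0, 0.0],
                [0.3, 0.3, 0.1, 1.0]])   # lower triangle of the table above

# Row by row over the lower triangle, converting similarity to dissimilarity with (1.0 - x).
condensed = [1.0 - sim[i, j] for i in range(len(words)) for j in range(i)]
print(condensed)   # [0.4, 0.5, 0.3, 0.7, 0.7, 0.9] -- n(n-1)/2 = 6 entries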

Looking at the data in a spreadsheet


I find it helpful to generate a CSV (comma separated values) file and use OpenOffice Calc to inspect it, eg:


I wrote some BASIC scripts to highlight the diagonal, special values such as 1.0, and words with all entries empty.  For example, I just need to "record macro", color a cell red, move the cell cursor 1 step down diagonally, and then add a For-Next loop around the generated BASIC code.

Calling the clustering library

I tried 2 libraries:
  • PyCluster (Python)
  • fastcluster (R)

For PyCluster my incantation is:
from Pycluster import *

tree = treecluster(data=None, mask=None, weight=None, transpose=0, method='a', dist='e', distancematrix=triangle)

For R, I need to use a special function "as.dist()" to convert the lower triangle into a distance matrix before calling:
hc <- fastcluster::hclust(dis, method="single")

All my code is available here:
By the way, the code and the directory are unpolished.  This is not my serious project.  I just decided to share the raw stuff :)

Visualization

The output of R can be visualized simply by calling the dendrogram plot:
den <- as.dendrogram(hc)
plot(den, ylim=c(1,n), xlim=c(0,1), horiz=TRUE, dLeaf=-0.03)

The problem with R is that it's not easy to manipulate the visual looks (I couldn't find the relevant documentation online, but the book "R Graphics" may have what I need).

This is the dendrogram made by R (without appropriate labels):
With Python I have more flexibility, by exporting to an external graph visualizer.  I tried:
  • yEd
  • GraphViz (Gvedit)
  • Gephi
  • Ubigraph
yEd is my favorite (whose output you see), and it accepts GraphML files as input.  I just saved a test file of the visual style I want, then emitted my data in GraphML format, from Python.

GraphViz Gvedit is similar, but accepts .DOT files.  .DOT is even simpler than GraphML.  For example, this is a .DOT directed graph:

   digraph G {
       man -> woman
       dog -> cat
   }

But GraphViz's layout function is not so intuitive to use (ie, it requires reading about and typing parameters, whereas yEd provides a GUI; of course a lazy person like me eschews that).

Gephi looks quite sophisticated, but unfortunately it is unstable on my Windows XP; I also tried it in a Linux virtual box, which failed as well.  Perhaps the reason is that it requires OpenGL, which must be installed from your video card's manufacturer, and my version was outdated and I couldn't fix it.  A nuisance.  If you think your OpenGL version is good, you may try it.

Ubigraph lets you visualize graphs dynamically.  It runs a server which can be easily accessed via a remote RPC protocol.  So I can see my graph's nodes spring out in real time, click on nodes and program their behavior, etc.  But it gets slow after a while as the graph grows big.  Also, the Ubigraph project is now out of maintenance.  Some functions are still imperfect...

Last but not least

Some people have noticed the "loners left out" double entendre in my last post, but it was originally unintended.  Afterwards I realized I've hit on something very interesting :)

I've started to get involved in social entrepreneurship, and came across the idea of the importance of social inclusion.  Milton's phrase "They also serve who only stand and wait" was in the back of my mind when I wrote that.

Saturday, October 20, 2012

Playing with WordNet (first impressions)

What I want is to build an ontology for use with common-sense inference.  An ontology induces a distance measure amongst words (or concepts), which can be used to distribute the concepts inside a high-dimensional sphere, with the concept of "everything" at the center.  This will be useful for my "matrix trick" (which I have explained in some slides, but I'll write about it later in this blog).

So, what I had in my mind was an ontology that looked like this (I just made this up):


But I quickly found out that the reality of ontologies is very different from what I expected!

The first problem is that WordNet has too many words, but I want to load an upper ontology into main memory and still be able to do inference search.  So my idea is to mine the ontology on demand.  For example, if I need an upper ontology of 1000 basic English words, then I can just construct the ontology by looking up the relevant relations in WordNet.

The idea seemed nice, but I found out that WordNet has too many hypernyms (= super-classes) for nouns and almost no hypernyms for adverbs.  For example, look at this visualization of the hypernyms of "cat" and "dog":

We have some surprises such as "a dog is a kind of unpleasant woman".

Also, some meanings are not what we expect, for example, "cat" meaning a whip, or "dog" meaning a mechanical device.  Because there is no fuzziness or probability attached, it is hard to distinguish between common and uncommon usages.

The red lines are links that connect "cat" to "dog".  There is the obvious connection via "carnivore" but there are also 2 other connections based on slang usage.

This is just the visualization of 2 iterations of hypernyms.  If we continue like this, the number of hypernyms will grow very large before they start to converge (because everything will ultimately end up as "entity").
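
A small sketch of this kind of hypernym expansion (assuming NLTK and the WordNet corpus are installed; the helper here is my own, not the script used for the figures):

from nltk.corpus import wordnet as wn

def hypernyms_upto(words, steps=2):
    # Gather the hypernym synsets reachable from the words within a few steps.
    frontier = {s for w in words for s in wn.synsets(w)}
    gathered = set()
    for _ in range(steps):
        frontier = {h for s in frontier for h in s.hypernyms()}
        gathered |= frontier
    return gathered

print(wn.synset("cat.n.01").hypernyms())     # [Synset('feline.n.01')]
print(len(hypernyms_upto(["cat", "dog"])))   # the node count grows quickly with each step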

This is a visualization of 1 step of hypernyms (blue) for 100 basic words (red):

Notice that at the bottom there are some loners left out of the game because they have no hypernyms (mostly adverbs).  Nevertheless, they are important words and should be included in a good ontology of concepts.

This is the visualization of 2 steps of hypernyms (blue) for 100 basic words (red):

The iteration will terminate in ~10 steps for most branches, leaving us with 1000's of nodes and 100K's of linkages.  It is a non-trivial algorithmic problem to prune the graph.  Also disappointing is the fact that this graph will have many more intermediate nodes than the basic English words we started with -- something that I didn't expect.

At this point I gave up and started to try another method:  WordNet provides some similarity measures (such as "path_similarity") between words.  Maybe I can use these distances to cluster the words flatly?

This is a visualization of the similarities between 100 words:

As you can see, again, some adverbs or special words are left out at the bottom.

Also, there is no strong link between "left" and "right" (the strength is only 0.333, relatively low).  This is where I think WordNet's way of defining similarity in [0,1] is wrong.  Similarity should be measured in [-1,1] so that opposite concepts can be measured properly.

My conclusion so far:  "In artificial intelligence, avoid hand-crafting low-level data;  It's usually better to machine-learn!"

Looking forward: we may have to set up Genifer like a baby to learn a small set of vocabularies.  Or, we can use a flawed ontology but somehow supplement it to correct the deficiencies...  (Perhaps WordNet wasn't designed for using adverbs in a way similar to nouns.)

Sunday, February 12, 2012

unification = the calculus of concepts

Today I just had an insight:

Haskell Curry proposed combinatory logic as a logica universalis, but it ran into inconsistency problems.  (I'm trying to use fuzzy-probabilistic truth values to get around that problem, but that's a different topic.)

So, in 1963 JA Robinson discovered resolution, which is really unification + propositional resolution.  Unification decides if 2 terms can be made equationally identical.  Propositional resolution deals with the "calculus of thinking" at the proposition level.

Combinatory logic provides a free way to compose terms via "application".  I regard terms as concepts.  For example:
   "tall handsome guy"
is the combination of the concepts
   tall ∙ (handsome ∙ guy).

Now, a few examples:
"tall handsome guy" is equivalent to "handsome tall guy";
"very tall guy" implies "tall guy";  but
"very tall guy" does not equal "tall very guy";

Thus the unification theory is modulo some special rules akin to commutativity, associativity, etc, and certain reduction rules.  In other words, unification modulo a theory = "the calculus of concepts".

So we have a neat decomposition:
    calculus of thoughts = calculus of concepts
                                    + calculus of propositions

Video: Genifer in 4 minutes

This was made 2 months ago.  Just posting it here so people can find it from our blog: