Saturday, July 31, 2010

Current Design

We've reached a point where we're worrying about implementation more than design for a bit, so I thought it might be good to record my understanding of the design being implemented.

Language:

statement := head <- body truthvalue
head := atomicstatement
body := [[~] atomicstatement { & [~] atomicstatement }*]
atomicstatement := predicate(term{,term}*)
term := constant | variable | function(term{,term}*)
truthvalue := (probability, confidence)
probability is in range [0,1]
confidence is in range [0,1]

I.e., a statement has a conclusion (which is an atomic statement) in the head, and a possibly empty list of possibly negated atomic statements, which are conditions, in its body. It records a probability for the conclusion given the conditions, including a (NARS-style) confidence for that conclusion. (If the list is empty, then we are just asserting something about the head's probability; if it's nonempty, we're asserting a conditional probability.)

Functions may be Skolem functions, which is how the system expresses existential statements. Universal statements are expressed by leaving variables unbound; in this case, the probability associated with the statement is intended to be the probability that a randomly chosen instantiation is true.

A simple query for the truth value of an atomic statement proceeds in roughly the following way. We are trying to construct a Bayesian network to answer the query. The query statement is the first node, and each new node is another atomic statement which is deemed relevant; the links in the network come from the conditional rules. At each step, we look for new statements which "match," meaning that they have an atomic formula whose free variables can be instantiated to match one of the current nodes in the network.
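To make that search loop concrete, here is a minimal sketch in Python; the rule format, the toy unifier, and grow_network are all illustrative assumptions, not the actual Genifer code.

    # Grow a network outward from the query node: each rule whose head
    # unifies with a known node contributes a link, and the rule's
    # instantiated conditions become new nodes to expand.

    def unify(pattern, node):
        # Toy unifier: variables are strings starting with '?'.
        if len(pattern) != len(node):
            return None
        subst = {}
        for p, n in zip(pattern, node):
            if isinstance(p, str) and p.startswith('?'):
                if subst.get(p, n) != n:
                    return None
                subst[p] = n
            elif p != n:
                return None
        return subst

    def apply_subst(atom, subst):
        return tuple(subst.get(t, t) for t in atom)

    def grow_network(query, rules):
        nodes, links, frontier = {query}, [], [query]
        while frontier:
            node = frontier.pop(0)
            for head, body in rules:              # rule: head <- body
                subst = unify(head, node)
                if subst is None:
                    continue
                conditions = [apply_subst(a, subst) for a in body]
                links.append((conditions, node))
                for c in conditions:
                    if c not in nodes:
                        nodes.add(c)
                        frontier.append(c)
        return nodes, links

    rules = [(('arrives', '?x'), [('punctual', '?x')]),   # arrives <- punctual
             (('punctual', '?x'), [])]                    # empty condition list
    print(grow_network(('arrives', 'sue'), rules))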

When a new matching statement is found, it is instantiated and added to the growing network. In the case of statements with empty condition-lists, matching provides a probability estimate for one of the nodes, which allows belief propagation to occur. When two different rules match at their heads, however (conditions or no conditions), we cannot simply add them to the network; we have a rule combination problem (see my previous post).

The procedure for dealing with rule combination problems that we've decided on at present is as follows (a code sketch follows the list). (Rules are given for probability only; deciding how to propagate confidence is not hard, but we haven't done it yet.)

  1. If the two statements have conditions which are mutually exclusive, we can just add the two resulting probabilities together. With the simple statement format given above, this is easy to test for: when one rule has A in its condition list and the other has ~A, the two are mutually exclusive.
  2. Failing this, we check for primacy of one rule over another. This is equivalent to my notion of "derived rule" in my post on the combination problem, but we'll be storing it as causal information (YKY has convinced me that causal information really does encompass derivative-ness information). The general idea is to take the more causally primal rule whenever we know that one is more primal.
  3. Finally, if neither of the first two conditions holds, we do rule averaging. My reason for advocating noisy-or was that it allowed logical reasoning, but the present formalism allows logical reasoning with simple averaging, so long as we take confidence into account: a rule conveys 0 confidence when its conditions don't match, so it does not weaken other rules when averaged with them.
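Here is the promised sketch of the three steps in Python. The rule format and the primacy relation are assumptions for illustration; p is taken to be a rule's contribution to P(head), i.e. P(head | conditions) already weighted by P(conditions), so that step 1 is just the law of total probability.

    def mutually_exclusive(conds1, conds2):
        # Step 1 test: some A appears in one list and ~A in the other.
        return any(('~', a) in conds2 for a in conds1) or \
               any(('~', a) in conds1 for a in conds2)

    def combine(est1, est2, primacy=frozenset()):
        # est = (conditions, p, c); primacy is a set of ordered pairs of
        # condition lists, recording which rule is more causally primal.
        conds1, p1, c1 = est1
        conds2, p2, c2 = est2
        if mutually_exclusive(conds1, conds2):
            return p1 + p2          # step 1: contributions simply add
        if (tuple(conds1), tuple(conds2)) in primacy:
            return p1               # step 2: take the more primal rule
        if (tuple(conds2), tuple(conds1)) in primacy:
            return p2
        # Step 3: confidence-weighted averaging; a rule whose conditions
        # did not match carries confidence 0, so it cannot dilute the rest.
        return (c1 * p1 + c2 * p2) / (c1 + c2) if c1 + c2 else 0.5

    # P(wet|rain)=0.9 with P(rain)=0.3, P(wet|~rain)=0.1 with P(~rain)=0.7:
    print(combine((['rain'], 0.9 * 0.3, 0.8), ([('~', 'rain')], 0.1 * 0.7, 0.8)))
    # -> 0.34, by the law of total probability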
Additional random topics.

The weights for rules can be adjusted on-line by counting instances where the conditions are met, and incrementing the confidence based on the probability that the conditions are indeed satisfied in the instance. This rule could (should) be added to the above query-answering algorithm, but it means we are no longer simply building a Bayesian network.
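A hedged sketch of what that update might look like, reading confidence NARS-style as c = n/(n+k): incrementing the evidence count n by the probability that the conditions held is my interpretation of the rule above, and k and the names are assumptions.

    K = 1.0   # evidential horizon constant (assumed)

    def update_rule(p, n, head_true, p_conditions):
        # Fold one observed instance into a rule's frequency and evidence,
        # weighting the instance by how surely its conditions were met.
        w = p_conditions
        if n + w > 0:
            p = (p * n + (1.0 if head_true else 0.0) * w) / (n + w)
        n = n + w
        return p, n, n / (n + K)   # new probability, evidence, confidence

    print(update_rule(p=0.9, n=4.0, head_true=False, p_conditions=0.8))
    # -> (0.75, 4.8, 0.827...)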

The design right now makes totally ubiquitous independence assumptions; it would be interesting to find a way to dynamically relax these assumptions when we have more processing power. It's very expensive to relax too many, but it might be worthwhile to relax a few (and maybe save the results in a way that avoids re-doing the expensive computations).

Tuesday, July 27, 2010

Rule Combination

The rule combination problem is: if we have X->A and Y->A, and we get probabilistic information about both X and Y, how do we integrate the two resulting estimates of A?

The normal Bayes net semantics would give us an interval of possible probabilities, based on the probability of X&Y (which we may not know, but perhaps we could use some naive estimate). We could then take the midpoint of this interval, or some such. Call this combination approach "averaging".

Another approach is to use the noisy-or combination rule, which can be thought of as adding the two probabilities together (though we also subtract out the product of the two, which makes sure we don't add up to greater than 1). This is great for making highly mathematical systems (such as higher-order logic) probabilistic, because it captures the monotonic nature of logical reasoning: finding additional rules which apply can only bolster the probability of A, never decrease it. A system based on averaging will not be very good for doing this, because if several arguments don't work very well but one does, then an averaging approach will come out with a not-very-large probability; an additive approach, however, will come up with a large probability (as seems right).

A purely additive approach is limited, however: sometimes we want additional information to detract probability! The noisy-and rule does this. Noisy-and adds together the probabilities that an event *won't* happen contributed by the individual rules, in the exact way that noisy-or adds the probabilities that it *will* happen. This makes sense when we want newly found rules to represent additional reasons that an event might not happen. Noisy-and can complement noisy-or if desired, in a way that preserves the additive "mathematical" style of rule combination where it's desired, but allows subtractive combination where it's more appropriate. ("Bayesian logic programming" uses the approach of allowing specific predicates to declare themselves noisy-or predicates or noisy-and predicates; rules whose consequence is an or-predicate will get treated additively, while rules whose consequence is an and-predicate will get treated as subtractive. YKY suggests an alternate approach, in which we declare individual rules to be either additive or subtractive; this seems a bit more powerful, and loses nothing.)
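For concreteness, both combination rules in a few lines of Python (standard formulas, not project code):

    from functools import reduce

    def noisy_or(ps):
        # P = 1 - prod(1 - p_i): extra rules can only raise the probability.
        return 1.0 - reduce(lambda acc, p: acc * (1.0 - p), ps, 1.0)

    def noisy_and(ps):
        # Failure probabilities combine the way successes do in noisy-or,
        # which works out to P = prod(p_i): extra rules can only lower it.
        return 1.0 - noisy_or([1.0 - p for p in ps])

    print(noisy_or([0.6, 0.7]))    # 0.88 -- two weak reasons bolster each other
    print(noisy_and([0.6, 0.7]))   # 0.42 -- two objections compound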

One problem is that the probabilities given by additive and subtractive rules can no longer be learned simply by counting conditional frequencies. Learning does not become impossible (and perhaps not even very difficult), but the meaning of the numbers is no longer very local, since a probability must be calculated as a sum over many rules. (Think about it: if we learned X->A and Y->A by counting the conditional frequencies, our estimate for A in the presence of both X and Y would be way too high.) The averaging approach may allow simpler learning algorithms. Yet, it seems less powerful as a representational choice.

A problem related to rule combination is how we should reason using lifted estimates. If we have a probability for C, a formula which has some free variables, then the Genifer approach is to allow this to be instantiated and consider it a probability estimate for the instantiated formula as well. Supposing it instantiates to B, and we've also got a rule B->E, which connects it to a larger Bayes Net we're making in memory, the way to handle this is clear: it's fine, even necessary, for initial items in Bayes Nets to have probabilities like this. Suppose, though, that it instead instantiates to a formula that's in the middle of a network, Q->B->E. How do we use the info now? Again, we have an information combination problem. Similar thoughts about averaging, additive, and subtractive approaches apply.

Fortunately, these two problems are just examples of a single more general problem. Assume we have rules X->A and Y->B, where A and B unify to C (that is, instantiating either A or B can result in C). Further, treat unconditional probabilities (like A) as just conditional probabilities where the list of conditions is empty -- that is, X, Y, or both may be tautologous. This is the general form of the combination problem.

If this was all there was to it, it would boil down to that choice: averaging vs additive/subtractive combination, or perhaps some mix of the two. (We could declare rules to be either a frequency, a reason-for, or a reason-against; but we'd then need a special combination rule to tell us what to do when we had a mix of frequencies and additive/subtractive rules, and it's not clear at all what combination would be appropriate... the result would seem rather hackish.) However, there are further complications.

It seems appropriate to sometimes derive rules from one another. A->B and B->C can combine to give an abbreviated rule A->C, summarizing what knowing A does to C's probability. However, all of the combination rules so far will mess up if we allow such derivations to take place. An additive (or subtractive) rule will now take the additional rule into account as well, resulting in twice as much modification from A as is merited. An averaging rule will not do that, but if we have other knowledge about B, then an averaging rule will fallaciously discount its importance. The effect will multiply as more derivations are allowed. It seems that, essentially, we are taking the same information into account more than once. This suggests that we should avoid applying two different rules if we know that one is derived from the other. Ideally, it seems we should prefer the original rules to the derived rules; however, that's only the case when we have enough time to reconstruct all the steps which are accounted for in the derivation (to check them for the particular situation); before that point, it's less clear when we should trust which rule (a partial derivation may already take into account critical situation-specific info that would invalidate the derived rule, but on the other hand, a derived rule may summarize a long and valuable chain of reasoning which is essentially sound but which would take too long to replicate). Some heuristic for choosing between the two should be employed.

In other words: the rule combination problem is solved by first asking if one rule is derived from the other. If not, either averaging or additive/subtractive combination can be done (a choice which has to be made for the system); if so, a heuristic is employed to choose between the two rules.

This has a strong flavor of NARS revision, for those who know NARS.

YKY has pointed out an interesting example problem to use for thinking about rule combination. I will give two solutions. The problem is as follows: Sue is punctual. Punctual people arrive on time. However, Sue dies in a car crash on the way to a meeting. Dead people do not arrive on time. This is represented with "P" ("Sue is punctual"), "P->A" ("If someone is punctual she'll probably arrive on time"), "D" ("Sue is dead"), "D->~A" ("Dead people probably do not arrive on time"). Assume all the rules get properly instantiated to talk about Sue, and of course to talk about some other specific variables we're hiding (what time it is, arrive where, etc).

We get a very bad result if we just apply noisy-or: noticing that Sue is dead will actually increase her chances of arrival slightly! We do just a little better by averaging, but it's still not satisfactory. One good solution here is to model the rule for death as subtractive: being dead subtracts from your chances no matter what they were. This gets a good result; that is, the probability will be very near zero.
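With some made-up numbers (the 0.9 and 0.01 are purely illustrative), the failure and the subtractive fix look like this:

    p_arrive_if_punctual = 0.90   # from P->A
    p_arrive_if_dead     = 0.01   # what D->~A leaves for arrival

    print(1 - (1 - p_arrive_if_punctual) * (1 - p_arrive_if_dead))
    # noisy-or: 0.901 > 0.9 -- learning Sue is dead RAISED the estimate!

    print((p_arrive_if_punctual + p_arrive_if_dead) / 2)
    # averaging: 0.455 -- a little better, but still not satisfactory

    # Subtractive fix: treat D->~A as a reason against, so being dead
    # subtracts from the chances no matter what they were.
    p_not_arrive_if_dead = 0.99
    print(p_arrive_if_punctual * (1 - p_not_arrive_if_dead))   # 0.009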

Another solution, though, can be hit upon: perhaps death overrides punctuality not because it is a negative rule, but rather because the punctuality rule must ultimately be seen as derived from a chain of reasoning which includes the assumption that the subject is alive! This seems more fundamentally right to me. It also brings up an idea: it might be the case that rules entered in the starting knowledge base may not all be "basic" in the sense of non-derived: rather, it may be that some rules should already be noted as derived from one another, even if we don't actually have all the rules necessary to replicate the derivation. Consider a system like Cyc. Cyc has different "layers" of concepts, ranging from physical concepts (hard, soft, long, short...) to social concepts (like, dislike, ...). The idea I'm putting forward is that in a probabilistic version of this sort of KB, it might be useful to tell the system that social phenomena ultimately depend on physical phenomena, even if we can't trace out the lines of dependence in a precise way. What this does is let the system know that physical facts will override social facts when the two suggest conflicting outcomes; this will usually be the correct inference. (There are exceptions, however; biological organisms tend to systematically violate rules of thumb which work for simpler physical systems, such as dissipation of heat (through homeostasis), Newton-style physics (through self-guided motion), et cetera.)

In fact, it might be useful to be able to learn the dependence relationships of rules without actually having derived one from another... This would reflect the way that we assume, based on large amounts of evidence, that biology is just a very complicated dance of chemistry (ie, biological rules follow from combinations of chemical rules) without actually knowing how everything works. That seems like dangerous territory, though, so it'll have to be a thought for later...

Monday, July 26, 2010

Causality

Why causality is not completely captured by Bayesian networks

As [Koller & Friedman 2009], Probabilistic Graphical Models, Ch 21, explains:  We know that a Bayesian network is directed, but the direction of the arrows does not have to be meaningful.  They can even be anti-temporal.  On the other hand, it is common wisdom that a "good" BN structure should correspond to causality, in that an edge X -> Y often suggests that X "causes" Y, either directly or indirectly.  Bayesian networks with a causal structure tend to be sparser and more natural.  However, as long as the network structure is capable of representing the underlying joint distribution correctly, the answers that we obtain to probabilistic queries are the same, regardless of whether the network structure is causal or not.

It seems that causal relations cannot be captured simply by probabilistic relations but require some form of inductive algorithm to obtain, such as the IC (for "inductive causation") algorithm proposed by [Pearl 2000].

Aug 2010 Update:  a new set of slides on causality.

Tuesday, July 20, 2010

GOLOG

Abram,

GOLOG sounds similar to your idea of encoding actions in the logic.

There are 2 types of terms:  fluents and actions.

Fluents evaluate to true or false.  For example, At(room1) = true.

Examples of actions are Go(up) and Pick(package1, bag1).

And a GOLOG program can consist of statements such as sequences, tests, nondeterministic choices, if-then-else, and while loops.

These statements can be combined with regular Prolog statements to create a new type of programming, but I'm not sure if it includes automated planning. (Will update later when I finish the lecture notes, Action Programming Languages by Thielscher, 2008).
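Pending the lecture notes, here is my guess at the term representation, sketched in Python; the constructor names and statement forms are assumptions based on standard GOLOG, not taken from Thielscher.

    # Fluents and actions as plain terms; programs as nested statements.
    Fluent = lambda name, *args: ('fluent', name, args)
    Action = lambda name, *args: ('action', name, args)

    at_room1 = Fluent('At', 'room1')          # evaluates to true or false
    go_up    = Action('Go', 'up')
    pick     = Action('Pick', 'package1', 'bag1')

    # sequence, test, if-then-else, while -- the usual GOLOG constructs
    program = ('seq',
               ('test', at_room1),
               go_up,
               ('if', at_room1, pick, ('while', at_room1, go_up)))
    print(program)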

Friday, July 16, 2010

planning paradigms

I plan to unify the following planning paradigms (this is just a draft). Let me outline the scope and try to ask / answer some basic questions:

The paradigms are:
  1. DP (deductive planning) -- in which planning is reduced to inference
  2. RL (reinforcement learning)
  3. classical, state-space-search planning (eg STRIPS)
  4. partial order planning, GRAPHPLAN, HTN (hierarchical task networks)
  5. utility theory, decision theory
  6. decision networks / influence diagrams
  7. MDP (Markov decision processes), POMDP (partially observable MDPs)
  8. decision-theoretic agents
  9. "action theories" -- how to specify actions in logic
Some basic questions:

DP

Should planning subsume inference, or inference subsume planning, or shall we have both?  Are there advantages (or disadvantages) if we have both mechanisms interleaved?  Or maybe it'll just create a mess -- "techniques must not be multiplied beyond necessity?"

A basic question is whether DP can mimic the mechanisms of:
  1. decomposition
  2. partial order planning
  3. hierarchical task planning
[ TO-DO:  insert architecture diagram ]

RL

My latest conclusion is:
      DP + inductive learning   subsumes   RL

RL is the learning of an optimal policy.  A policy is a set of mappings:
        states -> actions

In deductive planning, we can have backward-chaining from a goal, but also forward-chaining from the current state.  The latter approach seems equivalent to RL, especially when augmented with inductive learning, where the learner can learn rules that function like the mappings above.

However, I should add that DP + IL (inductive learning) only mimics RL functionally, but is not equivalent to RL algorithmically.  Perhaps there are some merits in the RL algorithm?

One thing can be said for sure:  RL cannot simply co-exist with another planning method without creating conflicts.  In my earlier analysis I concluded that the planner should override RL whenever it has solutions, since it is knowledge-based and more sophisticated.  RL is useful when the planner is clueless, particularly in areas where knowledge is scant.
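A sketch of that arbitration in Python (both callables are hypothetical stand-ins):

    def act(state, planner, rl_policy):
        plan = planner(state)      # deductive planning: may return None
        if plan:
            return plan[0]         # the planner wins when it has a solution
        return rl_policy(state)    # fall back on the learned policy

    toy_planner = lambda s: ['open-door', 'walk'] if s == 'at-door' else None
    toy_policy  = lambda s: 'explore'
    print(act('at-door', toy_planner, toy_policy))   # open-door
    print(act('lost',    toy_planner, toy_policy))   # explore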

MDPs


What is their relation to classical planning?

To be continued...

Thursday, July 15, 2010

Thoughts on Goals

These are some thoughts on what might constitute an effective goal system.

One use-case I like to keep in mind is inference guidance. The goal system would not necessarily really be called upon for guidance at every inference step (this might be too expensive), but imagining that it was, we'd like to have a system that could perform fairly well.

The basic case, then, would be to think of a query as a goal and create a system which would respond to it in a manner similar to our existing inference algorithm (utilizing some background knowledge about how to achieve this goal).

Basically, we'd want to tell the system that if we have a goal of estimating variable A, and we have a rule involving a variable that unifies with A, the other variables in that rule become subgoals and we can get a better estimate for A by improving our estimates for them and applying belief prop to calculate A. This is a fairly messy statement, because it implicitly relies on an argument of convergence. Basically, the idea is that taking more rules into account tends to produce a better estimate... but this is not always the case, of course. The idea we're employing is something like "taking all the rules into account would result in the best estimate we could come up with, and chains of reasoning with more elements will tend to get closer to this..."
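A minimal sketch of that guidance rule in Python; the rule format and the unification test are stand-in assumptions.

    def subgoals(goal, rules, unifies):
        # A goal "estimate A" spawns subgoals for the other variables of
        # any rule whose head unifies with A; better estimates for them
        # feed belief propagation back to A.
        for head, body in rules:
            if unifies(head, goal):
                yield from body

    rules = [(('arrives', '?x'), [('punctual', '?x'), ('alive', '?x')])]
    same_pred = lambda h, g: h[0] == g[0]    # toy stand-in for unification
    print(list(subgoals(('arrives', 'sue'), rules, same_pred)))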

The line of thought from the basic facts to the appropriate action would be highly nontrivial. As a general principle, I think it's usually infeasible to plan an action from first principles when we need to decide what to do. This means (roughly) that the framework of utility theory, which tells us that we should take the action which maximizes the expected utility, is insufficient. In principle we can do with just declarative knowledge, but in practice, we need procedural knowledge.

Genifer currently uses declarative rules, which tell her how facts follow from other facts. What I'm suggesting is that we also use procedural rules, which would tell her what to do when. Procedural rules would, put together, resemble programs.

The idea here is that rather than deciding what to do for each new problem based on a planning algorithm, the system should always be in the process of constructing an overall policy, a program which it uses to make decisions. This reinforces the theme of automatic programming in Genifer.

The basic goal is to construct a set of rules which imply the highest success rate (the highest expected utility). Updateless decision theory tells us that we should calculate the expected success rate in our earliest prior distribution, that is, refusing to update on anything we learn. This technically loses nothing, because we can create rules "If you have learned such-and-such then ...", and the method solves a few problems with conventional decision theory. Notably, though, conditional decision theory (in which we look for the policy with the highest expected value given all that we know) works just fine in most cases, and would probably be faster to reason about.

Either way, the basic paradigm for converting from declarative knowledge to procedural knowledge is to search for rules with high expected utility. Unfortunately, the utilities of rules are interrelated, because we've got to know how we will behave in the future in order to know which action is best to take now. This problem has standard treatments in the reinforcement learning literature, however.

The basic idea, then: whereas a declarative rule is of the form statements -> statements, and the conclusion carries probabilities, an imperative rule is of the form statements -> actions, and the actions carry (estimated) expected utilities.
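As data, the distinction might look like this (a Python sketch; the field names are assumptions):

    from dataclasses import dataclass
    from typing import Tuple

    @dataclass
    class DeclarativeRule:            # statements -> statements
        conditions: Tuple[str, ...]
        conclusion: str
        probability: float            # the conclusion carries a probability

    @dataclass
    class ImperativeRule:             # statements -> actions
        conditions: Tuple[str, ...]
        action: str                   # something under direct control
        expected_utility: float       # the action carries an estimated EU

    wet  = DeclarativeRule(('rains',), 'wet', 0.9)
    grab = ImperativeRule(('rains', 'leaving'), 'take-umbrella', 0.7)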

What about goals and subgoals? It makes sense to think of a goal as a utility-bearing statement which is not under our direct control. We want to achieve this goal by our actions; an action is something which is under direct control. So if we generalize the above a bit, allowing imperative rules to be arbitrary "statements -> statements" rules, we get goals and subgoals automatically via the formula "subgoal -> goal." A top-level goal is a statement given unconditional utility (that is to say, utility not derived from any rule).

[Edit: this is a declarative rule subgoal -> goal,  which should be accompanied by an imperative rule trying(goal) -> trying(subgoal), or something to that effect.]

Top-level goals should have basic utility, which cannot be changed except by us. They may also have expected value, which is the sum of the basic utility and the additional utility caused by the future consequences of achieving that goal. Subgoals and actions will generally just have expected value. Basic utility can be thought of as the declarative, decision-theoretic measure; expected value, on the other hand, serves the purpose of our imperative measure. This is because expected value is a local measure which can be used immediately to make decisions. Essentially, the imperative program we follow emerges from our imperative rules via the rule of always taking that action which possesses the highest imperative value.

Fleshing out this solution will draw on ideas from reinforcement learning, and also may draw on ideas from influence diagrams (the Bayes Net answer to utility reasoning), causal Bayesian reasoning (though causal decision theory solves fewer problems than updateless decision theory!), and other areas.

[Edit: the declarative/imperative distinction I am making is somewhat messy. I should instead distinguish 3 classes of information: information about probability, information about utility, and information about expected value. "Information" takes the form of both rules ("B->A") and naked statements ("A").]

Monday, July 12, 2010

Feature Matrix

YKY has highlighted the modules essential to the Genifer prototype.
SeH has included many more features.  Feel free to suggest more!

Sunday, July 11, 2010

Eliminating my HOL "Meta-Rules"

HOL systems, as I've mentioned before, are not usually specified just as inference rules on statements. Instead, they are specified indirectly, as inference rules on sequents. A sequent is a set of premises and a conclusion, representing a valid argument in the logic. (This is represented symbolically in the form A, B, C,... |- D where A, B, C,... is the set of premises and D is the conclusion. The |- symbol is called the turnstile.) So, specifying a system via a sequent calculus is a slightly curious practice: rather than just specifying inference rules directly, we specify a system of reasoning which comes up with valid inference rules. Historically, this move to the meta-level is what led to the field we call "proof theory".

My previous approach was to preserve this multi-level structure, calling the sequents "rules" and the sequent inference rules "meta-rules" (since they are rules for creating rules). However, it would be more efficient to eliminate the extra level if possible.

In general, this may be a complicated process. Some sequent derivation rules have special syntactical restrictions, such as "... X and Y share no free variables ...". Also consider sequent rules like "If A,B |- C, then A |- (B->C)". It isn't obvious that there is a simple way to turn a specification of a system in terms of sequent derivation rules into a direct specification consisting of inference rules on statements.

Typical systems, though, will obey a few nice properties. First, adding more premises to a valid sequent will always result in a valid sequent. Furthermore, many sequent derivation rules just propagate along the premises from the sequents they start with to the sequents they produce. For sequents which do that, the job is easy. Consider an example:

"From A, B, C... |- P and D, E, F... |- Q conclude A, B, C... D, E, F... |- (P&Q)"


In this case, we'd just translate the whole thing to the rule "From P and Q conclude (P&Q)". Rules of this sort can be eliminated from the meta-level immediately, which is good (every little bit helps).

Unfortunately, not all cases are so easy. Many inference rules will almost pass along the premises untouched, but make some little modifications to the list. These modifications will usually be to remove specific premises, occasionally to add premises, and quite rarely to modify the internals of premises. The example above, "If A,B |- C, then A |- (B->C)", is one example of this: one premise is eliminated in the process of drawing the conclusion. Getting rid of meta-rules of this sort requires cleverness.

First-order natural deduction systems handle that particular example in a nice way: at any time during reasoning, we may make a new assumption. This can be anything. We keep the list of assumptions that have been made. Now, say we just concluded something, S. One deduction move we can now make is to remove an assumption from the list (call it A), and stick it on the beginning of S with an arrow: A->S is our new conclusion, which is true in the new context (ie, the old context minus one assumption).

This general idea could be extended to any rules which add or eliminate premises: they are seen as just modifying the list of assumptions.
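A minimal sketch of that bookkeeping in Python (formulas are plain strings; everything here is illustrative):

    def assume(context, a):
        # Any statement may be assumed at any time.
        return context + [a]

    def discharge(context, a, s):
        # From  context, a |- s   derive   context |- (a -> s):
        # drop the assumption and conditionalize the conclusion on it.
        rest = list(context)
        rest.remove(a)
        return rest, '(%s -> %s)' % (a, s)

    ctx = assume([], 'B')            # assume B
    # ... reasoning under the assumption yields C ...
    print(discharge(ctx, 'B', 'C'))  # ([], '(B -> C)')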

A practical implementation of this sort of idea for Genifer might be to engage in hypothetical reasoning via conditional queries: we make a query with a list of assumptions attached, to be used in addition to the actual facts. However, some issues may come up with respect to the difference between the classical conditional and the probabilistic conditional. I need to consider this more. (The approach is similar to my "rule-level reasoning", but rather than reasoning about rules using conditional queries, we reason about statements...)

This leaves rules which modify the sequents internally. The primary example of this is variable instantiation, but that's handled by unification! So, in summary, it appears to me that my "meta-level" is not needed in Genifer and can be eliminated with a little patience...

All of the above discussion was done with a translation of HOL Light in mind, but could apply to IG or an IG variant equally well.

Saturday, July 10, 2010

Prolog trick again

Abram, this is your example:


This is why it becomes a problem when you look at it from the proof-tree's perspective:

First problem:  failure to explain away

When you get from the top Q down to node 1, you already have an observed value for 1, which is already sufficient to produce a TV for Q.

When you get down to 2, you also get a baseline P value for "heavy rain", and supposedly it also produces a TV that can be propagated (via the Prolog trick, not standard belief propagation) up to Q.

When you get down to 3, it has an observed value; but when you try to propagate that upwards, it encounters 1, which itself has an observed value (= true).  It seems that it cannot override 1's value, so the effect of 3 gets drowned out by 1.

All in all, this failure is due to my treating the Bayes links as isolated links.

Second problem:  cyclic proof tree

In principle we should solve this by searching the tree above the node-to-be-added for duplicated nodes, but such a search may be costly, depending on the branching factor of the tree.

That's why, in the past, I simply allowed the tree to become cyclic.

One cheap solution is to check only the parent's parent for duplication. That will prevent "1st-degree" duplication, and may be efficient enough.
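The check is cheap enough to sketch in a few lines of Python (the node shape is an assumption):

    class Node:
        def __init__(self, formula, parent=None):
            self.formula, self.parent = formula, parent

    def safe_to_add(parent, formula):
        # Look only at the parent's parent: blocks 1st-degree duplication
        # such as Q -> rain -> Q, without a full ancestor search.
        grandparent = parent.parent if parent else None
        return grandparent is None or grandparent.formula != formula

    root  = Node('Q')
    child = Node('rain', parent=root)
    print(safe_to_add(child, 'Q'))      # False: would close a cycle
    print(safe_to_add(child, 'wind'))   # True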

Friday, July 9, 2010

New design philosophy

One trend I begin to spot is that people often have somewhat divergent desires for what features they want their AGI brain children to have.  So, maybe the philosophy of our project is to let many AGI developers draw from a common pool of algorithms, techniques, code, gadgets, etc, to DIY.  "No Child Left Behind - 2".

( "No Child Left Behind - 1" is that we cater to both Java and .NET platforms.  OpenCog takes care of C++ so we don't need to worry about that.  And, on the native scene there are some very good languages like Haskell and Erlang, but they don't have a large enough user base to warrant forking a 3rd branch.  )

Cobra implementation of Logic class

This is Chuck's code in Cobra.  I'm still examining it....
============================================================

http://www.pastey.net/138443-2v9q

Thursday, July 8, 2010

implementation: the "Logic" class

Definition of the Logic class:

Formula
    |---- Constant
    |        |---- Atom
    |        |---- Char, String
    |        |---- Number ------ Integer, Real, Bool, etc.
    |---- Variable
    |---- Apply-Operator
    |---- Apply-Functor


Atom:  {  string: name,  int: index  }                                 // atoms are eg: "john", "mary"
Variable:  {  int: index  }
Apply-Operator:  {  char: op,  Formula[]: arguments  }          // op applied to 1 or more arguments
Apply-Functor:  {  Atom: functor, Formula[]: arguments  }    // functor applied to 1 or more arguments
                                                                                      // functor can be function or predicate
                                                                                      // functor's name is just an Atom
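The same hierarchy sketched in Python, purely as a structural illustration (the actual implementations are in C# and Cobra):

    class Formula: pass

    class Constant(Formula): pass

    class Atom(Constant):                 # atoms are eg "john", "mary"
        def __init__(self, name, index=0):
            self.name, self.index = name, index

    class Variable(Formula):
        def __init__(self, index):
            self.index = index

    class ApplyOperator(Formula):         # op applied to 1 or more arguments
        def __init__(self, op, arguments):
            self.op, self.arguments = op, arguments

    class ApplyFunctor(Formula):
        # functor can be a function or predicate; its name is just an Atom
        def __init__(self, functor, arguments):
            self.functor, self.arguments = functor, arguments

    # (loves john mary)
    loves = ApplyFunctor(Atom('loves'), [Atom('john'), Atom('mary')])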

Definition of FOL terminology:

This set of terminology is standard for FOL:
  1. At the top level is the formula (or proposition, or sentence, or logic statement, all synonyms).
  2. A formula can be a rule or a fact.  A rule is of the form
         A <- X, Y, Z, ...
    And a fact does not contain the "<-" operator.
  3. Each formula consists of a number of literals joined by logical operators (aka connectives) such as AND and OR, and operations can be nested.  Note that NOT is a unary operator.  Eg:
    • (loves john mary)                                     is a literal
    • (NOT (female john))                                is a literal
    • (AND (loves john mary) (female mary))    are 2 literals joined by AND
    • (OR (AND (p X) (q Y))  (r Z))                  is nested
  4. Each literal consists of a predicate plus its arguments, eg:
    • (loves john mary)                    where love = predicate with 2 arguments
    • (male paul)                             where male = predicate with 1 argument
    • (female (daughter-of paul))      where female = predicate with 1 argument, "daughter-of" is a function symbol
  5. Terms (or arguments) can be:
    • variables   (eg:  X, Y, Z)
    • constants   (eg:  john, mary, 17, "String")
    • functions of other terms   (eg:  daughter-of(john),  sine(3.14))
That's all!

Our simplified terminology:

In anticipation of higher-order logic:
  1. The notions of predicate and function are combined into functors.
  2. We can also absorb functors into operators, but this is not currently done.
  3. In FOL, a function-application is not a formula, but this is relaxed in our definition.  Potentially it can lead to ill-formed formulas.  For example, sine(3.14) is clearly not a formula;  but we want to allow some functions that return formulas.
The result is more elegant, and we can express this by a set of re-write rules:
     formula ===> constant
     formula ===> variable
     formula ===> (functor formula1 formula2 ...)
     formula ===> (op formula1 formula2 ...)

More about rules

Rules are of the form:
     A <- B, C, D, ...
where A is the "head", B,C,D,... is the "body", and "<-" is just a special operator like "AND";  we can call it "COND" for conditional.
Each "," is equivalent to an "AND".  So, the above rule is represented as:
     (COND A (AND B C D...))
where "AND" can accept more than 2 arguments.

( In our logic, <- translates to a link in a Bayesian network.  For this reason, it seems not very meaningful for a formula to have "<-" inside the body instead of at the head position. )
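A tiny self-contained sketch of that encoding, with toy tuple constructors standing in for the classes above:

    def functor(name, *args):     # stand-in for Apply-Functor
        return (name,) + args

    def op(name, *args):          # stand-in for Apply-Operator
        return (name,) + args

    A, B, C, D = functor('A'), functor('B'), functor('C'), functor('D')
    rule = op('COND', A, op('AND', B, C, D))   # A <- B, C, D
    print(rule)   # ('COND', ('A',), ('AND', ('B',), ('C',), ('D',)))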

Some special cases:
  1. If a rule has no head, it is still a rule, such as:
         (interesting X)    means "everything is interesting"
    -- note that all variables are implicitly universally quantified, ie, the above is really:
         for-all X,  interesting X
  2. We can ignore the case where a rule has no body (it's not useful).
  3. A rule without variables is still a rule, eg:
         wet <- rains
         not cloudy <- sunny
Representing variables and constants

Notice that we can use any way we want to represent variables and constants internally.  In my Lisp code, variables are like "?1" etc and constants are "john" etc.  In a more sophisticated KB, we would of course use some kind of indexes.

Code

This is Russell's C# code, and Chuck's Cobra code.  When in doubt, please conform to the definition above.

basic English

This is a basic English lexicon for talking about things in programming.  It seems to get quite large as I think of more and more words to add to it, and I'm not sure if it's too big for the prototype to handle.  For the grammar we can modify the simple English grammar in PAIP.

nouns:
function, program, procedure
number, integer, real
memory, variable, pointer, constant, value, content
data, structure, input, output
list, queue, tree, stack, node, link
loop, iteration

verbs:
add, subtract, multiply, divide
increase, decrease, increment, decrement
move, link, compare
create, delete, update, change, fix, set, assign
open, close, enter, exit
input, output

adjectives:
free, open, closed
fast, slow, simple, complex, hard, easy
big, small, long, short, deep, shallow, broad, narrow

prepositions:
on, by, to, in, of, for, with, within, without, against

conjunctions:
if, then, while, for
and, or, not

pronouns, etc:
I, you, he, she, it, they, them, him, his, her, mine, yours
who, what, where, when, how
please, okay, thanks, bye!

Tuesday, July 6, 2010

reasons for using .NET

Seems that the good candidate languages are only C#, F# and Cobra.

.NET lacks usable Prolog / Lisp / Haskell implementations (the JVM doesn't have Prolog / Haskell either).

reasons for using Java/JVM

Clojure -- .NET lacks a good Lisp implementation.

Scala is OO + functional.

question: is Jython good/fast?

Monday, July 5, 2010

HOL update

YKY doesn't want to focus on HOL at the moment, but he suggested I write down my thoughts anyway. I was going to put it off since I'd really want to do a more detailed write-up taking care of all the loose ends, but since there is some interest from Russell, I'm outlining my approach now.

The approach is to translate system IG from the paper Illative Combinatory Logic. This is an untyped lambda-calculus logic which soundly & completely interprets first-order logic both "directly" (ie, as truth functions) and via the Curry-Howard isomorphism (which indicates the strength of the type reasoning the system can interpret).

The way I'm planning on translating it is to treat a statement and its negation as essentially separate statements in the probability distribution; ie, the Liar sentence can have probability 0, but its negation can simultaneously have probability 0. Only when we know a statement is a proposition can we derive that P(prop) = 1 - P(~prop).

After this, what we do is just encode the valid sequents as rules in our existing logic. However, IG has an infinite number of valid sequents, specified with some sequent derivation rules. This means that we've got an infinite number of rules now-- we need an efficient mechanism for constructing valid rules as they are needed. The sequent derivation rules get turned into "meta-rules" in our logic. This is a bit complicated, but better than the alternatives (see my previous post).

One complication is that sequent-based rules need to be combined with something like the noisy-or combination rule [I need to add a reference...]. At present we're planning on dealing with this by having noisy-or be the default combination rule in the whole system; there are some cases where it is a terrible rule, though. YKY thinks we can determine which cases to apply it to using causal reasoning.

For efficient reasoning, we'd like to normalize terms before reasoning so that we can quickly recognize equality. However, in untyped HOL, some terms have nonterminating normalizations. There are a few things we can do. One is to just have a time cutoff on normalization attempts. Another is to try and do some type reasoning before normalization (so we'd only normalize when we could first prove termination). A third option that might have some interesting properties is what I'm calling "compressive normalization"... rather than the usual normalization, use the shortest form of a term as the normal form. This is non-terminating and we'd still need a time cutoff, but unlike the usual normal form, it exists even for terms which aren't propositions-- it just always exists. It might be useful to keep short forms of terms around for compression purposes.
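A sketch of the cutoff idea, with compressive normalization as the fallback; reduce_step is a hypothetical single-step reducer (returning None at a normal form), and the toy example just shortens strings:

    def normalize(term, reduce_step, max_steps=1000, size=len):
        shortest = term
        for _ in range(max_steps):
            nxt = reduce_step(term)
            if nxt is None:             # reached a genuine normal form
                return term
            term = nxt
            if size(term) < size(shortest):
                shortest = term         # best compressed form so far
        return shortest                 # cutoff hit: shortest form seen

    toy = lambda t: t[:-1] if len(t) > 3 else None
    print(normalize('abcdefg', toy))    # 'abc'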

Speaking of compression: integrating both HOL and likelihood reasoning into Gen means that it might be feasible to specify a prior probability distribution over models within Gen itself, meaning induction just becomes probabilistic deduction. However, I have not worked out the details here.

That's all for today! :)

Thursday, July 1, 2010

HOL, Paradox Free: a plan of attack, or two.

Looking at the Illative Combinatory Logic paper today, and thinking about paradox-free HOL.

There are plenty of paradox-free HOLs out there. The question is, how do we translate them into our system? Most HOLs seem to be described in sequent calculus. This means the system has an extra meta-level that ours lacks: it's not really specified as a system for reasoning about HOL statements; rather, it's specified as a system for reasoning about HOL proofs. This means there are at least 3 levels at which the translation can be made: (H) we could try to use the sequent derivation rules as rules in our system, (M) we could use the sequents themselves as rules in our system, and (L) we could use the conditional statements of the HOL as our rules. (I'm labeling these as High, Medium, and Low, in case you're wondering what the letters stand for.)

H might be easiest in that it would only require a finite number of rules to be added to the system. However, it would mean making room for sequents as a type of statement in the system, which could be awkward. We would still want to be able to make basic statements, ie, statements about actual things rather than just about what follows from what... so we would need to also add rules for using sequents to derive base facts from one another. On top of all this, we need to worry about how the probabilistic reasoning is done. It seems that the sequent reasoning would end up being crisp, which means the probabilistic reasoning would only happen when we applied sequents to uncertain base-statements. Again, this seems awkward.

M would require an infinite number of rules, meaning that we'd have to implement rule-level reasoning to derive new rules as they are needed. I discussed rule-level reasoning before; the additional sequent-derivation rules would have to be integrated with my earlier thoughts. For the most part, I think this would mean restricting my earlier suggestion to make sure that it's consistent for the same reasons the sequent rules are consistent. Figuring out the maximum level of rule-level reasoning I can preserve may be difficult. Figuring out the minimum level that's sufficient probably won't be, though! What would be somewhat difficult would be generalizing the backwards-chaining reasoner to create new rules for itself to backwards-chain on (in a sorta meta-backwards-chaining way).

L would require a doubly infinite number of rules (so to speak). The approach would again require rule-level reasoning, but this time rather than implementing a finite number of meta-rules (based on the rules for deriving new sequents) we'd appear to need an infinite number, specified by a finite number of meta-meta rules. This is obviously too much; so, we'd have to pull back and instead figure out a system which satisfied the constraints laid down by the sequent rules. In other words, we treat the sequent-calculus description of the logic as a specification of target features rather than a description of the deduction rules of a system. This is obviously a more difficult approach... an interesting challenge, but more difficult.

Also worth a mention: a tempting, but probably incorrect direction: merge the levels so that we don't have to deal with so many. We'd start allowing conditionals to compose with other language elements, like in regular FOL, so that we can have stuff like A->(B->C). We'd then try to use regular reasoning to implement all the different levels at once. This sounds powerful, but it would probably just create an inconsistent system, defeating the point.

Conclusion

Approach M seems to be the best plan, so I will get to work on that!

Porting Genifer LISP to Scala or Clojure, July 1st

Per YKY's instructions, I am getting ready to port Genifer's existing LISP code located at...

http://launchpad.net/genifer

...to either Scala or Clojure, JVM languages which can inter-operate with Java.  In doing this I hope to provide a framework that will ultimately unify AI libraries like Genifer, dANN, Neuroph, and others.


_KY_: The porting is 3 steps (plus 2 optional ones):
  1. unification algorithm
  2. best-first search
  3. substitution management
  4. optional: [ forward chaining ]
  5. optional: [ abduction ]

YKY recommends that I review:
Code will be committed to Genifer's Google Code repository.