Tuesday, June 29, 2010

Prolog Trick

My misunderstanding about the nature of the prolog trick has been cleared up a bit, and it is time for an analysis of just what the idea is.

Previously, I thought that the basic idea was simply to construct Bayesian Networks incrementally with a backward-chaining type search inspired by Prolog. However, this is just a part of the idea. The more important thing is the way the probabilities are calculated incrementally as the search progresses. Here's how it works if we support only deduction, not abduction (ie, we only follow baye's net arrows in one direction):

We have a query node, call it A. We might find B&C->A, making new query nodes B and C. We then might find X&Y->C, and find X, and find Y. At this point we have enough information to propagate forward from X and Y to C. We check whether we have enough to propagate even further, but we do not, so we continue the network construction. We could then find B, which would give us enough to propagate again; when we do so, we get A, so we are done.

"Finding X" here means finding a stored probability for the statement X, and propagating along X&Y->C means using the conditional table stored in that BN link to compute a probability distribution for  C.

Now, the interesting thing is, what we're doing here is a single pass of the belief propagation algorithm. A single pass like this is what you do if you only want to marginalize one variable, and of the network you've got is a tree. If you want to marginalize all the variables in a tree, all you need is a second pass, going in the opposite direction; this finishes off the computation for all the other variables with just twice the work. But, no need for twice the work if we only want the one!

If the network isn't a tree, then belief propagation is no longer guaranteed to be correct, but it's been shown that multiple passes (keep doing passes until the numbers appear to converge) can generate good approximations of the correct probabilities in practice. (The exact conditions in which this happens are not well-understood). The prolog trick still just does a single pass for non-tree-like situations, so (in a not-well-understood subset of cases) we should expect worse approximations than we'd get with a bunch of passes. However, one pass will obviously be quicker than a bunch, and it seems reasonable (to me) to expect somewhat graceful degradation.

In any case, here is how you do it. Suppose that node C is connected to rules C&D->A, C&E->B, and F&G->C. We know E, B, F, and G, and from that want to compute C. Suppose also that A is the node which C was invoked in order to estimate, so (via the rules of belief propagation) we ignore it in coming up with our estimate of C. So, then, how do we calculate C?

The basic idea is that we regard the messages coming down as priors and the messages coming up as likelihoods, so (applying Baye's rule) we just multiply them together to get the posterior. That is: the link F&G->C combined with the probabilities F and G will generate a probability for C, call it the "prior" (but keep in mind we're just temporarily treating it as a prior; it's not the prior of C in any long-term sense; indeed, it's an estimate given some evidence). This is a list of C's values and probabilities for each. Using the link C&E->B plus our knowledge of E and B, we can generate a likelihood function: a list of C's values with a probability of the evidence we have concerning B for each. I am going to grind in basic Baye's Law stuff for a sec, in case any readers don't "get it" yet: this is not a probability function. It does not necessarily sum to 1. It is a likelihood function, which means it measures how well different possible values of C would explain the evidence. We multiply it component-wise by the prior on C and normalize the vector to obtain the updated estimate on C (the posterior).

An example. Let's just suppose all these variables have just two states, true and false. Suppose F&G->C stores the following table of probabilities for C being true, given combinations of F and G:

F and G:.1
F and not G: .2
not F, but G: .3
neither F nor G: .4

Then, suppose that we find G and F both have probability .5. We calculate the "prior" for C being true as .025 + .05 + .075 + .1 = .25, with the value for C being false coming to .75 similarly (or by just taking 1-.25).

Now, suppose C&E->B stores the following probabilities for B:

C and E: .6
C but not E: .7
not C, but E: .8
neither C nor E: .9

Further suppose we know that B and E have probability 1. We calculate the likelihood for C being true as .6, and the likelihood for C being false as .8. [The likelihood of C being true will be: ((the probability of B being true supposing C is true and using our probability for E being true or false) times (the probability we had for B being true to start with)) plus ((the probability of B being false supposing C is true and using our probability for E being true or false) times (the probability we had for B being false to start with)). The likelihood for C being false will be the same replacing "supposing C is false" with "supposing C is true".]

Now, we multiply the two vectors together and normalize! The prior is [.25, .75] and the likelihood is [.6, .8]. Multiplying, we get [.15, .6]. Normalizing, we get [.2, .8], which is our final message for C which we'd pass along to help construct a probability for A.

To sum up: adding abduction to the prolog trick calculations can be accomplished via the belief prop rule, which says to multiply the deductive value (what I called the "prior" value) by a likelihood vector, and then normalize the resulting vector.

Saturday, June 26, 2010

Programming language choice

I'm very sick of the prog lang debate -- we seem to be spending 70% of our time on it instead of AGI!

Currently our strategy is "mixed language".

Acceptable languages are the "standard" ones:

Standard version .NET? Java?
Lisp/Scheme IronScheme Clojure
Prolog -- --
ML/OCamlF# OCaml-Java
Haskell -- --
-- -- Scala
Python Cobra Jython

where "--" means under development / coming soon / still immature.


Careful Ascent

YKY has suggested that the strategy will be as follows:

  1. 1. start with horn clauses, replacing classic conditional with probabilistic. This relatively simple to reason about, even including abduction and lifted reasoning, with a bayes-net construction algorithm that YKY calls "prolog trick".
  2. Add in negation. If we were dealing with classical conditional, we'd now have full FOL clauses-- so we've got what can be thought of as a probabilistic generalization of FOL clauses. The extra complication to the Bayesian networks is trivial (and doesn't need to fall back on negation-as-failure), but I want to examine some less obvious complications.
  3. Where the "prolog trick" uses FOL resolution (to minimally instantiate a rule so as to apply to the present case), put in HOL resolution so that we can start using HOL predicates and probabilistically generalizing over HOL variables. (The second one there is exciting.) This also seems simple, except for the bit about having to implement HOL unification. Again, I'll see about complications.
Complications with #2

Completeness for FOL and completeness for horn-clauses are, to an extent, different beasts. Horn-clause reasoning is complete with respect to the goal of establishing "concrete" facts (statements which do not include alternatives). FOL completeness, in contrast, wants to be able to reason about truth values for any statement that the system can express.

It isn't hard to apply FOL resolution to Horn clauses to deduce more rules which are consequences of the existing ones (in fact, it's quite easy). It's also not hard to allow conditional queries to the Bayesian networks which we'll be constructing, which would allow us to deduce new rules from existing information.  My worry, though, is that we'll be able to have two rules whose information cannot be integrated by the system.

An example:

A1: P(C|A&B)=1
A2: P(~B|C)=1

We should be able to deduce P(~B|A)=1.

C1: P(A&B&C)=P(A&B) [from A1]
C2: P(B|C)=0 [from A2]
C3: P(B&C)=0 [mult. C2 by P(C)]
C4: P(A&B&C)=P(A|B&C)*P(B&C)=0 [decompose P(A&B&C) and then apply C3]
C5: P(A&B)=0 [from C1, C4]
C6: P(B|A)=0 [divide by P(A), assuming it's not 0]
C7: P(~B|A)=1

Let's call this "rule-level reasoning". YKY has already asserted (private email) that this sort of reasoning is needed for abduction, although I disagree with him there: I think abduction should be handled by the standard Bayes Net belief propagation, rather than by reversing the direction of a rule via probabilistic reasoning. This means that I think, when we do "prolog trick" construction of the baye's net, we don't just look for rules which have consequences which are nodes in our current network; we also look for rules whose premises are in our current network. When we glue those rules onto the network, the consequences are added as uncertain variables, which means we now try to determine their truth. If we find any evidence one way or the other, it gets added to the network, and the network starts doing abductive reasoning automatically via belief propagation.

My impression is that YKY prefers the idea of looking for rule candidates in basically the same way, but before adding the rule to the network, use rule-level reasoning to change its direction so that all reasoning is deduction. I don't see the point of this, but I do agree that rule-level reasoning is needed when we can't paste a rule into the network without creating a directed cycle.

How should rule-level reasoning be implemented? I think the right way of doing it is just to implement the possibility of conditional queries (a standard belief propagation task, which would complicate "prolog trick" network construction just in that we'd be looking for links between the query variables). When we want to perform rule-level reasoning, we just make a query that is in the form of the rule we want to construct. The resulting probability is the conditional probability for the rule! Easy as pie, and we don't have to worry about all the probability algebra stuff I did in the explicit derivation above. (This gets approximated by belief propagation.)

This does not quite tidy up the completeness concerns. Essentially, I want to know that the reasoning is (an approximation of) complete reasoning given the Kolmogorov axioms and the semantics for our probabilistic generalizations, ie, "the probability of a statement with unbound variables is the probability that it will be true under a random instantiation."

Bayes net reasoning (belief prop) is approximately complete for propositional Horn-with-negation networks so long as the rules form no directed cycles. (It is precisely complete if there are no cycles at all. Actually, it's also been proven to be precise for networks with single cycles. The precise conditions for belief prop giving exact answers are unknown.) With rule-level reasoning, it should be complete for rule-sets with (correctable) directed cycles as well. If that were established rigorously (showing for example that this sort of reasoning will not accidentally use a rule twice in different forms and get the probabilities wrong as a result), then what would remain would be precisely the question of whether our reasoning system is complete over our definition of probabilistic generalizations.

An approximate reasoning method that would be "obviously complete" would be to actually take random instantiations and use the frequency of their truth (ie, their average probability) to decide the probability of the generalization. YKY has proposed essentially this technique for the estimation of a generalization's probability. However, we also want to do lifted estimation: deduce the probability of a predicate as a marginal probability in the lifted network constructed by instantiating rules only enough to apply to the predicate. This lifted estimation is certainly correct under the assumption that there is no other relevant information, but certainly not correct for all cases. At present we have no formalism for combining the lifted reasoning with random sampling or other sources of information.

We need one.

(Tackling this will probably be the topic of my next post.)

Moving on to HOL...

Complications with #3

My reservations here arise essentially from my desire to take the "descent" approach I mentioned in my last post: start with system IG, an untyped HOL, and create a "prolog trick" reasoning system compatible with that. YKY's approach is an "ascent" path, which catches me off-guard as to how we should approach soundness and completeness.

HOL is never classically complete, but it can be complete in various weaker senses. A popular option is to use Henkin semantics rather than classical semantics. IG is known to be sound and complete w.r.t. its FOL subset, which is not quite as good; as far as I know, it's Henkin soundness and completeness is an open question. Another appealing approach would be to show soundness and completeness with respect to an interpretation of ZFC, typed HOL, or another standard system. We could investigate other untyped HOLs to find systems complete in stronger senses if we wanted.

That's all for today. :)

Why "uncertainty on top of logic" is bad

I have made this point before, but this is a better way to put it:

If we put "logic on top of uncertainty", then the uncertainty calculus operates underneath the logic and the logic cannot "see" its workings, ie, the uncertainty calculus is invisible from the logic's POV.  This is an advantage, as we can make queries about logic statements and the uncertainty calculus will work "behind the scene" -- we don't have to worry about it at the logical level.

Contrast this with "uncertainty on top of logic".  Now every statement in the KB is shrouded within a layer of uncertainty calculus.  We cannot query statements directly, because some nodes' TVs will be buried in probabilistic functions as their arguments.  Seeking the TV would involve finding the inverse of those probabilistic functions -- which can be extremely complicated for a logic to handle automatically, ie:
     given          X = f(A, B, C, D, ...)
     where          f  is the probability function
     solve for     A     in terms of X, B, C, D, ...

Friday, June 25, 2010

Use cases

It's kind of hard to give detailed use cases, so I'll try to explain the main principles.

How can Genifer be taught?  A user can ask a query and force an answer.  Then Genifer will generate a number of hypotheses in the KB that enable to infer that answer next time.  Since some of these hypotheses can be wrong, the user should provide multiple examples to reduce the chance of wrong but accidentally correct hypotheses.

For example:
[ Sorry that this example is not related to programming... it's easier to think of a daily-life example.  But I'll try to use programming examples as much as possible. ]

    User:  "Given:  John eats spaghetti with Ben.  Conclude:  Ben eats spaghetti"
    User:  "Given:  John eats spaghetti with Paul.  Conclude:  Paul eats spaghetti"
    User:  "Given:  John dances with Mary.  Conclude:  Mary dances."

    User:  "Given:  John eats spaghetti with meatball.  Query:  Does meatball eat spaghetti?"
    Genifer:  "Yes."
    User:   "That's incorrect."

    etc  etc ....

After the above session, an example rule that Genifer will have learnt is:
    A VerbPhrase with B /\ B is a person -> B VerbPhrase
Notice that this rule may still have exceptions, but it is a useful rule.


Thus, inductive learning allows even lay people to teach Genifer.  This is very important in our strategy to let online users contribute to the KB.


Some more examples with the word 'with':

  • "factorial n" matches with "factorial 6"              (pattern matching in Haskell)
  • The function is evaluated with n = 6
  • We can surround the function name with back-quotes
  • [x,y,z] is a list with 3 elements
  • Lists can be built with the : operator.

What needs to be taught?


Two areas:
1.  semantics of basic / restricted English (with a vocabulary restricted to programming)
2.  knowledge about programming

Possible first application areas:

  1. Find-and-replace, eg: "Find names of variables"
  2. Style checking (a good candidate because it only requires inference and pattern matching, no acting)
  3. String manipulation, eg: "Convert to camel case"
Sorry that this is still very vague...  I have not figured out everything myself and need more time to think on it...

Thursday, June 24, 2010

Probabilistic HOL

Where We Are
The formalism we will be using for the horn-clause subset of first-order probabilistic reasoning will be something close to the following. (Propositions may end up being either crisp or fuzzy or both in the initial version. For stating crisp propositions, we have predicates which are boolean predicates, and boolean truth functions (at least and) to make sentences. For stating fuzzy propositions, we have continuous functions which are predicates, and fuzzy truth functions (at least and) for combining them. The choice between the two will not matter in what follows.)
  1. Predicates, or more generally sentences with free variables, can have probabilities associated with them. These represent the probability that they will be true on random arguments (random instantiations of their variables). I call these "unconditional probabilities" since the system will also store conditional probabilities. They are also called "marginal probabilities." We can specify ahead of time this prior probability for some predicates, and explicitly conditional probabilities, elsewhere. Together, these probabilities can be thought of as the probabilistic version of universal quantification that will be employed.
  2. Existential quantification could be represented by allowing statements to the effect that the probability of a predicate is nonzero; however, we will be using skolem functions in any case, and it seems (to me) more reasonable to always represent existential quantification with skolem functions. (Eliminating zero from the range does not eliminate places arbitrarily close to zero, so the meaning of such a statement is unclear to me. Perhaps existential statements could be generalized to say "the probability is not below x" for typically-small x, but that introduces interval probabilities.)
  3. Conditional probabilities will be employed, which will allow the construction of Bayesian networks (along with the unconditional probabilities, which get used for the initial nodes in the network). Predicates, variables, constants, skolem functions, &, and a conditional put together give the capability to express Horn clauses. Conditional assertions play the role of rules, and unconditional assertions (which can be seen as just assertions with empty conditions) play the role of ground information which the rules act upon; note, however, that since unconditional assertions can still have variables, lifted inference (inference which reaches general conclusions rather than only ground conclusions) can take place via the same algorithm that would do the ground inference. Lifted inference can be approximated to get the marginal probabilities for predicates and partial instantiations of predicates which we might otherwise not have marginal probabilities for (ie, when we did ground reasoning later we'd be forced to instantiate more bayes net than we might want to for efficiency, because some predicate needs more information to generate a probability estimate for it).
 Where We May Go:
We will eventually want to move on to full first-order logic (FOL) and higher-order logic (HOL). These are some speculations on how.

  1. In the resulting system, skolem functions and skolem constants are treated as always existing, but not necessarily meaningful: they may not be equal to any entities in the domain that we are actually interested in, ie, they may have an "undefined" value. (This happens when they exist in an uncertain generalization; if the generalization turns out to be false in some particular case, the skolem-function entities are "stranded" and do not take place in further meaningful reasoning.) This is consistent with  the HOL approach we want to take, ie, resolving paradoxes by accepting that some expressions have an undefined value.
  2. For FOL, it is sufficient to generalize the above treatment from Horn clauses to general clauses.  If we add negation to the mix, then we've got general FOL clauses on our hands. This is the case even if we give up conditionals, though, which may indicate that we're adding too much power; we want most of the reasoning to stay within probabilistic conditionals so that it can be performed by Bayes nets. Restricting this, we can allow negated predicates but not arbitrary negations of larger compounds. This still corresponds to general clauses in the crisp (p=1) case, and Bayesian networks can easily handle negation of individual variables. However, it becomes unclear that the same reasoning method is complete for the expanded logic. So, I need to think about this more, as well as the several other ways to generalize away from strict Horn clauses.
  3. Right now, I am trying to come up with a higher-order logic without going into higher-order probabilities. (2-level PZ is not what I'm worried about; I'm worried about unbounded levels.) Essentially, what's desired is probabilistic reasoning about HOL. This means that although we may speak of a HOL statement having a probability as a value, it does not have that "value" in the HOL sense. Otherwise, the existence of higher-order functions would practically force us to encode higher-order probabilities. Maybe this is the right way to go in the long run, but for this round I'm trying to come up with a relatively simple system...
  4. The "ascent" strategy for designing the HOL is to first decide on the best way to generalize the present formalism to FOL, and then perhaps start considering SOL, TOL, et cetera until the pattern is clear enough to generalize to HOL. The "descent" strategy is to skip even the full FOL step and instead just try to figure out how to apply lifted bayesian network style reasoning to HOL directly. My approach is somewhere in between, trying to think about both in parallel so that I might apply constraints from the top to the bottom, and from the bottom to the top, until both meet.
  5. One "descent" constraint is the idea that expressions do not always have truth values as their values. Predicates are just functions that map to truth values, and there are other types of functions. Bayesian networks can certainly deal with this! It might be good to use type reasoning to determine the sets of possible values for the bayes-net variables. (Even untyped HOL does type reasoning, after all.) This is not a difficult one.
Lots more to think about... but I might as well post this now.

Planning and inference

Today I'm studying AIMA's chapter on decision networks and trying to extend Genifer's inference algorithm to cover decisions and planning.

It seems that we need to add 2 new types of nodes to the Bayes net:
1.  utility nodes -- these are new
2.  decision nodes -- represent possible actions that can be taken

One drawback of this approach noted in AIMA is the lack of advanced planning techniques.