From A(ristotle) to B(ayes)

September 2019

One way for a logician to penetrate the vast realm of Machine Learning is to observe that probability theory seems to be a good generalization of propositional logic: the logical values true and false correspond to probabilities of 1 and 0 that certain propositions "occur" in a given universe.

Truth As a Probability

A probability space is defined as a triple (Ω, A, ℙ), where Ω is the universe, A is a σ-algebra over Ω and ℙ : A → [0,1] is a probability function. These concepts intuitively map to model theory as follows:

Probability Theory    Model Theory
Ω                     set of atomic propositions
A                     set of models
ℙ                     interpretation function

As an example, one could (partially) define the following probability function on a very simple knowledge base:

ℙ({sibling(alice, bob)}) = 1
ℙ({sibling(bob, alice)}) = 1

Probability theory is based on three axioms, from which classical properties of logic, such as De Morgan's laws or modus ponens, can be derived. These axioms are the following:

  1. 0 ≤ ℙ(a) ≤ 1 for every event a ∈ A
  2. ℙ(Ω) = 1
  3. ℙ(a ∪ b) = ℙ(a) + ℙ(b) whenever a and b are disjoint events
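
To make this concrete, here is a minimal Python sketch (mine, not taken from any library; representing events as subsets of a small finite universe is an assumption of the example) that checks the three axioms numerically on a toy distribution:

from itertools import combinations

# Omega: a small, finite universe of elementary outcomes ("worlds").
omega = frozenset({"w1", "w2", "w3"})

# Probability of each elementary outcome; events are subsets of omega.
weights = {"w1": 0.5, "w2": 0.3, "w3": 0.2}

def prob(event):
    # P(event) for an event given as a subset of omega.
    return sum(weights[w] for w in event)

def events(universe):
    # Every subset of a finite universe (the sigma-algebra A in the finite case).
    return [frozenset(c) for r in range(len(universe) + 1)
            for c in combinations(universe, r)]

# Axiom 1: 0 <= P(a) <= 1 for every event a in A.
assert all(0.0 <= prob(a) <= 1.0 for a in events(omega))

# Axiom 2: P(Omega) = 1.
assert abs(prob(omega) - 1.0) < 1e-9

# Axiom 3: P(a ∪ b) = P(a) + P(b) whenever a and b are disjoint.
a, b = frozenset({"w1"}), frozenset({"w2", "w3"})
assert abs(prob(a | b) - (prob(a) + prob(b))) < 1e-9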

The knowledge base above would surely include a rule stating that sibling is a symmetric predicate, implying dependency between propositions (or events):

sibling(X, Y) :- sibling(Y, X) .

This rule is encoded as conditional probabilities, that is:

ℙ({sibling(bob, alice)}|{sibling(alice, bob)}) = 1
ℙ({sibling(alice, bob)}|{sibling(bob, alice)}) = 1

Conditional probabilities are defined as follows:

ℙ(b|a) = ℙ(b ∩ a) / ℙ(a)

By stating that ℙ(b|a) = 1, we condition the truth of b on that of a, as in b :- a. If ℙ(a) = 0, then ℙ(b ∩ a) = 0 and ℙ(b|a) is undefined (its definition would require a division by zero), so a knowledge base asserting ℙ(b|a) = 1 is not satisfiable. On the other hand, if ℙ(a) = 1, then ℙ(b ∩ a) = 1 and b is "certain".
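
As an illustration, here is a minimal sketch of this ratio on a toy joint distribution (the numbers are made up for the example):

# A toy joint distribution over two propositions a and b (numbers made up).
# Keys are pairs (a_holds, b_holds); the values sum to 1.
joint = {
    (True, True): 0.5,
    (True, False): 0.3,
    (False, True): 0.0,
    (False, False): 0.2,
}

def prob(event):
    # P(event) for an event given as a predicate over (a_holds, b_holds).
    return sum(p for outcome, p in joint.items() if event(*outcome))

def conditional(event_b, event_a):
    # P(b|a) = P(b ∩ a) / P(a); undefined (raises) when P(a) = 0.
    p_a = prob(event_a)
    if p_a == 0.0:
        raise ZeroDivisionError("P(a) = 0: the conditional probability is undefined")
    return prob(lambda x, y: event_b(x, y) and event_a(x, y)) / p_a

# P(b|a) = P(b ∩ a) / P(a) = 0.5 / 0.8 = 0.625 for this toy distribution.
print(conditional(lambda x, y: y, lambda x, y: x))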

After translation to a probability space, logical inference becomes simple arithmetic. In particular, modus ponens becomes an application of the law of total probability:

ℙ(b) = ℙ(b ∩ a) + ℙ(b ∩ ā) = ℙ(b|a)⋅ℙ(a) + ℙ(b|ā)⋅ℙ(ā)
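
As a quick sanity check, here is a small sketch (the helper name is mine) of modus ponens reduced to this arithmetic. Encoding the rule b :- a as ℙ(b|a) = 1 and the fact a as ℙ(a) = 1 forces ℙ(b) = 1, whatever ℙ(b|ā) may be:

def total_probability(p_b_given_a, p_a, p_b_given_not_a):
    # P(b) = P(b|a)⋅P(a) + P(b|ā)⋅P(ā), with P(ā) = 1 - P(a).
    return p_b_given_a * p_a + p_b_given_not_a * (1.0 - p_a)

# Rule b :- a encoded as P(b|a) = 1, fact a encoded as P(a) = 1: the second
# term vanishes, so P(b) = 1 whatever P(b|ā) is -- modus ponens as arithmetic.
print(total_probability(p_b_given_a=1.0, p_a=1.0, p_b_given_not_a=0.3))  # 1.0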

Negation can also be expressed as probabilities equal to 0. However, the axioms of probability theory enforce neither the closed-world nor the open-world assumption; in the latter case, zero probabilities must be stated explicitly.

Learning from Experience

In a simplified view, our experience of the world is a series of propositions (or events) that we interpret as true or false. Rules are not directly perceived; they are learnt from correlations between events, interpreted as causation.

The cornerstone of the theory of learning probable causes is Bayes' theorem (which can be easily derived from the definition of conditional probabilities):

ℙ(a|b) = ℙ(b|a)⋅ℙ(a) / ℙ(b)

Given two propositions that are both certain, Bayes' theorem tells us that the conditional probability of one given the other is the same in both directions: either could equally well be the "cause" of the other. If we go back to our previous example with bob and alice being siblings, stating the symmetry of the predicate may seem redundant.
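
As a sketch (with illustrative numbers of my own), the theorem is a one-line computation that reverses a conditional probability; when ℙ(a) = ℙ(b), both directions coincide:

def bayes(p_b_given_a, p_a, p_b):
    # P(a|b) = P(b|a)⋅P(a) / P(b); undefined (raises) when P(b) = 0.
    if p_b == 0.0:
        raise ZeroDivisionError("P(b) = 0: the posterior is undefined")
    return p_b_given_a * p_a / p_b

# When a and b are both certain, the reversed conditional is also certain.
print(bayes(p_b_given_a=1.0, p_a=1.0, p_b=1.0))  # 1.0

# More generally, when P(a) = P(b), the conditional is the same in both
# directions: P(a|b) = P(b|a).
print(bayes(p_b_given_a=0.9, p_a=0.4, p_b=0.4))  # ~0.9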

However, when a knowledge base includes other individuals with incomplete information, uncertainty arises in the absence of rules. For example, consider the following propositions:

ℙ({sibling(alice, bob)}) = 1
ℙ({sibling(bob, alice)}) = 1
ℙ({sibling(carol, dave)}) = 1
ℙ({sibling(dave, carol)}) = 1
ℙ({sibling(eve, faythe)}) = 1

A classical inference engine would entail the proposition sibling(faythe, eve), provided the symmetry of the sibling predicate is stated as a rule. Here, however, in the presence of the propositions alone (with no rule), the value of ℙ({sibling(faythe, eve)}) is undefined.
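
As a quick check (a sketch using the symmetric closure of the facts as a stand-in for a full inference engine), adding the rule indeed makes the missing fact derivable:

# Ground facts of the knowledge base above.
facts = {("alice", "bob"), ("bob", "alice"),
         ("carol", "dave"), ("dave", "carol"),
         ("eve", "faythe")}

# Apply the rule sibling(X, Y) :- sibling(Y, X) once: since the rule is its
# own inverse, a single pass already yields the full deductive closure.
closure = facts | {(y, x) for (x, y) in facts}

print(("faythe", "eve") in closure)  # True: the fact is entailed with the rule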

Applying Bayes' theorem to the missing proposition gives:

ℙ(sibling(faythe, eve)|sibling(eve, faythe)) = ℙ(sibling(eve, faythe)|sibling(faythe, eve)) ⋅ ℙ(sibling(faythe, eve)) / ℙ(sibling(eve, faythe))

Generalizing over the observed pairs, we can statistically estimate the following:

ℙ(sibling(X, Y)|sibling(Y, X)) = ℙ(sibling(X, Y) ∩ sibling(Y, X)) / ℙ(sibling(Y, X)) ≈ Σᵢ ℙ(sibling(xᵢ, yᵢ) ∩ sibling(yᵢ, xᵢ)) / Σᵢ ℙ(sibling(yᵢ, xᵢ)) = 4/5

This is an a priori estimation.
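
Here is how such an estimate could be computed (a sketch of mine; counting over the observed ordered pairs is an assumption about the estimation procedure):

# Observed ground facts from the knowledge base above.
facts = {
    ("alice", "bob"), ("bob", "alice"),
    ("carol", "dave"), ("dave", "carol"),
    ("eve", "faythe"),
}

def estimate_symmetry(observed):
    # Among the observed facts sibling(x, y), count how often the symmetric
    # fact sibling(y, x) is also observed, and return the ratio.
    mirrored = sum(1 for (x, y) in observed if (y, x) in observed)
    return mirrored / len(observed)

print(estimate_symmetry(facts))  # 4 of the 5 observed facts have a mirror: 0.8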

From B back to A

Aristotle's logic is an ideal case that deals only with infinite series of events.