Inference and Universality
I
Previously I argued that ideas can be understood as inference rules, which can in turn be understood as computable functions. A mind can be represented as a collection of programs, each of which represents an inference rule. If the scheme in which these programs are expressed is computationally universal, the mind will be capable of using any conceivable inference rule. However, there are some subtleties that must be considered when choosing a scheme to express inference rules, subtleties which may not be easy to see without a more formal description of the process of inference.
I will treat a rule of inference as a function that takes in a sequence of statements, and returns a set of statements. Each inference rule is defined with respect to some statement universe, the set of all possible statements that the rule can act upon. Given a statement universe U, a rule of inference r is a function r:U*→𝒫 (U). Here U* the set of all finite sequences of elements in U (the “free monoid” or “Kleene closure” of U), and 𝒫 (U) represents the set of all subsets of U (the “power set” of U).
Given an inference rule r:U*→𝒫 (U) and a set of statements S⊂U, the first-order consequences of S with respect to r, denoted Fᵣ(S), is defined as the union of S and r(s) for all sequences s∈S*:
The first-order consequences of S include all statements in S along with all statements that can be derived by applying r a single time to a sequence of elements in S. By applying this function iteratively, we can consider the second-order consequences of a set of statements: The second-order consequences of a set S is given by Fᵣ²(S), and contains all statements that can be derived, with no more than 2 applications of r, from the statements in S. Taking this iteration to the limit, we can consider Fᵣ ᪲(S), which represents the set of all consequences that can be derived from S with any number of applications of r.
II
Let’s consider a few example inference rules. First, consider a rule a, defined with respect to the statement universe ℤ, the set of all integers. By definition, a is a function which maps from the set of all sequences of integers, ℤ* (not to be confused with the set of all non-negative integers, which is also sometimes denoted ℤ*), to the set of all sets of integers, 𝒫 (ℤ). For a sequence of integers s, a(s) will be the empty set unless s is of length 2, in which case a(s) will be a set containing a single integer equal to the sum of the two elements of s. In other words:
With a defined, we can examine the consequences that this rule implies for various sets of statements (integers, in this case). The set of first order consequences of the set {1} with respect to a is defined as:
{1}* is the set of all sequences consisting only of 1s: {1}* = {(), (1), (1, 1), (1, 1, 1), …}. a returns the empty set for sequences of any length other than 2, so of all the elements in this infinite set of sequences, a will return a non-empty set only for (1,1), in which case it will return the set containing the sum of the two integers, {2}. Therefore, Fₐ({1}) = {1} ∪ a((1,1)) = {1} ∪ {2} = {1, 2}.
We can also derive the second-order consequences of {1} with respect to a, or, equivalently, the first-order consequences of {1, 2}:
As before, {1, 2}* is an infinite set, but a will return a non-empty set for the length-2 sequences, of which there are a finite number. Specifically, there are four unique length-2 sequences in {1, 2}*: (1, 1), (1, 2), (2, 1), and (2, 2). For each of these sequences, a will return the sets {2}, {3}, {3}, {4}, respectively. Therefore, Fₐ²({1}) = Fₐ({1, 2}) = {1,2} ∪ {2} ∪ {3} ∪ {3} ∪ {4} = {1, 2, 3, 4}.
By similar logic, it can be shown that Fₐ³({1}) = {1, 2, 3, 4, 5, 6, 7, 8}, Fₐ⁴({1}) = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16}, and, more generally, that Fₐⁿ({1}) = {i∈ℤ | 1 ≤ i ≤ 2ⁿ}. The set of all consequences of {1} with respect to a is, therefore, the set of all positive integers, Fₐ ᪲({1}) = ℤ⁺.
Now consider another inference rule m. m is defined very similarly to a, but using multiplication rather than addition:
As we did with a, let’s consider the consequences of the set {1} with respect to m. m returns a non-empty set only for length-2 sequences, so the only sequence in {1}* we need to consider is (1,1). When we apply m to this sequence, we get m((1,1)) = {1*1} = {1}. Therefore, Fₘ({1}) = {1} ∪ {1} = {1}. Clearly, iterating Fₘ will have no effect, as we have immediately hit a fixed point: m does not allow us to derive any new statements from the set {1}. If we consider {2} instead of {1}, we will find that the only length-2 sequence in {2}* that matters is (2, 2), and m((2, 2)) = {4}, so Fₘ({2}) = {2} ∪ {4} = {2, 4}. Further, it can also be shown that Fₘ²({2}) = {2, 4, 8, 16}, Fₘ³({2}) = {2, 4, 8, 16, 32, 64, 128, 256}, and, in the general case, that Fₘⁿ({2}) = {2ⁱ | 1 ≤ i ≤ 2ⁿ}.
III
I’ve mentioned previously that we should only consider computable inference rules when imagining ideas that a mind might use. However, the normal notion of computability for functions is slightly too permissive for the purposes of this formalism. An inference rule should be considered computable only if the first-order consequences of any finite subset of the statement universe can be computed in finite time. To this end, I will say that an inference rule r over the statement universe U is computable if and only if it satisfies the following two criteria:
r is a total computable functions.
For any finite set S⊂U, there are, if any, only a finite number of sequences s∈S* such that r(s) is non-empty. Further, there must be some algorithm that computes, for any finite set of statements S⊂U, this finite set of sequences, or a finite superset of it.
These criteria ensure that it will be possible to compute Fᵣ(S) in finite time for any finite set of statements S (and, by induction, the same can be said for Fᵣⁿ(S) for any finite positive integer n). An inference rule that didn’t satisfy criterion 2 would not necessarily have this property, as it may be necessary to evaluate r an infinite number of times - once for each element in the infinite set S* - in order to evaluate Fᵣ. Thus, criterion 2 effectively ensures that there exists some “short-cut” that allows us to compute Fᵣ(S) with only a finite number of evaluations of r, rather than an infinite number.
The aforementioned rules a and m both satisfy these criteria, and are thus considered computable inference rules.
IV
One inference rule r:U*→𝒫 (U) can simulate another rule t:V*→𝒫 (V) if and only if there exists some encoding function E:𝒫 (V)→𝒫 (U) and decoding function D:𝒫 (U)→𝒫 (V) such that, for all finite sets of statements S⊂V, Fₜ ᪲(S) = D(Fᵣ ᪲(E(S))). Note that this does not mean that Fₜⁿ(S) must be equal to D(Fₜⁿ(E(S))) for any value of n, but only that the two functions ultimately converge to the same set as the number of iterations goes to infinity.
As an example, let’s construct a new inference rule β that can simulate both a and m. β will operate over a slightly larger statement universe than that of a or m: ℤ ∪ {A, M}. Here A and M are simply unique symbols that are distinct from any of the elements in ℤ. β will be defined as:
Whereas a and m returned non-empty sets for length-2 sequences, β returns non-emtpy sets only for length-3 sequences. More specifically, for a sequence s = (s₁,s₂,s₃), β(s) will be non-empty only when s₂ and s₃ are both integers and s₁ is either A or M. If these conditions are met, β(s) will contain a single integer: either the sum of s₂ and s₃ if s₁ = A, or the product of s₂ and s₃ if s₁ = M.
β can be thought of as having two distinct “modes”: an “addition mode”, and a “multiplication mode”. You put β into addition mode by giving a sequence starting with A, or into multiplication mode by giving it a sequence staring with M. These two modes allow β to easily simulate both a and m.
Consider how β behaves on the set {A, 1}. In finding Fᵦ({A, 1}), we only need to consider the length-3 sequences of {A,1}*, as β will return the empty set for any other sequence. Even among the length-3 sequences, we need only consider those that start with A or M and end with two integers. Thus, of all sequences in {A,1}*, there is only one for which β that will return a non-empty set: (A, 1, 1). Therefore, Fᵦ({A, 1}) = {A,1} ∪ β((A, 1, 1)) = {A, 1} ∪ {2} = {A, 1, 2}. Iterating this process, it can be shown that Fᵦ²({A, 1}) = Fᵦ({A, 1, 2}) = {A, 1, 2, 3, 4}, Fᵦ³({A, 1}) = {A, 1, 2, 3, 4, 5, 6, 7, 8}, and Fᵦ⁴({A, 1}) = {A, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16}.
Notice the similarity between the behavior of the set {A, 1} under β and the behavior of {1} under a: For each positive integer n, Fᵦⁿ({A, 1}) = Fₐⁿ({1}) ∪ {A}. More generally, for any set of integers S⊂ℤ, Fᵦⁿ({A} ∪ S) = Fₐⁿ(S) ∪ {A}. This is because, for any length-2 sequence (s₁, s₂)∈S², there is a corresponding length-3 sequence (A, s₁, s₂)∈({A} ∪ S)³ such that β((A, s₁, s₂)) = a((s₁, s₂)). Therefore, β can simulate a with the encoding function Eₐ(S) = S ∪ {A} and decoding function Dₐ(S) = S - {A}. Similarly, β can simulate m using the encoding function Eₘ(S) = S ∪ {M} and decoding function Dₐ(S) = S - {M}.
V
An inference rule is computationally universal with respect to a particular statement universe if and only if it can simulate every computable inference rule over that statement universe. A computationally universal inference rule can also simulate any finite collection of computable inference rules over the statement universe, because any two computable inference rules can be combined into a single computable inference rule (for two rules r and q over the same universe, we can define the combination of these rules as u(s) = r(s) ∪ q(s), and it can trivially be shown that this combination rule will be computable as long as r and q are computable).
While the previously introduced rule β can simulate a and m, it clearly is not computationally universal, as there are countless other rules that it cannot simulate. However, the idea of a single inference rule which can take on multiple different “modes” will help us in constructing a universal inference rule.
Let T be the set of all turing machines with the tape alphabet {0, 1, &}. For any computable inference rule r:ℤ*→𝒫 (ℤ), there will be some turing machine t∈T that implements r. A rule r takes in a sequence of integers and returns a set of integers, so the corresponding turing machine should be able to begin with a tape representing an encoded list of integers, and terminate with a tape representing an encoded set of integers. Using the alphabet {0, 1, &}, we can represent an integer as a bit-string, and a sequence of integers as a concatenation of multiple such bit-strings delineated by & symbols. A turing machine that represents a rule r must be able to, for any sequence s∈ℤ*, begin with a tape encoding s in this way, and end with a tape encoding another sequence of integers that contains each of the integers in r(s), and only those integers.
Using this family of turing machines, we can define an inference rule χ that will be computationally universal with respect to the statement universe ℤ. χ itself will be defined over the statement universe ℤ ∪ T. Let i be a function that takes in a turing machine from T and a sequence of integers, and returns the set of integers that the turing machine would return if run with the sequence of integers as an input (using the aforementioned encoding scheme). If a turing machine would not halt for the given sequence, i returns the empty set. Formally, i is a function i:T×ℤ*→𝒫 (ℤ). Using i, we can define χ:(ℤ ∪ T)*→𝒫 (ℤ ∪ T) as:
Given a sequence of elements s∈(ℤ ∪ T)*, χ(s) will be empty unless the first element in the sequence, s₁, is a member of T, and the remaining elements in the sequence are all integers. In this case, χ(s) will return the set of integers left on the tape when the turing machine s₁ halts, after starting with tape representing the sequence the integers (s₂,…sₙ).
The construction of this rule is very similar to that of the earlier rule β that was designed to simulate a and m. Just as β must be given a sequence that begins with an element of {A, M} in order to produce a non-empty result, χ must be given a sequence that begins with an element of T, and in both cases the remainder of the sequence must consist only of integers. Both β and χ can effectively be put into different “modes” depending on the first element in the provided sequence, but while β can take on only two possible modes, represented by A and M, χ can take on an infinite variety of different modes - one for each turing machine in T.
For any computable inference rule r:ℤ*→𝒫 (ℤ) there will be at least one turing machine t∈T that represents r. This means that any such r can be simulated by χ. Given a turing machine t∈T that represents r, χ can simulate r using the encoding function Eᵣ(S) = S ∪ {t} and decoding function Dᵣ(S) = S - {t}. Thus, the inference rule χ is computationally universal with respect to the statement universe ℤ.
VI
While χ is computationally universal with respect to ℤ, it is not itself a computable inference rule. It fails the first criterion of the aforementioned definition of computability because evaluating χ may involve the execution of any turing machine in T, and some turing machines may not halt, so χ is not a total computable function. Additionally, it fails second criterion, because some turing machines in T will result in a non-empty set for all possible inputs. This means that χ could not be used in compute the first-order consequences of a set of statements without potentially getting stuck in an infinite loop.
Thankfully, it’s possible to construct an inference rule, which I will call γ, that is both universal and computable. Like χ, γ will be defined using turing machines, but it will do so in a different way. While a single application of χ involves potentially executing a turing machine from start to finish, a single application of γ will involve only a single step of the execution of a turing machine. γ will be able compute the outcome produced by any halting turing machine in T, but it may take many iterated applications of γ to finish this computation. Due to this more finely-divided way of simulating the behavior of turing machines, we can guarantee that γ is a total computable function.
χ’s statement universe contained two types of elements: integers and turing machines. γ will use a larger statement universe containing four types of elements: integers, turing machines, sequences of integers, and tape-states. A tape-state includes the sequence of symbols on a turing machine’s tape, the position of the head, and the state of the head. Formally, γ’s statement universe is ℤ ∪ T ∪ ℤ* ∪ 𝕋, where 𝕋 is the set of all possible tape-states.
Let I be a function I:T×ℤ*→𝕋 that takes a turing machine and a sequence of integers as inputs, and returns the initial tape-state for that turing machine and sequence of integers. Let N:𝕋→𝕋 be the function that maps from each tape-state to the new tape-state produced after a single cycle of reading the symbol beneath the head, writing a new symbol beneath the head, changing the inner state of the head, and moving the head either left or right on the tape. Let H:𝕋→{True, False} be a function that determines whether a given tape-state represents a halted state (say, by checking if the current head state is in some pre-defined set of halting states). Let R:𝕋→𝒫 (ℤ) be the function that maps from halted tape-states to the sets of integers encoded on those halted tapes. Finally, let j:ℤ*×ℤ→ℤ* be a function that accepts a sequence of integers and a single integer as inputs, and returns a new sequence formed by appending the integer onto the end of the sequence. We can then define γ as:
This rule is computable, as it clearly satisfies the two criteria of computability provided earlier. To show that a rule is computationally universal with respect to the statement universe ℤ, it suffices to show that it can simulate χ. To understand how γ can simulate χ, let’s first examine how γ behaves on a few different sets of statements.
First consider the set {t, z}, where t∈T and z∈ℤ*. In finding Fᵧ({t, z}), there will only be one sequence in {t, z}* for which γ will return a non-empty set: (t, z). When provided with this sequence, γ will invoke I on t and z, which will result in an initial tape-state for the machine t with the sequence z encoded on the tape. I’ll call this tape-state 𝕥₁, so Fᵧ({t, z})={t, z, 𝕥₁}. If we apply Fᵧ once more, there will be one new sequence for which γ will return a non-empty set: (𝕥₁). Presuming that the turing machine does not start out in a halted state, i.e. presuming ¬H(𝕥₁), γ will, when provided with the sequence (𝕥₁), invoke N to produce a new tape-state N(𝕥₁) = 𝕥₂, so Fᵧ²({t, z}) = Fᵧ({t, z, 𝕥₁}) = {t, z, 𝕥₁, 𝕥₂}. As we continue to iteratively apply Fᵧ, we will be able to derive a new tape-state on each step from the tape-state derived in the previous step. If t does not halt on the input z, then this process will continue forever, but if it does halt, we will eventually reach some tape-state 𝕥ₙ such that H(𝕥ₙ). Once we reach this point, i.e. when our current set of statements is {t, z, 𝕥₁, 𝕥₂, … 𝕥ₙ}, if we apply Fᵧ one final time, rather than invoking N on 𝕥ₙ as it did with the previous tape-states, γ will instead invoke R on 𝕥ₙ, which will result in a set of integers decoded from 𝕥ₙ’s tape. This means that the final set that the Fᵧ will converge on when started with {t, z} is Fᵧ ᪲({t, z}) = {t, z, 𝕥₁, 𝕥₂, … 𝕥ₙ} ∪ R(𝕥ₙ).
In summary, if we begin with a sequence of integers and turing machine that halts on that sequence, we can, by iteratively applying γ, eventually derive the set of integers that the machine would return when run on that sequence.
Now consider how Fᵧ behaves on a set like {0, ()}. When γ is applied to a sequence consisting of a (possibly empty) sequence of integers and a single integer, like ((), 0), it uses j to create a new sequence consisting of the provided sequence with the provided integer appended onto the end, e.g. j((), 0) = (0). Iteratively applying Fᵧ therefore results in sequences of arbitrary length:
Fᵧ({0, ()}) = {0, (), (0)}
Fᵧ²({(0, ()}) = {0, (), (0), (0, 0)}
Fᵧ³({0, ()}) = {0, (), (0), (0, 0), (0,0,0)}
Fᵧ ᪲({0, ()}) = {0} ∪ {0}*
If we instead use a set containing multiple different integers, like {0, 1, ()}, iteratively applying Fᵧ will ultimately result in a set containing every possible sequence of those integers:
Fᵧ({0, 1, ()}) = {0, 1, (), (0), (1)}
Fᵧ²({0, 1, ()}) = {0, 1, (), (0), (1), (0, 0), (0, 1), (1, 0), (1,1)}
Fᵧ ᪲({0, 1, ()}) = {0, 1} ∪ {0, 1}*
For γ to be able to simulate χ, there must exist some encoding and decoding functions Eᵧ and Dᵧ such that, for any finite set S of statements in in χ’s statement universe ℤ ∪ T, Dᵧ(Fᵧ ᪲(Eᵧ(S))) = Fᵪ ᪲(S). This can be accomplished with the encoding function Eᵧ(S) = S ∪ {()} and decoding function Dᵧ(S) = S ∩ (ℤ ∪ T).
Imagine that we had some set of statements S∈ℤ ∪ T, and we wanted to use Fᵧ to predict Fᵪ ᪲(S). First, we would encode the set with Eᵧ to produce Eᵧ(S) = S ∪ {()}. As we saw before, whenever we iteratively apply Fᵧ on a set that contains the empty sequence () and some collection of integers, it will eventually produce every possible sequence of those integers. Additionally, whenever we iteratively apply Fᵧ on a set containing a halting turing machine and a sequence of integers, it will eventually compute the set of integers that results from running the turing machine on that sequence. These two facts together imply that anything which can be computed by iteratively applying Fᵪ can also be computed by iteratively applying Fᵧ, meaning Fᵧ ᪲(Eᵧ(S)) will contain every statement that Fᵪ ᪲(S) contains. The decoding function Dᵧ(S) = S ∩ (ℤ ∪ T) can then be used to remove any extraneous elements not present in χ’s statement universe (i.e. those in ℤ* or 𝕋), and so Fᵪ ᪲(S) = Dᵧ(Fᵧ ᪲(Eᵧ(S))).
Thus, γ is not only computable, but also computationally universal with respect to ℤ. The fact that γ is computable means that we can always compute the first-order consequences of a statement set with respect to γ in finite time, and this along with the fact that γ is universal means that γ could, given enough time and memory capacity, be used to perfectly predict, for any finite n, the nth-order consequences of any other computable inference rule on any set of statements in ℤ.
VII
Earlier I said that a mind can be understood as a collection of programs representing inference rules, and so long as the scheme in which these programs are expressed is universal, the mind will be capable of effectively using any computable inference rule. With the notion of a computationally universal inference rule defined, I can now restate this more formally.
A mind is a collection of statements equipped with an inference rule, and the mind uses it’s inference rule to explore the consequences of it’s statements. A mind equipped with a computationally universal inference rule would, by definition, be able to act as though it were equipped with any other computable inference rule, or any set of computable inference rules. In other words, the mind would be capable of supporting any conceivable process of inferential reasoning. In this post I have shown that there exists at least one inference rule, γ, that is both computable and computationally universal, meaning that it could be used to implement such a mind.
In the next post I’ll go into more detail about the structure of the population of statements within a mind, and how this population of statements evolves over time.