Appendix: Probabilities and statistics
The language of inference; the language of science.
Good mathematicians see analogies between theorems; great mathematicians see analogies between analogies.
—S. Banach, via E. T. Jaynes, Probability Theory: The Logic of Science.
A computer deals with information. It also deals with uncertainty. As a “computing machine,” much of what a computer should do is be able to infer: to advance its state of knowledge from prior assumptions.
Likewise, a designer must have good sense and guess what will delight people. The methods of probability theory are the straight-edge to their curious intuition. They are tools for inference: for advancing our state of knowledge.
1. Propositions are the basic units of uncertainty
A proposition is any statement that may be either true or false. A probability is a number assigned (mapped; given) to a proposition. It represents how plausible (likely) it is that the proposition is true. Thus it represents a state of knowledge attached to a given proposition.
By convention, probabilities are real numbers and are in the interval (range; continuum) \(0\) to \(1\). A probability of \(0\) means that the proposition is certainly false. A probability of \(1\) means that the proposition is certainly true. Probabilities may be stated without being known, and in fact most probabilities are not known. The power of probabilities is that they may be compared with each other. My odds are better than yours.
We will represent propositions using capital letters, \(A, B, C, \ldots, Z\). We will represent probabilities (remember, a number assigned to each proposition representing how plausible it is, which may be unknown), using the notation \(P(A)\).
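As a minimal illustration of this notation, a probability assignment can be treated as a mapping from propositions to numbers in \([0, 1]\) that can then be compared. The propositions and values in this Python sketch are invented for the example, not taken from the text:

```python
# Purely illustrative: propositions as labels, probabilities as numbers
# assigned to them. The values are assigned, not measured; they represent
# a state of knowledge.
P = {
    "A": 0.9,  # e.g. A = "it will rain tomorrow"
    "B": 0.2,  # e.g. B = "the bus arrives on time"
}

# By convention every probability lies in the interval [0, 1].
assert all(0.0 <= p <= 1.0 for p in P.values())

# Even assigned (rather than measured) probabilities can be compared.
if P["A"] > P["B"]:
    print("A is judged more plausible than B")
```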
Given a set of initial probabilities representing our “starting assumptions” (background knowledge; prior information), we may use the rules of probability (written out below) to calculate the probabilities of other, related propositions. Seen this way, probability theory is an extension of logic. Note that thus far we have treated probabilities as a representation of our state of knowledge, not as a construct of the physical world. In his book, Probability Theory: The Logic of Science, E. T. Jaynes advocated this way of using the tool:
In our terminology, a probability is something that we assign, in order to represent a state of knowledge, or that we calculate from previously assigned probabilities according to the rules of probability theory. A frequency is a factual property of the real world that we measure or estimate. The phrase ’estimating a probability’ is just as much a logical incongruity as ’assigning a frequency’ or ’drawing a square circle’.
The fundamental, inescapable distinction between probability and frequency lies in this relativity principle: probabilities change when we change our state of knowledge; frequencies do not.
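The rules of probability referred to above are, in Jaynes's standard formulation, the product rule and the sum rule. They are not written out in this section; for concreteness, here they are, writing \(X\) for the prior information (the symbol introduced in the next section):
\[ P(AB|X) = P(A|BX)\,P(B|X) = P(B|AX)\,P(A|X) \]
\[ P(A|X) + P(\bar{A}|X) = 1 \]
From these follows \(P(A + B|X) = P(A|X) + P(B|X) - P(AB|X)\), which reduces to simple addition when the propositions are mutually exclusive, as used later for the interval probability.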
2. Inference as a tool for thinking
To form a judgement about the likely truth or falsity of any proposition \(A\), the correct procedure is to calculate the probability that \(A\) is true:
\[ P(A|E_1 E_2 \ldots) \]
conditional on all the evidence at hand.
— E. T. Jaynes, Probability Theory: The Logic of Science.
Probabilities are useful because they allow us to infer: to draw conclusions from propositions which are admitted or supposed to be true. The word derives from the Latin inferens, “advancing,” and indeed we are advancing our state of knowledge from what we had before. This had been understood by the time Francis Bacon published his Essays.
When we apply statistical methods to any problem, we will have both background information and new information (the data). The background knowledge, \(X\), is simply all the information that is not the data. Every probability is conditioned on at least this background information; that is, there are no absolute probabilities.
Any probability conditioned only on \(X\) is a prior probability, \(P(A|X)\).
Given the likelihood \(P(D|\theta X)\), the probability of the data \(D\) given the hypothesis parameters \(\theta\), the posterior probability is the probability of the parameters given the data (evidence): \[ P(\theta|DX) \]
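For reference, the prior, likelihood, and posterior are related by Bayes' theorem, written here in the notation above:
\[ P(\theta|DX) = \frac{P(\theta|X)\, P(D|\theta X)}{P(D|X)} \]
where the denominator \(P(D|X)\), the probability of the data given only the background information, serves as a normalizing constant.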
3. Probabilities of continuous functions
If \(f\) is a continuously variable real parameter of interest, we may use the propositions
\[ F' \equiv (f \le q) \] \[ F'' \equiv (f > q) \]
Then we may define a function of this \(q\), noting that the probability of \(F'\) must also depend on some prior information \(Y\):
\[G(q) \equiv P(F'|Y) \]
Then what is the probability that \(f\) lies in any specified interval \(a < f \le b\)? Define \[A\equiv (f \le a)\] \[B\equiv (f\le b)\] \[W\equiv (a < f \le b)\]
Then \(B = A + W\) (where \(+\) denotes logical disjunction), and since \(A\) and \(W\) are mutually exclusive:
\[ P(B|Y) = P(A|Y) + P(W|Y) \]
Since \(P(B|Y) = G(b)\) and \(P(A|Y) = G(a)\), we have
\[ P(a < f \le b|Y) = P(W|Y) = G(b) - G(a) \]
If \(G(q)\) is continuous and differentiable,
\[ P(a < f \le b|Y) = \int_a^b g(f)\, df \]
where \(g(f) = \frac{d}{df} G(f) \ge 0\), the probability density function (pdf; probability distribution function) for \(f\).
The pdf for \(f\) is a function defined such that integrating it over a range (area) gives the probability that this variable real parameter \(f\) lies within that range. It is the probability distribution for \(f\). Note that \(f\) is an unknown constant parameter: the parameter itself is not distributed, but the probability (our state of knowledge of where \(f\) lies) is.
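To make the relation \(P(a < f \le b|Y) = G(b) - G(a) = \int_a^b g(f)\,df\) concrete, here is a minimal Python sketch. The choice of a Gaussian for \(g\), with mean 0 and standard deviation 1, is purely illustrative and not implied by the text; the names g, G, and prob_interval simply mirror the notation above.

```python
import math

# Illustrative assumption: our state of knowledge about f is Gaussian.
mu, sigma = 0.0, 1.0

def g(f):
    """Probability density g(f) = dG/df for the assumed Gaussian."""
    return math.exp(-0.5 * ((f - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def G(q):
    """Cumulative probability G(q) = P(f <= q | Y)."""
    return 0.5 * (1.0 + math.erf((q - mu) / (sigma * math.sqrt(2.0))))

def prob_interval(a, b, n=10_000):
    """P(a < f <= b | Y) by numerically integrating g over [a, b] (midpoint rule)."""
    width = (b - a) / n
    return sum(g(a + (i + 0.5) * width) for i in range(n)) * width

a, b = -1.0, 2.0
print(G(b) - G(a))          # difference of the cumulative function
print(prob_interval(a, b))  # numerical integral of the density
```

The two printed numbers should agree closely, which is just the statement that the density is the derivative of the cumulative function.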
4. The role of the ’probability distribution’
A probability distribution has a demonstrable information content. Its imagined, and irrelevant, frequency connections are not important.
5. The sampling problem
6. Distributions
In the literature