Professor Terry Speed’s AMSI-SSAI Lecture today at the Knibbs theatre provokes the following reflection.
Nuisances crowd out the signal – this is as true in genomics (or any of the bioinformatic omics spawned therefrom: proteomics, metabolomics, transcriptomics) as it is in modern official statistics, handmaiden to policy and socio-econometric modelling.
Nuisance, however, deserves attention. In an ideal world all data provided in statistical returns are simultaneously correct, perfectly recorded and faithfully transmitted. Furthermore, the design of this ideal collection is itself perfect: the data collected are sufficient to answer the questions posed by users in their collectivity, without altering the inclination of respondents to cooperate, or their behaviour in doing so. That is, the measurement process is dimensionless.
No one pretends that these conditions hold, or even approximately hold.
Instead, the data resulting from the collection effort are conditioned by a quality framework that allows the nuisance to recede into the background. Official releases thus come with two crutches: first, formal rules of population inference – what can be inferred, its accuracy (centring on a true value) and its precision (the width of the interval around an estimate containing the true value with a stated confidence); and second, adherence to the nuisance-containing practices embodied in the collection operation.
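The distinction between accuracy and precision is easiest to see numerically. The sketch below, in Python, assumes a simple random sample drawn without replacement from a hypothetical population: the design-based 95 per cent interval expresses precision, and its centring on the population mean expresses accuracy. The population, sample size and confidence level are illustrative assumptions, not any agency's actual practice.

```python
import numpy as np

# Illustrative design-based inference under simple random sampling
# without replacement (all quantities here are made up for the sketch).
rng = np.random.default_rng(0)

population = rng.gamma(shape=2.0, scale=500.0, size=10_000)  # hypothetical variable
true_mean = population.mean()                                # the "true value"

n = 400
sample = rng.choice(population, size=n, replace=False)

estimate = sample.mean()
fpc = 1 - n / population.size                 # finite population correction
se = np.sqrt(fpc * sample.var(ddof=1) / n)    # design-based standard error

# Precision: width of the interval expected to contain the true value
# with roughly 95% confidence under the design.
ci = (estimate - 1.96 * se, estimate + 1.96 * se)
print(f"true mean {true_mean:.1f}, estimate {estimate:.1f}, "
      f"95% CI ({ci[0]:.1f}, {ci[1]:.1f})")
```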
These practices comprise the design. And this explains why official statistics is stubbornly design-based, even as statistics proper has struck out into the protean world of model building and model-based inference.
Both model-based and design-based approaches have been compromised by nuisance effects, despite loud and repeated appeals to ‘scientific method’ or ‘quality assurance’ respectively. In the one case, data richness (and sample size) and spurious replicability obscured the real limitations of data acquisition; in the other, the drag induced by quality assurance demanded a stability in underlying processes which has patently been compromised in an external context of open data borders.
Can the negative control method so elegantly applied in bioinformatics save official statistics too? Or rather, if we take nuisance more seriously, may we be inspired to find a more solid platform for the presentation of statistics used in public discourse?
To restate the issue slightly differently: how do we extract a consistent, reliable and useful signal, of bearing on social governance, from a multiplicity of data frames, when the criterion for signal quality (analogous to the deeper scientific truths underpinning bioinformatics, or the statistical investigation of physical or chemical phenomena) is encoded in the legislative ethos of government itself?
This not only allows nuisance but assumes it: the act of reducing an uncontrolled flow to a signal under metastatistical protocols (such as pre-existing or circumstantially imposed indicator series, or standards) is the badge of official statistics, best expressed by appeal to design. Certainly it is possible to improve on the theory, most transparently by reviewing how deviations from design (for instance, dealing with overlapping discordant collections) build a core assurance mechanism.
It happens that the methods put forward by Professor Speed in bioinformatics, and the discordancy-accepting extension results that can be built from the geometric basis for sampling theory in Paul Knottnerus’ text, play similar roles in their respective contexts. In both cases a fresh appraisal of the context in which statistics is applied has led to results with immediate application as well as great generality.
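To make the negative-control idea concrete, here is a minimal sketch in Python: a handful of variables are assumed to carry no signal of interest, nuisance factors are estimated from those columns alone, and then regressed out of the whole data matrix. The simulated data, the choice of two factors and the regression step are illustrative assumptions in the spirit of removing unwanted variation, not a reproduction of Professor Speed's published method.

```python
import numpy as np

# Sketch of the negative-control idea: columns 0..n_controls-1 are assumed
# to contain no signal, so whatever structure they share must be nuisance.
rng = np.random.default_rng(1)

n_samples, n_vars, n_controls = 200, 50, 10
nuisance = rng.normal(size=(n_samples, 2))            # unobserved batch-like effects
loadings = rng.normal(size=(2, n_vars))
signal = np.zeros((n_samples, n_vars))
signal[:, n_controls:] = rng.normal(size=(n_samples, n_vars - n_controls))

Y = signal + nuisance @ loadings + 0.1 * rng.normal(size=(n_samples, n_vars))

controls = Y[:, :n_controls]                          # assumed signal-free columns

# Estimate nuisance factors from the controls alone (leading left singular vectors),
# then regress those factors out of every column.
U, _, _ = np.linalg.svd(controls, full_matrices=False)
W = U[:, :2]
beta, *_ = np.linalg.lstsq(W, Y, rcond=None)
Y_adjusted = Y - W @ beta

# Check how much of the simulated nuisance survives in the non-control columns.
corr = np.corrcoef(np.c_[Y_adjusted[:, n_controls:], nuisance].T)
print("mean |correlation| with nuisance after adjustment:",
      round(float(np.abs(corr[:-2, -2:]).mean()), 3))
```

In the official-statistics analogy, the ‘controls’ might be series whose values are fixed or independently audited outside the collection in question, though the mapping is left open here.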
References
Knottnerus, P., Sample Survey Theory: Some Pythagorean Perspectives, Springer, 2003.