## Speaking significance to power

Will Lowe (2014-06-30 16:53)

### Significance

Most undergraduate methods textbooks give the impression that there is
only one form of statistical inference. It involves defining a
stochastic model of the data generation process and for each
interesting parameter constructing a statistic whose distribution
under some ‘null’ hypothesis is known. After the observations are
made, values of that statistic that would have been sufficiently
unlikely under the null lead to us rejecting it. The relevant
numerical operationalisation of ‘sufficiently unlikely’ - *significance*
- is typically some function of the tail areas of the statistic’s
sampling distribution. Naturally, the null may be true but
nevertheless rejected on the grounds of having produced an unlikely
statistic. That happens. *The trick is just to make sure it doesn’t
happen too often.*

It takes rather a long time to explain all this. And even longer to explain it correctly.

### Power

Rather less often there is some discussion of the opposite type of
error: failing to reject a null hypothesis that should be rejected
because it is in fact false. Before any observations are taken or
analysis done, one minus the probability of making this error is
called *power.* As you would expect, power is a function of the amount
of data available for estimation, the structure of the model whose
parameters are tested and, most problematically, the distance between
where the parameter really is and where the null hypothesis thinks it is.
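
To make those three ingredients concrete, here is a minimal sketch for about the simplest case available: a one-sided z-test of a normal mean with known standard deviation. The function name and its arguments (`delta` for the distance between the true mean and the null value, `sigma`, `n`, `alpha`) are illustrative choices, and the normal model is an assumption rather than anything the argument depends on.

```python
from statistics import NormalDist


def z_test_power(delta, sigma, n, alpha=0.05):
    """Power of a one-sided z-test of H0: mu = mu0 when the truth is mu0 + delta.

    Assumes normally distributed observations with known sigma.
    """
    std_normal = NormalDist()
    critical = std_normal.inv_cdf(1 - alpha)  # rejection threshold on the z scale
    shift = delta * n ** 0.5 / sigma          # how far the true mean moves the statistic
    return 1 - std_normal.cdf(critical - shift)


# Power as the sample size grows, for a fixed (and in practice unknown) delta.
for n in (10, 25, 100):
    print(n, round(z_test_power(delta=0.5, sigma=1.0, n=n), 3))
```

Holding everything else fixed, power grows with `n` and with `delta`, and falls back to `alpha` when `delta` is zero. The awkward part is that `delta` is exactly the thing we don't know.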

With great power comes great certainty, or at least greater certainty about parameter values, greater chances of replication, and doubtless many other great things. Conversely, studies lacking reasonable power are common enough to elicit articles and editorials demonstrating the Dire Consequences for Science.

So far, so apparently straightforward.

It can then seem odd that Fisher - originator of the significance concept - didn’t much care for power, and that Neyman and Pearson - originators of the power concept - didn’t think much of significance. Moreover, both sides had perfectly coherent reasons for not caring about the other. How does that work?

The story is well told in several places, but the short version is
this: the methods textbook describes a marriage of convenience. But
now that statistics as a field is all grown up, perhaps it’s time to
admit the differences and quietly apply for a no-fault divorce. (There
*are* three people in this marriage, but that’s not the
problem.)

### Thinking…

For Fisher, the aim of hypothesis testing is to make an ‘inductive
inference’ about a parameter. A null hypothesis is set up and a
statistic chosen in such a way that it is clear what we should expect
from this statistic when that hypothesis is true. Data is collected,
the statistic computed and a significance level reported. This is the
probability that the statistic would take a value at least as extreme
as the one observed if that hypothesis were true. If it is small, the
result of the test is an *inference* that either that hypothesis is
false or it is true but something unlikely happened. In short, Fisher
asks whether we might, or might not, want to change how we think about
the world, after doing a particular experiment.
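
As a rough illustration of that recipe, the sketch below computes an observed significance level for a test of a normal mean. The two-sided tail, the known `sigma`, and the function name are all assumptions made for the example; Fisher's reasoning does not depend on any of them.

```python
from statistics import NormalDist


def observed_significance(sample_mean, mu0, sigma, n):
    """Two-sided p-value for H0: mu = mu0, assuming normal data with known sigma."""
    z = (sample_mean - mu0) / (sigma / n ** 0.5)
    # Probability, under the null, of a statistic at least as extreme as the one observed.
    return 2 * (1 - NormalDist().cdf(abs(z)))


# One particular experiment: 25 observations averaging 0.392, null value 0.
print(round(observed_significance(sample_mean=0.392, mu0=0.0, sigma=1.0, n=25), 3))
```

Note what is absent: no alternative hypothesis, no threshold fixed in advance, and no instruction about what to do next. The number is a summary of this experiment and nothing more.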

### versus doing

Neyman and Pearson - let’s call them collectively NP - think instead in terms of a sequence of tests that will be run repeatedly on new sets of data. They ask what would be the right ‘inductive behaviour’ if we wanted to minimise different types of errors. In the simplest case there are two exclusive and exhaustive hypotheses, treated symmetrically. In advance of any experimentation the experimenter determines a suitable statistic, and specifies two things: first, the largest allowable probability of mistaking the one hypothesis for the other, and second, the largest allowable probability of mistaking the other for the one. These are naturally thought of as tracking the different costs of acting on the basis of the one hypothesis when the other was true.

For NP the result of an analysis is then not an inference but a
decision to *behave as if* the first hypothesis were true, or as if
the second one were true.
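
A minimal sketch of that workflow, again assuming normal data with known `sigma` and two simple hypotheses about the mean (`mu0` and a larger `mu1`). The names `np_design` and `np_decide` are hypothetical, and `alpha` and `beta` stand for the two allowable error probabilities fixed in advance.

```python
import math
from statistics import NormalDist


def np_design(mu0, mu1, sigma, alpha, beta):
    """Sample size and cutoff for choosing between mu0 and mu1 (with mu1 > mu0).

    alpha: largest allowable probability of acting as if mu1 when mu0 is true.
    beta:  largest allowable probability of acting as if mu0 when mu1 is true.
    """
    z = NormalDist().inv_cdf
    n = math.ceil(((z(1 - alpha) + z(1 - beta)) * sigma / (mu1 - mu0)) ** 2)
    cutoff = mu0 + z(1 - alpha) * sigma / math.sqrt(n)
    return n, cutoff


def np_decide(sample_mean, cutoff):
    """The output is a behaviour, not a measure of evidence."""
    return "act as if H1" if sample_mean > cutoff else "act as if H0"


# Everything here is settled before any data arrive.
n, cutoff = np_design(mu0=0.0, mu1=0.5, sigma=1.0, alpha=0.05, beta=0.2)
print(n, round(cutoff, 3))
print(np_decide(sample_mean=0.4, cutoff=cutoff))
```

Once the design is fixed, a sample mean just over the cutoff and one far beyond it lead to exactly the same action.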

This probably sounded more hard-headed and excitingly Popperian back when behaviourism of all sorts was cool. But it’s not the only, or even the biggest, disagreement. There’s a larger one concerning the division of labour. Fisher and NP have different views of what should happen when scientists and statisticians collaborate.

### Who does the dishes?

For Fisher there is no need for the statistician to specify a *second*
hypothesis as an alternative to the null because pondering alternative
explanations is a matter for scientists, not for
statisticians. Consequently it is also not a part of the
statistician’s job to formalise power relative to various
alternatives. And because the statistician is advising on suitable
inference, not suitable behaviour, no special action is required after
a statistical analysis, however it turns out.

In contrast, for NP the entire point is to tell scientists what to
*do*, using a procedure that has some ‘long run’ guarantee of telling
them something sensible - in this case, to accept one or other hypothesis. But
unless we are studying industrial quality control - something NP
apparently *really* liked to think about - then this ‘long run’ of
trials will usually be imaginary.

### Doing the dishes versus doing *these* dishes

Because NP think about procedures rather than particular trials,
Fisherian significance levels *do not matter* to them. That’s
because the statistical guarantees about different types of errors
hold for all the trials in a (maybe imaginary) ‘long run’ but *not*
for the results of any one of them that managed to burst into
actuality.
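
One way to see the difference is to simulate an imaginary long run. The sketch below assumes a true null, normal data with unit variance, and a two-sided z-test; the function name, the seed, and the number of trials are arbitrary choices for illustration.

```python
import random
from statistics import NormalDist, fmean


def long_run_p_values(trials=20000, n=10, seed=1):
    """Two-sided z-test p-values from many experiments in which the null is true."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    p_values = []
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        z = fmean(sample) * n ** 0.5  # mu0 = 0 and sigma = 1 by construction
        p_values.append(2 * (1 - std_normal.cdf(abs(z))))
    return p_values


p_values = long_run_p_values()
alpha = 0.05
rejection_rate = sum(p < alpha for p in p_values) / len(p_values)
print(f"long-run rejection rate: {rejection_rate:.3f}")
print(f"smallest and largest p: {min(p_values):.4f}, {max(p_values):.4f}")
```

The rejection rate settles near `alpha`, which is the property of the procedure that NP care about, while the individual significance levels are scattered across the whole unit interval. The guarantee describes the column, not any one row.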

### Consequences

We might summarise by saying that while Fisher is happy to advise on
the analysis of any particular experiment he’d rather leave the
evaluation of the general research program to the scientists involved,
whereas NP are happy to advise on the evaluation of a whole grant full
of experiments, but prefer not to make any claims about any one of
them. Unsurprisingly, significance is a reasonable concept for
Fisher’s purposes and power is a reasonable concept for NP’s
purposes. But they are not the *same* purposes.

Looking at the off-diagonal cases in our exposition, Fisher could
reasonably complain that a pre-experimentally determined ‘long run’
guarantee that we won’t often mistakenly act as if we’ve ruled out
some hypothesis doesn’t tell us what we want to know about the
experiment we’ve actually just done. In the jargon: $\alpha$ is not $p$
(more on this below). And NP could reasonably complain that
calculating the power of an experiment *after* it has actually
happened makes no sense (although you can always find someone who
does).

Both parties could of course also unreasonably complain. Indeed, that seems to have been a speciality of this relationship. Who knew they’d be posthumously married?

### Cross purposes

In short, both parties have views that are internally consistent:
observed significance can be useful when we’re trying to *infer*
something after a particular set of observations, but only error rates,
and so power, matter if we’re deciding what to *do* on the basis of a
pile of them.

Lest we draw this distinction too sharply, it seems that the
historical Fisher would prefer us to change our behaviour only after a
scientifically convincing sequence of actual tests and actual
rejections. And as it happens, any respectable null hypothesis *also*
makes predictions about what would happen in an infinite, and thus
non-actual, sequence of tests. But arguably that is a happy accident, not
the main point.

On the other side of things, it must be noted that NP have a pre-experimental analogue of statistical significance: $\alpha$. Just as one minus power is the probability of accepting the null hypothesis when actually the alternate is true, $\alpha$ is the corresponding probability of accepting the alternate hypothesis when the null is true. Obviously $\alpha$ and the $p$-value are different.

I have muttered elsewhere and to any students that would listen: the root of much controversy in applied statistics is the habit of computing $p$ then pretending it was $\alpha$. But to be fair, current advice is a mess.

What many students learn is that they should report (Fisherian) significance, but reject the hypothesis only if it reaches an (NP-style) threshold imposed beforehand. Once they’ve done this they should act (like NP) as though the hypothesis is false and write a paper where they report (Fisher-style) every different parameter they tested. If they are unlucky the work will be criticised (NP-style) for having low power, but it might still squeak through if their (Fisherian) significance levels are high enough. In short, if you don’t get what you want from one parent, try the other.

### Implications for Science

An interesting twist occurs when we consider researchers and journal editors. As researchers, we want to understand what we are studying and can run whatever tests serve that purpose. The editor needs to know whether to publish something that has been submitted. For us, a Fisherian approach might be appropriate, but the editor really does have an NP-style decision problem. The ‘long run’ for an editor is half a dozen articles per issue, for four issues a year, for as long as people keep sending things.