Speaking significance to power
Most undergraduate methods textbooks give the impression that there is only one form of statistical inference. It involves defining a stochastic model of the data generation process and, for each interesting parameter, constructing a statistic whose distribution under some 'null' hypothesis is known. After the observations are made, values of that statistic that would have been sufficiently unlikely under the null lead to us rejecting it. The relevant numerical operationalisation of 'sufficiently unlikely' - significance - is typically some function of the tail areas of the statistic's sampling distribution. Naturally, the null may be true but nevertheless rejected on the grounds of having produced an unlikely statistic. That happens. The trick is just to make sure it doesn't happen too often.
It takes rather a long time to explain all this. And even longer to explain it correctly.
Rather less often there is some discussion of the opposite type of error: failing to reject a null hypothesis that should be rejected because it is in fact false. Before any observations are taken or analysis done, one minus the probability of making this error is called power. As you would expect, power is a function of the amount of data available for estimation, the structure of the model whose parameters are tested and, most problematically, the distance between where the parameter really is and where the null hypothesis thinks it is.
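To make that dependence concrete, here is a minimal sketch for one deliberately simple case: a one-sided z-test of a mean with known standard deviation. The function name `z_test_power`, its defaults, and the choice of test are all assumptions made for illustration, not anything the textbook story commits us to.

```python
from math import sqrt
from statistics import NormalDist

def z_test_power(true_mean, null_mean=0.0, sigma=1.0, n=25, alpha=0.05):
    """Power of a one-sided z-test of H0: mean = null_mean against larger means.

    Assumes normally distributed observations with known standard deviation sigma.
    """
    std_normal = NormalDist()
    critical_z = std_normal.inv_cdf(1 - alpha)          # rejection threshold on the z scale
    shift = (true_mean - null_mean) * sqrt(n) / sigma   # how far the truth moves the statistic
    return 1 - std_normal.cdf(critical_z - shift)

# Power rises with the sample size and with the distance between truth and null.
for n in (10, 50, 200):
    for true_mean in (0.1, 0.3, 0.5):
        power = z_test_power(true_mean, n=n)
        print(f"n={n:3d}  true mean={true_mean:.1f}  power={power:.3f}")
```

Note the awkward argument: `true_mean` is exactly the thing we do not know, which is why the third ingredient is the problematic one.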
With great power comes great certainty, or at least greater certainty about parameter values, greater chances of replication, and doubtless many other great things. Conversely, studies lacking reasonable power are common enough to elicit articles and editorials demonstrating the Dire Consequences for Science.
So far, so apparently straightforward.
It can then seem odd that Fisher - originator of the significance concept - didn't much care for power, and that Neyman and Pearson - originators of the power concept - didn't think much of significance. Moreover, both sides had perfectly coherent reasons for not caring about the other. How does that work?
The story is well told in several places, but the short version is this: the methods textbook describes a marriage of convenience. But now that statistics as a field is all grown up, perhaps it's time to admit the differences and quietly apply for a no-fault divorce. (There are three people in this marriage, but that's not the problem.)
For Fisher, the aim of hypothesis testing is to make an 'inductive inference' about a parameter. A null hypothesis is set up and a statistic chosen in such a way that it is clear what we should expect from this statistic when that hypothesis is true. Data is collected, the statistic computed and a significance level reported. This is the probability that the statistic would take a value at least as extreme as the one observed if that hypothesis were true. If it is small, the result of the test is an inference that either that hypothesis is false or it is true but something unlikely happened. In short, Fisher asks whether we might, or might not, want to change how we think about the world after doing a particular experiment.
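As a rough sketch of the Fisherian recipe, again assuming a one-sided z-test with known standard deviation, where 'extreme' means 'large': the data below are made up, and `one_sided_p_value` is a hypothetical helper, not a standard library function.

```python
from math import sqrt
from statistics import NormalDist, mean

def one_sided_p_value(sample, null_mean=0.0, sigma=1.0):
    """Probability, computed as if the null were true, of a statistic at least this large."""
    z = (mean(sample) - null_mean) * sqrt(len(sample)) / sigma
    return 1 - NormalDist().cdf(z)

# Made-up observations, for illustration only.
observations = [0.8, -0.3, 1.4, 0.6, 0.2, 1.1, -0.5, 0.9]
print(f"significance level (p) = {one_sided_p_value(observations):.4f}")
```

Nothing here mentions an alternative hypothesis, a threshold, or what to do next. The number is computed after the fact, from this one data set, and the interpretation is left to whoever is reading it.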
Neyman and Pearson - let's call them collectively NP - think instead in terms of a sequence of tests that will be run repeatedly on new sets of data. They ask what would be the right 'inductive behaviour' if we wanted to minimise different types of errors. In the simplest case there are two exclusive and exhaustive hypotheses, treated symmetrically. In advance of any experimentation the experimenter determines a suitable statistic and specifies two things: first, the largest allowable probability of mistaking the one hypothesis for the other, and second, the largest allowable probability of mistaking the other for the one. These are naturally thought of as tracking the different costs of acting on the basis of the one hypothesis when the other was true.
For NP the result of an analysis is then not an inference but a decision to behave as if the first hypothesis were true, or as if the second one were true.
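A minimal sketch of the NP workflow, assuming two simple hypotheses about a normal mean with known standard deviation and the alternative mean larger than the null mean. The names `np_design` and `np_decide` and the example numbers are hypothetical, chosen only to show the order in which things get fixed.

```python
from math import ceil, sqrt
from statistics import NormalDist, mean

STD_NORMAL = NormalDist()

def np_design(null_mean, alt_mean, sigma, alpha, beta):
    """Choose a sample size and a cutoff for the sample mean before seeing any data.

    alpha: largest allowed probability of acting on the alternative when the null is true.
    beta:  largest allowed probability of acting on the null when the alternative is true.
    Assumes alt_mean > null_mean.
    """
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha)
    z_beta = STD_NORMAL.inv_cdf(1 - beta)
    n = ceil(((z_alpha + z_beta) * sigma / (alt_mean - null_mean)) ** 2)
    cutoff = null_mean + z_alpha * sigma / sqrt(n)
    return n, cutoff

def np_decide(sample, cutoff):
    """Return an action rather than a measure of evidence."""
    return "act as if alternative" if mean(sample) > cutoff else "act as if null"

n, cutoff = np_design(null_mean=0.0, alt_mean=0.5, sigma=1.0, alpha=0.05, beta=0.20)
print(f"collect n={n} observations; act on the alternative if the mean exceeds {cutoff:.3f}")
```

Everything that matters is settled in `np_design`, before any data exist. Once the sample arrives, `np_decide` only reports which side of the cutoff it fell on; how far past the cutoff it landed plays no role.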
This probably sounded more hard-headed and excitingly Popperian back when behaviourism of all sorts was cool. But it's not the only, or even the biggest, disagreement. There's a larger one concerning the division of labour. Fisher and NP have different views of what should happen when scientists and statisticians collaborate.
Who does the dishes?
For Fisher there is no need for the statistician to specify a second hypothesis as an alternative to the null because pondering alternative explanations is a matter for scientists, not for statisticians. Consequently it is also not a part of the statistician's job to formalise power relative to various alternatives. And because the statistician is advising on suitable inference, not suitable behaviour, no special action is required after a statistical analysis, however it turns out.
In contrast, for NP the entire point is to tell scientists what to do, using a procedure that has some 'long run' guarantee of telling them something sensible - in this case, to accept one or other hypothesis. But unless we are studying industrial quality control - something NP apparently really liked to think about - then this 'long run' of trials will usually be imaginary.
Doing the dishes versus doing these dishes
Because NP think about procedures rather than particular trials, Fisherian significance levels do not matter to them. That's because the statistical guarantees about different types of errors hold for all the trials in a (maybe imaginary) 'long run' but not for the results of any one of them that managed to burst into actuality.
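One way to see what a 'long run' guarantee does and does not promise is to simulate one. The sketch below assumes the null is true in every trial (mean 0, standard deviation 1) and uses a one-sided z-test; the function name, seed, and trial count are arbitrary illustrative choices.

```python
import random
from math import sqrt
from statistics import NormalDist

def simulate_long_run(trials=10_000, n=25, alpha=0.05, seed=1):
    """Run the same one-sided z-test on many fresh samples drawn with the null true."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    p_values = []
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        z = (sum(sample) / n) * sqrt(n)          # known sigma = 1, null mean = 0
        p_values.append(1 - std_normal.cdf(z))
    rejection_rate = sum(p < alpha for p in p_values) / trials
    return rejection_rate, p_values

rate, p_values = simulate_long_run()
print(f"fraction of trials rejecting the null: {rate:.3f}")
print(f"smallest and largest single-trial p: {min(p_values):.4f}, {max(p_values):.4f}")
```

The rejection fraction should settle close to the chosen alpha, which is the property of the procedure that NP care about. The individual values in `p_values` are scattered across the whole unit interval, and the guarantee says nothing about any particular one of them.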
We might summarise by saying that while Fisher is happy to advise on the analysis of any particular experiment, he'd rather leave the evaluation of the general research program to the scientists involved, whereas NP are happy to advise on the evaluation of a whole grant full of experiments, but prefer not to make any claims about any one of them. Unsurprisingly, significance is a reasonable concept for Fisher's purposes and power is a reasonable concept for NP's purposes. But they are not the same purposes.
Looking at the off-diagonal cases in our exposition, Fisher could reasonably complain that a pre-experimentally determined 'long run' guarantee that we won't often mistakenly act as if we've ruled out some hypothesis doesn't tell us what we want to know about the experiment we've actually just done. In the jargon: alpha is not p (more on this below). And NP could reasonably complain that calculating the power of an experiment after it has actually happened makes no sense (although you can always find someone who does).
Both parties could of course also unreasonably complain. Indeed, that seems to have been a speciality of this relationship. Who knew they'd be posthumously married?
In short, both parties have views that are internally consistent: observed significance can be useful when we're trying to infer something after a particular set of observations, but only error rates, and so power, matter if we're deciding what to do on the basis of a pile of them.
Lest we draw this distinction too sharply, it seems that the historical Fisher would prefer us to change our behaviour only after a scientifically convincing sequence of actual tests and actual rejections. And as it happens, any respectable null hypothesis also makes predictions about what would happen in an infinite and thus non-actual sequence of tests. But arguably that is a happy accident, not the main point.
On the other side of things, it must be noted that NP have a pre-experimental analogue of statistical significance: alpha. Just as one minus power is the probability of accepting the null hypothesis when actually the alternate is true, alpha is the corresponding probability of accepting the alternate hypothesis when the null is true. Both are fixed before any data are seen, whereas the p-value is computed from the observations afterwards. Obviously alpha and the p-value are different.
I have muttered elsewhere, and to any students who would listen, that the root of much of the controversy surrounding applied statistics is the habit of computing p then pretending it was alpha. But to be fair, current advice is a mess.
What many students learn is that they should report (Fisherian) significance, but reject the hypothesis only if it reaches an (NP-style) threshold imposed beforehand. Once they've done this they should act (like NP) as though the hypothesis is false and write a paper where they report (Fisher-style) every different parameter they tested. If they are unlucky the work will be criticised (NP-style) for having low power, but it might still squeak through if their (Fisherian) significance levels are small enough. In short, if you don't get what you want from one parent, try the other.
Implications for Science
An interesting twist occurs when we consider researchers and journal editors. We want to understand what we are studying and can run tests as we like with this purpose. The editor needs to know whether to publish something that has been submitted. For us, a Fisherian approach might be appropriate, but the editor really does have an NP style decision problem. The 'long run' for an editor is half a dozen articles per issue, for four issues a year, for as long as people keep sending things.