Power Analysis and Sample Size Calculation in Experimental Design

Sampling Theory and Hypothesis Testing Logic
In most situations in statistical analysis, we do not have access to
an entire statistical population of interest, either because the population
is too large, is not willing to be measured, or the measurement process
is too expensive or time-consuming to allow more than a small segment
of the population to be observed. As a result, we often make important
decisions about a statistical population on the basis of a relatively
small amount of sample data. Typically, we take a sample and compute a
quantity called a statistic in order to estimate some characteristic of
a population called a parameter.
For example, suppose a politician is interested in the proportion of
people who currently favor her position on a particular issue. Her constituency
is a large city with a population of about 1,500,000 potential voters.
In this case, the parameter of interest, which we might call P,
is the proportion of people in the entire population who favor the politician's
position. The politician is going to commission an opinion poll, in which
a (hopefully) random sample of people will be asked whether or not they
favor her position. The number (N)
of people to be polled will be quite small, relative to the size of the
population. Once these people have been polled, the proportion of them
favoring the politician's position will be computed. This proportion,
which is a statistic, can be called p.
One thing is virtually certain before the study is ever performed: The
population proportion (P) will
not be equal to the sample proportion (p).
Because the sample proportion (p)
involves "the luck of the draw," it will deviate from the population
proportion (P). The amount by
which the sample proportion (p)
is wrong, i.e., the amount by which it deviates from the population proportion
(P), is called sampling
error.
In any one sample, it is virtually certain there will be some sampling
error (except in some highly unusual circumstances), and that we will
never be certain exactly how large this error is. If we knew the amount
of the sampling error, this would imply that we also knew the exact value
of the parameter, in which case we would not need to be doing the opinion
poll in the first place.
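The idea can be made concrete with a small simulation sketch. The true proportion P = 0.55 and the poll size N = 100 below are made-up values for illustration; in a real poll, P is exactly what we do not know:

```python
import random

def poll(P, N, seed=0):
    """Simulate polling N voters from a population in which a true
    proportion P favors the position; return the sample proportion p."""
    rng = random.Random(seed)
    favorable = sum(1 for _ in range(N) if rng.random() < P)
    return favorable / N

P = 0.55            # true population proportion (unknown in practice)
p = poll(P, N=100)  # the statistic computed from one poll
sampling_error = p - P
print(f"p = {p:.3f}, sampling error = {sampling_error:+.3f}")
```

Re-running with different seeds gives a different p each time; the spread of those values around P is precisely the sampling error discussed above.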
In general, the larger the sample size N,
the smaller sampling error tends to be. (One can never be sure what will
happen in a particular experiment, of course.) If we are to make accurate
decisions about a parameter like P, we need to have an N
large enough so that sampling error will tend to be "reasonably small."
If N is too small, there is not
much point in gathering the data, because the results will tend to be
too imprecise to be of much use.
On the other hand, there is also a point of diminishing returns beyond
which increasing N provides little
benefit. Once N is "large
enough" to produce a reasonable level of accuracy, making it larger
simply wastes time and money.
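The "diminishing returns" can be quantified. For a proportion, the standard error of p is sqrt(P(1 - P)/N), so precision improves only with the square root of N. A brief sketch, using the worst case P = 0.5:

```python
import math

def std_error(P, N):
    """Standard error of a sample proportion: sqrt(P(1 - P) / N)."""
    return math.sqrt(P * (1 - P) / N)

# Quadrupling N only halves the standard error.
for N in (100, 400, 1600, 6400):
    print(f"N = {N:5d}  SE = {std_error(0.5, N):.4f}")
```

Going from N = 100 to N = 400 halves the standard error, but so does going from N = 1,600 to N = 6,400, at sixteen times the cost per halving.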
So some key decisions in planning any experiment are, "How precise
will my parameter estimates tend to be if I select a particular sample
size?" and "How big a sample do I need to attain a desirable
level of precision?"
The purpose of the Power Analysis module is to provide
you with the statistical methods to answer these questions quickly, easily,
and accurately. The module provides simple dialogs for performing power
calculations and sample size estimation for many of the classic statistical
procedures, and it also provides special noncentral distribution routines
to allow the advanced user to perform a variety of additional calculations.
Suppose that the politician was interested in showing that a majority
of people supported her position. Her question, in statistical
terms: "Is P > .50?"
Being an optimist, she believes that it is.
In statistics, the following strategy is quite common. State as a "statistical
null hypothesis" something that is the logical opposite of what you
believe. Call this hypothesis H0.
Gather data. Then, using statistical theory, show from the data that it
is likely H0 is false, and should
be rejected.
By rejecting H0, you support
what you actually believe. This kind of situation, which is typical in
many fields of research, is called "Reject-Support testing"
(RS testing), because rejecting the null hypothesis supports the experimenter's
theory.
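For the politician's poll, this strategy can be sketched with an exact binomial tail probability for H0: P = .50 against the one-sided alternative P > .50. The poll outcome below (61 of 100 favorable) is a hypothetical number chosen for illustration:

```python
from math import comb

def binom_upper_tail(k, n, p0=0.5):
    """P(X >= k) when X ~ Binomial(n, p0): the one-sided p-value
    for observing k or more successes under H0: P = p0."""
    return sum(comb(n, i) * p0**i * (1 - p0)**(n - i)
               for i in range(k, n + 1))

n, k = 100, 61            # hypothetical poll: 61 of 100 favor the position
p_value = binom_upper_tail(k, n)
print(f"one-sided p-value = {p_value:.4f}")
if p_value < 0.05:
    print("Reject H0: the data support P > .50")
```

If H0 were true (P = .50), a result this extreme would be rare, so observing it is taken as evidence against H0 and therefore in support of the politician's belief.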
The null hypothesis is either true or false, and the statistical decision
process is set up so that there are no "ties." The null hypothesis
is either rejected or not rejected. Consequently, before undertaking the
experiment, we can be certain that only 4 possible things can happen.
These are summarized in the table below.

                              State of the World
                        H0 true                H1 true
  Decision
    Accept H0      Correct acceptance      Type II error (β)
    Reject H0      Type I error (α)        Correct rejection

Note that there are two kinds of errors represented in the table. Many
statistics textbooks present a point of view that is common in the social
sciences, i.e., that α, the Type I error rate, must be kept at or below
.05, and that, if at all possible, β, the Type II error rate, must be
kept low as well. "Statistical power," which is equal to 1 -
β, must be kept correspondingly high. Ideally, power should be at least
.80 to detect a reasonable departure from the null hypothesis.
The conventions are, of course, much more rigid with respect to α than
with respect to β. For example, in the social sciences seldom, if ever,
is α allowed to stray above the magical .05 mark. Let's review where that
tradition came from.
In the context of significance testing, we can define two basic kinds
of situations, reject-support (RS) (discussed above) and accept-support
(AS). In RS testing, the null hypothesis is the opposite of what the researcher
actually believes, and rejecting it supports the researcher's theory.
In a two-group RS experiment involving comparison of the means of an experimental
and control group, the experimenter believes the treatment has an effect,
and seeks to confirm it through a significance test that rejects the null
hypothesis.
In the RS situation, a Type I error represents, in a sense, a
"false positive" for the researcher's theory. From society's
standpoint, such false positives are particularly undesirable. They result
in much wasted effort, especially when the false positive is interesting
from a theoretical or political standpoint (or both), and as a result
stimulates a substantial amount of research. Such followup research will
usually not replicate the (incorrect) original work, and much confusion
and frustration will result.
In RS testing, a Type II error is a tragedy from the researcher's standpoint,
because a theory that is true is, by mistake, not confirmed. So, for example,
if a drug designed to improve a medical condition is found (incorrectly)
not to produce an improvement relative to a control group, a worthwhile
therapy will be lost, at least temporarily, and an experimenter's worthwhile
idea will be discounted.
As a consequence, in RS testing, society, in the person of journal editors
and reviewers, insists on keeping α low. The statistically well-informed
researcher makes it a top priority to keep β low as well. Ultimately,
of course, everyone benefits if both error probabilities are kept low,
but unfortunately there is often, in practice, a trade-off between the
two types of error.
The RS situation is by far the more common one, and the conventions
relevant to it have come to dominate popular views on statistical testing.
As a result, the prevailing views on error rates are that relaxing α beyond
As a result, the prevailing views on error rates are that relaxing a beyond
a certain level is unthinkable, and that it is up to the researcher to
make sure statistical power is adequate. One might argue how appropriate
these views are in the context of RS testing, but they are not altogether
unreasonable.
In AS testing, the common view on error rates we described above is
clearly inappropriate. In AS testing, H0 is what the researcher actually
believes, so accepting it supports the researcher's theory. In this case,
a Type I error is a false negative for the researcher's theory, and a
Type II error constitutes a false positive. Consequently, acting in a
way that might be construed as highly virtuous in the RS situation, for
example, maintaining a very low Type I error rate like .001, is actually
"stacking the deck" in favor of the researcher's theory in AS
testing.
In both AS and RS situations, it is easy to find examples where significance
testing seems strained and unrealistic. Consider first the RS situation.
In some such situations, it is simply not possible to have very large
samples. An example that comes to mind is social or clinical psychological
field research. Researchers in these fields sometimes spend several days
interviewing a single subject. A year's research may only yield valid
data from 50 subjects. Correlational tests, in particular, have very low
power when samples are that small. In such a case, it probably makes sense
to relax α beyond .05, if it means that reasonable power can be achieved.
On the other hand, it is possible, in an important sense, to have power
that is too high. For example, one might be testing the hypothesis that
two population means are equal (i.e., μ1 = μ2) with sample sizes of
a million in each group. In this case, even with trivial differences between
groups, the null hypothesis would virtually always be rejected.
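The point can be illustrated with the usual two-sample z statistic for a difference between means. The trivial effect size (0.01 standard deviations) and the sample sizes below are made-up values:

```python
import math

def z_two_sample(effect_sd, n_per_group):
    """z statistic for a difference between two group means expressed
    in standard-deviation units, with equal group sizes and sd = 1."""
    return effect_sd / math.sqrt(2.0 / n_per_group)

# A trivial true difference of 0.01 standard deviations:
for n in (1_000, 1_000_000):
    z = z_two_sample(0.01, n)
    verdict = "significant" if abs(z) > 1.96 else "not significant"
    print(f"n per group = {n:9,d}  z = {z:6.3f}  ({verdict})")
```

With a million cases per group, even a difference of one hundredth of a standard deviation yields a z of about 7, far beyond the .05 critical value of 1.96, so the null hypothesis is rejected despite the effect being practically negligible.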
The situation becomes even more unnatural in AS testing. Here, if N
is too high, the researcher almost inevitably decides against the theory,
even when it turns out, in an important sense, to be an excellent approximation
to the data. It seems paradoxical indeed that in this context experimental
precision seems to work against the researcher.
To summarize, in Reject-Support research:
The researcher wants to
reject H0.
Society wants to control
Type I error.
The researcher must be very
concerned about Type II error.
High sample size works for
the researcher.
If there is "too much
power," trivial effects become "highly significant."
In Accept-Support research:
The researcher wants to
accept H0.
"Society" should
be worrying about controlling Type II error, although it sometimes gets
confused and retains the conventions applicable to RS testing.
The researcher must be very
careful to control Type I error.
High sample size works against
the researcher.
If there is "too much
power," the researcher's theory can be "rejected" by a
significance test even though it fits the data almost perfectly.