The reliability of a result (usually an empirical result, such as a relation between variables found in a specific sample) pertains to how well that result represents the entire population. In other words, it indicates how likely it is that a similar result would be found if the data collection procedure (e.g., an experiment) were replicated with other samples drawn from the same population. We are rarely “ultimately” interested in what is going on in one specific sample; we are interested in the sample only to the extent that it can provide information about the population. If the study meets specific criteria (those that allow us to apply the methods of statistical induction), then the reliability of a relation between variables observed in that sample can be quantitatively estimated and represented using a standard measure, technically called the p-value, which is compared against a chosen statistical significance level. Strictly speaking, the p-value is the probability of observing a relation at least as strong as the one found in the sample if no such relation existed in the population; it is not itself the probability that a replication would produce the same result.
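
As a rough illustration of how such a p-value can be obtained, the sketch below uses a permutation test on a correlation. This is only one of several possible procedures and is not necessarily the one the text has in mind. The function names (`pearson_r`, `permutation_p_value`), the sample size, and the simulated data are all hypothetical choices made for the example.

```python
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def permutation_p_value(xs, ys, n_permutations=2000, seed=0):
    """Estimate a two-sided p-value for the correlation between xs and ys.

    Shuffling ys breaks any real link with xs, so the shuffled correlations
    show what chance alone would produce if no relation existed.
    """
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    # The +1 terms count the observed arrangement itself, so p is never 0.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical sample: 30 cases in which y depends partly on x.
rng = random.Random(42)
x = [rng.gauss(0, 1) for _ in range(30)]
y = [0.5 * xi + rng.gauss(0, 1) for xi in x]

print("sample correlation:", round(pearson_r(x, y), 3))
print("permutation p-value:", round(permutation_p_value(x, y), 4))
```

A small p-value here means that a correlation as strong as the observed one rarely arises when the pairing between the two variables is random. It says nothing by itself about how large or important the relation is.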

A reliable measure is not necessarily a valid measure; some extremely reliable measures have very low validity. For example, a person's height might be a highly reliable measure (i.e., it is the same across repeated measurements) but not a very valid measure of that person's weight, because systematic factors other than height also determine weight. On the other hand, the reading on a bathroom scale placed on a soft surface (e.g., a carpet) may not be a highly reliable measure of weight (because the reading will vary if the measurement is repeated for the same person), but it may be fairly valid: the outcome depends systematically on the person's true weight, and differences between repeated measurements taken this way are caused mostly by random error (e.g., the angle at which the scale tilts on the soft surface).
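
The contrast between the two measures can be made concrete with a small simulation. The sketch below is built entirely on assumed numbers: the height and weight distributions, the weak dependence of weight on height, and the size of the carpet-induced error are hypothetical and chosen only to mirror the example above. Reliability is operationalized here as the correlation between two repeated measurements, and validity as the correlation between a measurement and the true weight; both are simplifying assumptions.

```python
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def simulate(n_people=2000, seed=1):
    """Compare height and a wobbly scale as measures of true weight."""
    rng = random.Random(seed)

    # Assumed population: heights in cm; weight in kg depends only weakly
    # on height and mostly on other (here unmodeled) systematic factors.
    heights = [rng.gauss(170, 8) for _ in range(n_people)]
    weights = [70 + 0.3 * (h - 170) + rng.gauss(0, 10) for h in heights]

    # Height measured twice: the same value each time, as in the text.
    height_1 = heights
    height_2 = list(heights)

    # Scale on a carpet measured twice: true weight plus random error.
    scale_1 = [w + rng.gauss(0, 8) for w in weights]
    scale_2 = [w + rng.gauss(0, 8) for w in weights]

    return {
        "height": {
            "reliability": pearson_r(height_1, height_2),
            "validity": pearson_r(height_1, weights),
        },
        "scale": {
            "reliability": pearson_r(scale_1, scale_2),
            "validity": pearson_r(scale_1, weights),
        },
    }


for name, stats in simulate().items():
    print(name,
          "reliability:", round(stats["reliability"], 2),
          "validity:", round(stats["validity"], 2))
```

Under these assumptions, height comes out as perfectly repeatable yet only weakly related to weight, while the scale reading varies between repetitions yet tracks the true weight much more closely. Changing the assumed error sizes shifts the exact figures but not the direction of the contrast.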