Reliability and Item Analysis Introductory Overview - Designing a Reliable Scale

After the discussion so far, it should be clear that the more reliable a scale, the better (e.g., more valid) the scale. As mentioned earlier, one way to make a sum scale more valid is by adding items. Reliability and Item Analysis methods include options that let you compute how many items would have to be added in order to achieve a particular reliability, or how reliable the scale would be if a certain number of items were added. However, in practice, the number of items on a questionnaire is usually limited by various other factors (e.g., respondents get tired, overall space is limited, etc.). Let us return to our prejudice example and outline the steps one would generally follow in order to design the scale so that it will be reliable:
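The relationship between scale length and reliability is given by the Spearman-Brown prophecy formula, which is what such computations rest on. A minimal sketch in Python (the function names are ours, for illustration only):

```python
def spearman_brown(reliability: float, k: float) -> float:
    """Predicted reliability of a scale lengthened by factor k
    (Spearman-Brown prophecy formula)."""
    return k * reliability / (1 + (k - 1) * reliability)

def length_factor(reliability: float, target: float) -> float:
    """Factor by which a scale must be lengthened to reach `target` reliability."""
    return target * (1 - reliability) / (reliability * (1 - target))

# A 10-item scale with reliability .79 would have to grow by a factor of
# about 2.4 (i.e., to roughly 24 items) to reach a reliability of .90:
print(length_factor(0.79, 0.90))
```

Note that the formula assumes the added items are statistically parallel to the existing ones; items of lower quality will yield less improvement than predicted.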

Step 1: Generating items. The first step is to write the items. This is essentially a creative process where the researcher makes up as many items as possible that seem to relate to prejudices against foreign-made cars. In theory, one should "sample items" from the domain defined by the concept. In practice, for example in marketing research, focus groups are often utilized to illuminate as many aspects of the concept as possible. For example, we could ask a small group of highly committed American car buyers to express their general thoughts and feelings about foreign-made cars. In educational and psychological testing, one commonly looks at other similar questionnaires at this stage of the scale design, again, in order to gain as wide a perspective on the concept as possible.

Step 2: Choosing items of optimum difficulty. In the first draft of our prejudice questionnaire, we will include as many items as possible (note that the Reliability and Item Analysis module will handle up to 300 items in a single scale). We then administer this questionnaire to an initial sample of typical respondents, and examine the results for each item. First, we would look at various characteristics of the items, for example, in order to identify floor or ceiling effects. If all respondents agree or disagree with an item, then it obviously does not help us discriminate between respondents, and thus, it is useless for the design of a reliable scale. In test construction, the proportion of respondents who agree or disagree with an item, or who answer a test item correctly, is often referred to as the item difficulty. In essence, we would look at the item means and standard deviations and eliminate those items that show extreme means, and zero or nearly zero variances.
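The screening described above amounts to inspecting item means and variances and flagging extremes. A sketch with NumPy, using simulated data and hypothetical cutoffs (the thresholds below are illustrative choices, not fixed rules):

```python
import numpy as np

# Hypothetical data: 100 respondents rating 10 items on a 1-9 agreement scale.
rng = np.random.default_rng(0)
responses = rng.integers(1, 10, size=(100, 10)).astype(float)
responses[:, 4] = 9.0  # simulate a ceiling-effect item: everyone fully agrees

means = responses.mean(axis=0)
stds = responses.std(axis=0, ddof=1)

# Flag items with (nearly) zero variance or means pinned near the scale ends.
scale_min, scale_max = 1, 9
flagged = [i for i in range(responses.shape[1])
           if stds[i] < 0.5 or means[i] < scale_min + 1 or means[i] > scale_max - 1]
print(flagged)  # item 4 (zero variance, mean at the ceiling) is flagged
```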

Step 3: Choosing internally consistent items. Remember that a reliable scale is made up of items that proportionately measure mostly true score; in our example, we would like to select items that measure mostly prejudice against foreign-made cars, and few esoteric aspects we consider random error. To do so, we would look at the following spreadsheet:

STATISTICA Reliability Analysis

Summary for scale: Mean=46.1100  Std.Dv.=8.26444  Valid n: 100
Cronbach alpha: .794313   Standardized alpha: .800491
Average inter-item corr.: .297818

variable   Mean if    Var. if    StDv. if   Itm-Totl   Squared    Alpha if
           deleted    deleted    deleted    Correl.    Multp. R   deleted
ITEM1      41.61000   51.93790   7.206795   .656298    .507160    .752243
ITEM2      41.37000   53.79310   7.334378   .666111    .533015    .754692
ITEM3      41.41000   54.86190   7.406882   .549226    .363895    .766778
ITEM4      41.63000   56.57310   7.521509   .470852    .305573    .776015
ITEM5      41.52000   64.16961   8.010593   .054609    .057399    .824907
ITEM6      41.56000   62.68640   7.917474   .118561    .045653    .817907
ITEM7      41.46000   54.02840   7.350401   .587637    .443563    .762033
ITEM8      41.33000   53.32110   7.302130   .609204    .446298    .758992
ITEM9      41.44000   55.06640   7.420674   .502529    .328149    .772013
ITEM10     41.66000   53.78440   7.333785   .572875    .410561    .763314

Shown above are the results for 10 items, which are discussed in greater detail in the Examples. Of most interest to us are the three right-most columns in this spreadsheet. They show the correlation between the respective item and the total sum score (without that item), the squared multiple correlation between the respective item and all others, and the internal consistency of the scale (coefficient alpha) if the respective item were deleted. Clearly, items 5 and 6 "stick out," in that they are not consistent with the rest of the scale. Their correlations with the sum scale are .05 and .12, respectively, while all other items correlate at .47 or better. In the right-most column, we can see that the reliability of the scale would be about .82 if either of those two items were deleted. Thus, we would probably delete the two items from this scale.
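The key quantities in the spreadsheet, Cronbach's alpha, the corrected item-total correlation, and alpha-if-deleted, can be computed directly from the item-score matrix. A self-contained sketch (these functions are our own illustrations, not part of STATISTICA):

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an n_respondents x n_items score matrix."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

def item_total_stats(items: np.ndarray):
    """Per item: corrected item-total correlation (item vs. the sum of all
    other items) and Cronbach's alpha with that item deleted."""
    stats = []
    for i in range(items.shape[1]):
        rest = np.delete(items, i, axis=1)
        r = np.corrcoef(items[:, i], rest.sum(axis=1))[0, 1]
        stats.append((r, cronbach_alpha(rest)))
    return stats
```

An item with a low corrected item-total correlation and an alpha-if-deleted above the scale's current alpha (as for items 5 and 6 above) is a candidate for removal.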

Step 4: Returning to Step 1. After deleting all items that are not consistent with the scale, we may not be left with enough items to make up an overall reliable scale (remember that the fewer the items, the less reliable the scale). In practice, one often goes through several rounds of generating and eliminating items until one arrives at a final set that makes up a reliable scale.
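The elimination half of this cycle can be automated as a greedy loop: repeatedly drop whichever item raises alpha the most, and stop when no single deletion helps. A sketch (alpha is redefined here so the example is self-contained; the stopping rule is one reasonable choice, not the only one):

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an n_respondents x n_items score matrix."""
    k = items.shape[1]
    return k / (k - 1) * (1 - items.var(axis=0, ddof=1).sum()
                          / items.sum(axis=1).var(ddof=1))

def prune_items(items: np.ndarray, cols) -> list:
    """Greedily drop the item whose removal most improves alpha;
    stop when no single deletion raises alpha further."""
    cols = list(cols)
    while len(cols) > 2:
        current = cronbach_alpha(items[:, cols])
        trials = [(cronbach_alpha(items[:, [c for c in cols if c != i]]), i)
                  for i in cols]
        best_alpha, worst_item = max(trials)
        if best_alpha <= current:
            break
        cols.remove(worst_item)
    return cols
```

Automated pruning capitalizes on chance in the calibration sample, which is one reason the text recommends returning to Step 1 and re-checking the surviving items on fresh respondents.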