Let us assume that we have 2 observations for each of 5 persons: each person's body weight as perceived and judged by two 'raters' or judges, x1 and x2.
The data are as follows:
person x1 x2
1 60 62
2 70 68
3 70 71
4 65 65
5 65 63
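For readers who want to verify the numbers below without SPSS, here is a minimal Python sketch (Python and numpy are my own choice of tool, not part of the original example) that enters these data and reproduces the means and standard deviations used further on:

import numpy as np

# perceived body weight of 5 persons, judged by two raters x1 and x2
x1 = np.array([60, 70, 70, 65, 65], dtype=float)
x2 = np.array([62, 68, 71, 65, 63], dtype=float)

print(x1.mean(), x2.mean())                                 # 66.0 and 65.8
print(round(x1.std(ddof=1), 5), round(x2.std(ddof=1), 5))   # 4.1833 and 3.70135 (sample standard deviations)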
Because we have only two measures (variables), there is only one pair of measures to compare in this example. Very often, however, there are more than two judges involved, and hence many more pairs.
First, let us calculate the correlation between these two variables x1 and x2. This can be done in SPSS with the Correlations command (Analyze > Correlate > Bivariate, check Pearson correlation coefficient).
This yields r=.904, and the average r (over 1 pair of judges) is the same.
If you need to compute r manually, one method is to first convert x1 and x2 to Z-values [(x-mean)/s], yielding z1 and z2. Then r = SUM(z1×z2) / (n-1).
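As a sketch of this manual route in Python (numpy assumed, as above):

import numpy as np

x1 = np.array([60, 70, 70, 65, 65], dtype=float)
x2 = np.array([62, 68, 71, 65, 63], dtype=float)
n = len(x1)

# convert each variable to z-scores, using the sample standard deviation (ddof=1)
z1 = (x1 - x1.mean()) / x1.std(ddof=1)
z2 = (x2 - x2.mean()) / x2.std(ddof=1)

# Pearson r as the average product of z-scores over n-1
r = np.sum(z1 * z2) / (n - 1)
print(round(r, 3))   # 0.904

The built-in np.corrcoef(x1, x2)[0, 1] gives the same value.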
With N=2 judges, this value of r corresponds to a Cronbach's Alpha of (2×.904)/(1+.904) = .950 according to the formula based on the average correlation. Computed from the raw item variances, as SPSS does, Alpha comes out slightly lower here, at .946, because the two judges do not have exactly the same variance.
Cronbach's Alpha can be obtained in SPSS by choosing Analyze > Scale > Reliability Analysis. Select the "items" (or judges) x1 and x2, and select model Alpha.
The output states: Reliability Coefficients [over] 2 items, Alpha = .9459 [etc.]
If the same average correlation r=.904 had been observed over 4 judges (i.e. over 4×3 pairs of judges), then that would have indicated an even higher inter-rater reliability, viz. alpha = (4×.904)/(1+3×.904) = .974.
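The sketch below (numpy assumed; the intermediate variable names are mine) first computes Cronbach's Alpha from the raw data, and then applies the projection formula k×r/(1+(k-1)×r) for k = 2 and k = 4 judges with average correlation r = .904:

import numpy as np

x1 = np.array([60, 70, 70, 65, 65], dtype=float)
x2 = np.array([62, 68, 71, 65, 63], dtype=float)

# Cronbach's Alpha from the raw item (judge) variances
items = np.vstack([x1, x2])                      # one row per judge
k = items.shape[0]
sum_item_vars = items.var(axis=1, ddof=1).sum()  # 17.5 + 13.7 = 31.2
total_var = items.sum(axis=0).var(ddof=1)        # var(x1 + x2) = 59.2
alpha = (k / (k - 1)) * (1 - sum_item_vars / total_var)
print(round(alpha, 4))                           # 0.9459

# projected Alpha for k judges with the same average correlation r
r = 0.904
for k_judges in (2, 4):
    print(k_judges, f"{k_judges * r / (1 + (k_judges - 1) * r):.3f}")   # 0.950 and 0.974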
Exactly the same reasoning applies if the data are not provided by 2 raters judging the same 5 objects, but by 2 test items "judging" a property of the same 5 persons. Both approaches are common in language research. Although SPSS only speaks of items and inter-item reliability, the analysis is equally applicable to raters or judges, and hence to inter-rater reliability.
Note that both judges (items) may be inaccurate. A priori, we do not know how good each judge is, nor which judge is better. We do know, however, that the reliability with which they judge the same thing (true body weight, we hope) increases with their mutual correlation.
Now let us consider the same data, but in a different context. We have one measuring instrument for the abstract concept x that we are trying to measure. The same 5 objects are measured twice (test-retest), yielding the data given above. In this test-retest context there is always just one correlation, and the idea of inter-rater reliability does not apply. We find that rxx = .904.
This reliability coefficient is r = s²T / s²x. It provides an estimate of how much of the total variance is due to variance in the underlying, unknown, "true" scores. In this example, 90.4% of the total variance is estimated to be due to variance of the true scores. The complementary part, 9.6% of the total variance, is estimated to be due to measurement error. If there were no measurement error, we would predict a perfect correlation (r=1) between x1 and x2; if the measurements contained only error (and no true-score component at all), we would predict a zero correlation (r=0).
In this example, we find that
se = sx × sqrt(1-.904) = sqrt(15.484) × sqrt(.096) = 1.219
check: s²x = 15.484 = s²T + s²e = s²T + (1.219)²,
so s²T = 15.484 - 1.486 = 13.997,
and indeed r = .904 = s²T / s²x = 13.997 / 15.484.
By assumption, x1 and x2 measure the same property x. To obtain s²x, the total observed variance of x (as needed above), we can use neither x1 exclusively nor x2 exclusively. The total variance is therefore obtained here from the two standard deviations:
s²x = sx1 × sx2 = 4.18330 × 3.70135 = 15.484
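The decomposition above can be checked with the following sketch (numpy assumed); note that sx1 × sx2 is the geometric mean of the two variances:

import numpy as np

x1 = np.array([60, 70, 70, 65, 65], dtype=float)
x2 = np.array([62, 68, 71, 65, 63], dtype=float)

r = 0.904
s2x = x1.std(ddof=1) * x2.std(ddof=1)   # total variance of x: 4.18330 × 3.70135 = 15.484
se  = np.sqrt(s2x) * np.sqrt(1 - r)     # standard error of measurement: 1.219
s2T = s2x - se**2                       # true-score variance: 13.997
print(round(s2x, 3), round(se, 3), round(s2T, 3), round(s2T / s2x, 3))
# 15.484 1.219 13.997 0.904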
In general, a reliability coefficient smaller than .5 is regarded as low, between .5 and .8 as moderate, and over .8 as high.