In my work reviewing international development evaluations, I have seen many evaluation questions posed that the data simply cannot answer. When the data can help inform a decision, it’s a shame to see that opportunity wasted because of inadequate methods. Often, report writers and analysts face a tradeoff: when they use a well-trod analytical technique, clients are more comfortable with the familiar approach, but familiar techniques face familiar limitations.

I plan to write several posts covering these techniques, starting with Cronbach’s Alpha. Social scientists use Cronbach’s Alpha all the time, even though statisticians have published paper (PDF) after paper after paper arguing that it has limited usefulness and proposing alternatives. Even the Wikipedia article on Cronbach’s Alpha essentially amounts to a polemic against it (“this article’s tone or style may not reflect the encyclopedic tone used on Wikipedia”—no kidding).1

Yet, most of the evaluation reports I review that contain either test questions or a psycho-social construct use Cronbach’s Alpha as evidence of unidimensionality, inevitably with the magical \(\alpha > 0.7\) criterion.

So why is this a mistake? And, more importantly, what can researchers do instead? Let’s take a look. I’ll keep the probability theory (and worse, reliability theory and classical test theory) to a minimum here; perhaps one reason researchers keep using Cronbach’s Alpha is that many arguments against it seem esoteric.

What is Cronbach’s Alpha, and why do people use it?

Mathematically speaking, Cronbach’s Alpha is a function of the covariances among a set of items. Its calculation is simple: square the number of items, multiply by the average of the inter-item covariances, and divide by the sum of the item variances and inter-item covariances (counting covariances in both directions, so the denominator sums all \(n^2\) entries of the covariance matrix):

\[\alpha = \frac{n^2 \cdot \text{mean}(\text{covariances})}{\text{sum}(\text{variances and covariances})}\]
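
For concreteness, here is a minimal sketch of that calculation in Python with NumPy (the function name and the respondents-by-items layout are my own choices, not from any particular package):

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's Alpha for a (respondents x items) array of scores."""
    items = np.asarray(items, dtype=float)
    n = items.shape[1]
    cov = np.cov(items, rowvar=False)        # n x n covariance matrix of the items
    off_diag = cov[~np.eye(n, dtype=bool)]   # inter-item covariances, counted in both directions
    # n^2 times the mean inter-item covariance, over the sum of every
    # entry of the covariance matrix (all variances and covariances)
    return n ** 2 * off_diag.mean() / cov.sum()
```

This is algebraically the same as the textbook form \(\alpha = \frac{n}{n-1}\left(1 - \frac{\sum_i \sigma^2_i}{\sigma^2_X}\right)\), where \(\sigma^2_X\) is the variance of the total score (the sum over all items).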

Researchers typically use Cronbach’s Alpha when they want to know how much variance is item-specific and not shared by other items in the set. To a first approximation, this makes intuitive sense: holding all else equal, as the inter-item covariances increase, alpha increases, and as the item variances increase (but the covariances do not), alpha decreases.

This intuitive feel, along with the frequently cited 0.7 reliability “cutoff point” for alpha (not to mention the ease of pushing a button in statistical software that will calculate alpha for you), has made alpha very popular. In development evaluations, researchers use alpha both during instrument piloting and in the final report, with the idea that a high alpha means the set of items must measure one thing (and that one thing, the inference goes, is the very construct we intend it to measure).

Some facts about Cronbach’s Alpha

I don’t want to sound too glib here. Often, when you look at the questions on the test or the items used for the construct in question, they seem well designed, and it is plausible that they are unidimensional. Yet even when a test is unidimensional, with strongly and consistently correlated items, alpha cannot confirm this (not to mention that alpha can also underestimate reliability).

Alpha increases with more items

This should be obvious from the formula, but I have seen a report calculate alpha for 25 items, practically guaranteeing a high alpha even with quite a bit of item-specific variance. At the very least, you should increase what you think of as a “high” alpha value as the number of items increases (say, with more than 4–6 items); simulations would probably help establish better rules of thumb here.
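
To see the size of the effect, here is a small sketch using the standardized form of alpha, which assumes equal item variances, so that alpha depends only on the number of items \(k\) and the average inter-item correlation \(r\):

```python
# Standardized alpha for k items with mean inter-item correlation r.
r = 0.3  # a modest, fixed average inter-item correlation
for k in (4, 6, 10, 25):
    alpha = k * r / (1 + (k - 1) * r)
    print(f"{k:>2} items: alpha = {alpha:.2f}")
```

Holding the same modest correlation of 0.3 fixed, alpha climbs from about 0.63 with 4 items to about 0.91 with 25, while telling you very little more about item quality in the longer instrument.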

Alpha is blind to dimensionality

If a few separate dimensions are internally correlated (but orthogonal to each other) and do not introduce item-specific variance, alpha will be high. This has practical implications: if a few questions on a test introduce substantial measurement error in a correlated way (because of some shared underlying factor), then alpha will increase, leading to erroneous conclusions about how well the test measures the intended construct.
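
As a quick illustration with made-up data, reusing the cronbach_alpha sketch from above: simulate two orthogonal latent factors, each driving six items, and alpha still comes out around 0.9 even though the item set is plainly two-dimensional.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
factor_a = rng.normal(size=(n, 1))    # first latent dimension
factor_b = rng.normal(size=(n, 1))    # second dimension, orthogonal to the first
noise = 0.3 * rng.normal(size=(n, 12))

# Six items load only on factor A, six only on factor B.
items = np.hstack([np.tile(factor_a, 6), np.tile(factor_b, 6)]) + noise

print(cronbach_alpha(items))          # roughly 0.9, despite two orthogonal dimensions
```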

If items measure the same factor but on different scales, alpha will decrease

In the extreme case, if you have different items with perfect correlations but different linear coefficients (for instance, test items with varying difficulty curves), alpha will be less than 1 because the “steeper” items will have relatively higher variances than their covariances with the “less steep” items (larger denominator, smaller numerator). Alpha will underestimate measurement reliability when the “true scores” of the items differ, even when these items measure the same thing (i.e., are highly correlated in reality).2
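
Again with made-up data and the cronbach_alpha sketch from above: four items that are perfect linear functions of the same factor, just with different slopes, correlate perfectly with one another, yet alpha comes out around 0.83.

```python
import numpy as np

rng = np.random.default_rng(1)
factor = rng.normal(size=(500, 1))
slopes = np.array([1.0, 2.0, 4.0, 8.0])        # same construct, very different scales
items = factor * slopes                        # every pair of items has correlation exactly 1

print(np.corrcoef(items, rowvar=False).min())  # 1.0
print(cronbach_alpha(items))                   # about 0.83, even though r = 1 everywhere
```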

Other Gotchas

I have seen other errors related to Cronbach’s Alpha in reports, driven less by a subtle misunderstanding of the mathematical possibilities and more by problems with design. As with any statistic, you should always keep in mind exactly what it calculates and what assumptions drive theoretical conclusions from its value.

For example, floor and ceiling effects make it extremely difficult to interpret alpha, but this is less a problem with alpha and more a problem with the floor or ceiling effects themselves. When a large subset of questions is answered correctly by almost everyone or almost no one, covariance among these items will be high, but that does little to inform analysis. Generally, studies with substantial floor or ceiling effects have bigger problems to worry about than reliability, though.

A related problem in one report was caused by the test’s design. Students were not expected to get through very many of the questions on a timed test, so most of them “missed” most of the questions by design (the question order wasn’t randomized). This made the test seem very reliable according to Cronbach’s Alpha, when this high alpha was just an artifact of test design and had nothing to do with the relationship between the test’s questions and the underlying construct.

If you must use Cronbach’s Alpha, how should you use it?

Despite all this, many are comfortable with Cronbach’s Alpha and expect to see it in certain kinds of evaluations. Development evaluations are never a product of one person’s design, however well-intended, so you may need to accommodate this desire. How do you make Cronbach’s Alpha useful? Here are some suggestions.

Do factor analysis first

A principal component analysis or an exploratory factor analysis will often address the question many people have in mind when they ask to see Cronbach’s Alpha. In my experience, the underlying question often has more to do with unidimensionality than reliability. After confirming that a single factor explains much of the variance, you can then use Cronbach’s Alpha to summarize the remaining item-specific variance.
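
A minimal sketch of that first step, assuming items is a respondents-by-items NumPy array like the ones in the earlier examples and using scikit-learn’s PCA (an exploratory factor analysis, for instance via the factor_analyzer package, would be the more orthodox choice):

```python
from sklearn.decomposition import PCA

# Standardize first so items on larger scales do not dominate the components.
z = (items - items.mean(axis=0)) / items.std(axis=0)
pca = PCA().fit(z)
print(pca.explained_variance_ratio_.round(2))  # does one component account for most of the variance?
```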

Inspect the covariance matrix (and correlation matrix) and look at a scatter matrix or similar visualization

I am a strong believer in always visualizing data you are interested in. Visualizations can lead us to mistake noise for signal, to be sure, but refusing to look at your data is a strange way to avoid the garden of forking paths (PDF). Inspect the variance-covariance matrix for the items of interest to see where covariances are higher and lower; also look at the correlation matrix, since it is easier to interpret correlations than covariances individually. Finally, visualize these relationships with a scatter matrix. Pandas, R, and SPSS all provide easy ways to do this. Use these visualizations to help understand the relationships between items.
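
In pandas, for example, all three take only a few lines; df here is a hypothetical DataFrame built from the same respondents-by-items array used in the sketches above:

```python
import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import scatter_matrix

df = pd.DataFrame(items, columns=[f"item_{i + 1}" for i in range(items.shape[1])])
print(df.cov().round(2))     # variance-covariance matrix
print(df.corr().round(2))    # correlations are easier to read at a glance
scatter_matrix(df, figsize=(8, 8), diagonal="hist")
plt.show()
```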

Stop relying on \(\alpha > 0.7\) or \(\alpha > 0.8\)

Instead, inspect the covariances themselves. Alternatively, you could consider adding a different reliability coefficient, but I wouldn’t blame you for having difficulty choosing one as “authoritative”; statisticians still seem to disagree widely about the relative merits of different reliability coefficients. At the very least, consider the value of alpha in relation to the number of test items; a high alpha is much less impressive when there are many test items.

Back your decisions with subject-matter reasoning

This is, I think, the most important point. For example, SPSS provides “alpha if item deleted” values, and it is tempting to just remove items where this value is high during piloting (I have even seen an item dropped during analysis, solely because dropping it increased alpha). However, during a pilot, your sample size will be low, and you should be especially wary of accidentally dropping something reliable due to noise (you can mitigate this to some extent by cross-validating item removal across two folds). Further, even highly reliable items will increase alpha when removed if they scale differently from most of the other items. So you can make your test worse by relying on “alpha if item deleted” values even when these values are good population estimates.
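
For reference, “alpha if item deleted” is just alpha recomputed with each column dropped in turn; a sketch using the cronbach_alpha function from above, assuming items is the same respondents-by-items array as before:

```python
import numpy as np

for j in range(items.shape[1]):
    reduced = np.delete(items, j, axis=1)      # drop the j-th item
    print(f"alpha without item {j + 1}: {cronbach_alpha(reduced):.3f}")
```

Run this on the perfectly correlated but differently scaled items from the earlier sketch, and dropping the steepest item raises alpha from about 0.83 to about 0.86, even though that item is, by construction, perfectly reliable.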

Instead, outside rare experimental contexts where you have the luxury of randomized control, you should always justify removing an item, questioning the reliability of a test, or thinking of a test as reliable by reasoning about what the questions actually say. At the very least, you should have a good explanation why you might have item-specific variance for certain items. Are ad hoc, spurious justifications possible here? Sure, but these can always be challenged, and you should be clear that you are simply offering an educated guess. In my view, it is worse to drop an item with no justification at all, which is what you do if you make that decision solely based on the value of Cronbach’s Alpha.

  1. Hey, at least I am now officially a Wikipedia contributor after fixing a mathematical error on this page! 

  2. This is related to Cronbach’s Alpha’s assumption, in the context of classical test theory, that the items are essentially tau-equivalent; I promised not to get into test theory, so I won’t define “essentially tau-equivalent” here, but, as the name suggests, it has to do with the “true score” relationship among items in classical test theory. You can understand the underlying problem without needing the theory, at least for this example.