
Testing, testing: the 123's of testing

By: Annika B. Bergbauer, Eric A. Hanushek, and Ludger Woessmann

RISE Working Paper (October 2018)

Commentary by Lee Crawfurd

The effect of high-stakes testing is one of the most controversial issues in contemporary education debates worldwide. Globally, teachers tend to oppose standardised tests, partly because they perceive them to narrow the curriculum and crowd out wider learning. However, it is intuitive that the effects of testing could vary dramatically by context. Indeed, the impact may well follow a so-called “Laffer curve”. At low levels of testing, an increase may lead to better performance, as it provides relevant information and incentives to actors in the education system. Yet if levels of testing are already high, further increases may well decrease performance, due to stress, for example, or the effects of an overly narrowed curriculum. If so, we should expect the impact of testing to follow an inverted U-curve – or at the very least display diminishing returns. Furthermore, the impact of tests is also likely to depend on exactly how they are used in the education system.

This paper provides perhaps the first systematic evidence on these issues, using data from all six waves of PISA, which has been carried out every three years between 2000 and 2015. The sample covers 59 countries and over 2 million pupils. Since neither pupils nor teachers have incentives to perform well in PISA, the authors effectively bypass any potential effects that arise from gaming or manipulation, which often plague tests to which incentives are directly assigned. Instead, by analysing low-stakes outcomes, the authors analyse how testing affects actual knowledge acquisition.

The authors begin with a simple theoretical framework. In this framework, parents seek to balance pupils’ long-term wellbeing, which depends on what they learn, with their short-term happiness, which is negatively affected by their effort and positively affected by short-term rewards. But parents can’t easily observe the extent to which children learn, so they can’t reward them and their teachers for putting in the right kind of effort. Standardised testing helps provide such information, thereby providing incentives to teachers and/or pupils for better performance. Yet testing also has downsides: you can’t test for everything, so effort could be distorted away from unobserved outcomes. In addition, there may be teaching to the test, gaming, and even outright cheating. The framework therefore suggests that standardised testing may well have different effects in different contexts: in low-performing countries, the impact is likely to be positive, but in high-performing countries, the impact may be negative. Of course, the extent to which the positive or negative effects of testing dominate, and in which context, can’t be decided by theory alone. Ultimately this is an empirical question.

Next, the authors define four discrete uses/types of testing. In the first type, which the authors call “standardised external comparisons”, pupils are compared in exit exams at the end of a school phase to regulate entry to the next one. This information is then made public through school-league tables. In other words, testing is used to monitor both schools and pupils. The second type, which the authors call “standardised monitoring”, is similar to the first category, but with the important difference that information is not made public. This in turn implies that the tests provide less powerful incentives for actors in the education system. The third type, which the authors call “internal testing”, essentially means formative assessment. This form of testing has weak incentive effects because of the lack of comparability across schools and students. In the fourth type, which the authors call “internal teacher monitoring”, internal pupil assessments are used by schools to judge teacher effectiveness, but carry no high-stakes incentives for students.

The authors then classify each country, in each year, on these categories, using principal/head teacher surveys, in some cases supplemented by data from other OECD and EU sources. To estimate the effect of the different testing regimes on pupil outcomes, the paper then looks at changes in testing regimes within countries over time, rather than merely looking at cross-country correlations. This strategy removes country-specific confounders that are broadly unchanged over the period analysed, such as language or culture.
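The logic of this identification strategy can be illustrated with a toy sketch. This is not the paper’s actual estimation and all numbers below are synthetic: we simulate a country-by-wave panel in which scores depend on a fixed country trait and a testing-regime indicator, then recover the testing effect by demeaning each country’s series, so that only within-country changes in the regime do the identifying work.

```python
import numpy as np

rng = np.random.default_rng(0)
n_countries, n_waves = 59, 6                # roughly the shape of the PISA panel
country_effect = rng.normal(0.0, 50.0, n_countries)   # fixed traits (culture, language)
testing = rng.integers(0, 2, (n_countries, n_waves)).astype(float)  # regime on/off
beta_true = 10.0                            # assumed (synthetic) effect of testing
scores = (country_effect[:, None]
          + beta_true * testing
          + rng.normal(0.0, 5.0, (n_countries, n_waves)))

# Demeaning each country's series absorbs the fixed country effect,
# so the estimate uses only within-country changes in the testing regime.
t_dm = testing - testing.mean(axis=1, keepdims=True)
s_dm = scores - scores.mean(axis=1, keepdims=True)
beta_hat = (t_dm * s_dm).sum() / (t_dm ** 2).sum()
```

A naive cross-country regression of `scores` on `testing` would pick up the `country_effect` term whenever regimes correlate with country traits; the demeaned estimator does not.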

The findings show that the only average positive effects come from standardised external testing that is comparable across schools and made public. Standardised monitoring and internal testing have no effects on average, while internal monitoring in fact has an average negative effect. In other words, across all countries, standardised external testing – the category carrying the strongest accountability – appears to be the most fruitful avenue for improving low-stakes outcomes.

But what is really interesting, as shown in Figure 3 of the paper (reproduced here), is how the effects vary for countries at different initial levels of performance. For all testing categories apart from internal testing, there is a clear negative slope. This means that these types of testing have positive effects on achievement in countries at the lowest initial level of achievement, although the effects are only statistically distinguishable from zero for standardised external comparisons and standardised monitoring. However, the positive effects decrease as countries’ initial achievement levels increase. For the countries with the highest initial performance, there is some evidence of negative effects. Yet these are only statistically distinguishable from zero for standardised monitoring and internal teacher monitoring. The evidence is therefore most clearly positive when tests are high-stakes for both pupils and schools, but there is also clear evidence of diminishing returns.
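This kind of heterogeneity is typically captured by interacting the testing indicator with initial achievement. A toy sketch (again synthetic, not the paper’s estimation): because initial achievement is fixed within a country, its interaction with the time-varying testing indicator survives within-country demeaning, and a negative interaction coefficient produces exactly the downward slope described above.

```python
import numpy as np

rng = np.random.default_rng(1)
n_countries, n_waves = 59, 6
initial = rng.normal(450.0, 50.0, n_countries)         # hypothetical initial scores
initial_c = initial - initial.mean()
country_effect = rng.normal(0.0, 30.0, n_countries)
testing = rng.integers(0, 2, (n_countries, n_waves)).astype(float)
b0, b1 = 8.0, -0.05                                    # effect shrinks as initial score rises
scores = (country_effect[:, None]
          + (b0 + b1 * initial_c)[:, None] * testing
          + rng.normal(0.0, 5.0, (n_countries, n_waves)))

# Demean within country; the testing effect and its interaction with
# (time-invariant) initial achievement are still identified.
t_dm = testing - testing.mean(axis=1, keepdims=True)
s_dm = scores - scores.mean(axis=1, keepdims=True)
X = np.column_stack([t_dm.ravel(), (initial_c[:, None] * t_dm).ravel()])
coef, *_ = np.linalg.lstsq(X, s_dm.ravel(), rcond=None)
# coef[0]: effect at average initial achievement; coef[1]: slope in initial score
```

With `b1 < 0`, the fitted effect of testing is positive for low-scoring countries and falls towards (or below) zero for high-scoring ones, mirroring the pattern in Figure 3.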

Importantly, the countries analysed are limited to those in the PISA sample, which includes few low-income countries, many of which would have lower initial scores. In other words, it might be reasonable to assume that the positive effects for most low-income countries would in fact be even larger than the authors’ estimates indicate.

Pupil assessment is regularly used for accountability purposes in education systems worldwide. But context really does appear to matter for effectiveness. By and large, teachers in high-income countries loathe standardised testing for a host of reasons. This paper suggests that there may be some truth to this, although there is in fact little evidence of negative effects in an absolute sense when tests are high-stakes for both pupils and schools. Furthermore, the paper also suggests that greater use of testing could be highly beneficial in low-performing systems. While we should be careful not to make overly broad generalisations from experience in dramatically different contexts, the findings nevertheless provide important new evidence of the effects of testing and test-based accountability worldwide.


Lee Crawfurd is a CfEE Fellow and Deputy Editor of its Monthly Research Digest.                                                                                         

Subscribe here to get monthly email notifications of the release of each issue.

You can download free copies of back issues of the CfEE Monthly Research Digest here.
