CHANGING THE WORLD, ONE COIN TOSS AT A TIME
Evidence and Implementation Summit, Melbourne
Wednesday, 11 October 2023
I acknowledge the people of the Kulin Nations as traditional custodians of the land and pay my respects to their Elders past and present. I commit myself to the implementation of the Uluru Statement from the Heart, which starts with voting Yes this Saturday.
I thank Monash University, the National University of Singapore and the Centre for Evidence and Implementation for hosting today’s Summit. It’s terrific to see so many of you dedicated to closing the ‘know-do gap’ – the gap between what we know and what we do.
The title of my presentation is ‘Changing the World, One Coin Toss at a Time’. I chose this title because the simple act of tossing a coin can help us get the evidence we need to address our most difficult problems. Heads, they receive the intervention. Tails, they’re in the control group. From there, we can establish a counterfactual and begin to evaluate what works and what doesn’t work.
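As a minimal sketch of the coin-toss idea, the Python below assigns each participant to the intervention or control group at random and then compares average outcomes between the two groups. The helper names (`assign_by_coin_toss`, `difference_in_means`) and the data are hypothetical, chosen only for illustration and not drawn from any trial mentioned in this speech.

```python
import random
from statistics import mean


def assign_by_coin_toss(participants, seed=None):
    """Toss a fair coin per participant: heads -> intervention, tails -> control."""
    rng = random.Random(seed)  # a seed makes the assignment reproducible and auditable
    groups = {"intervention": [], "control": []}
    for person in participants:
        heads = rng.random() < 0.5
        groups["intervention" if heads else "control"].append(person)
    return groups


def difference_in_means(outcomes, groups):
    """Estimated effect: mean outcome in the intervention group minus the control group."""
    treated = [outcomes[p] for p in groups["intervention"]]
    control = [outcomes[p] for p in groups["control"]]
    return mean(treated) - mean(control)


# Hypothetical usage: 100 participants with made-up outcome scores.
groups = assign_by_coin_toss(range(100), seed=1)
outcomes = {p: 10 + (2 if p in groups["intervention"] else 0) for p in range(100)}
print(difference_in_means(outcomes, groups))
```

Because chance alone decides who lands in each group, the control group's average stands in for the counterfactual: what would have happened to the intervention group without the program.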
Recently, at the National Press Club and the Australian Evaluation Society Conference, I’ve spoken about randomised trials, their origins in medicine and the need to embed evaluation in the work of government. I’ve spoken about social impact and how rigorous evidence can give us an accurate picture of program effectiveness. I’ve also spoken about how the increased availability of large, integrated administrative datasets can help us conduct evaluations more quickly and cheaply, making data and evaluation a match made in policy heaven. Today, I want to zoom out a little and discuss what best practice use of evidence looks like.
Best practice use of evidence
In doing that, I’m going to start by drawing on the Global Commission on Evidence Report that came out last year, on which I served as one of 25 commissioners (Global Commission on Evidence to Address Societal Challenges 2022).
Led by Professor John Lavis and a secretariat at McMaster University in Canada, the report concluded that ‘Evidence… is not being systematically used by government policymakers, organisational leaders, professionals and citizens to equitably address societal challenges.’
The Global Commission on Evidence Report provides three headline observations.
Everyone and everyday
First, evidence is for everyone.
The Global Commission report says we need to put evidence at the centre of our everyday lives.
At an individual level we need access to sound evidence to make the right choices about our wellbeing, about the products and services we receive or are supported by, or the causes to which we donate.
Government policymakers, organisation leaders and professionals are included in that definition of everyone.
We need to use evidence to its best advantage and build it into our decision-making processes to improve the lives of the people and communities we serve.
Second, we need to apply critical thinking to differentiate between stronger and weaker forms of evidence.
For example, a single study can generate a lot of media excitement.
But the Global Commission recommends that when we hear about such studies, we should seek out a critical appraisal.
Better yet, we should look for an evidence synthesis that incorporates the study along with related studies to understand the effect of a particular policy or program.
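One simple way to see what a synthesis does with several studies is inverse-variance pooling, sketched below in Python. This is an illustrative assumption about method, not the Global Commission’s procedure: it is a fixed-effect model, the `pooled_estimate` helper is hypothetical, and the study numbers are invented. Real syntheses also involve critical appraisal and often use random-effects models.

```python
def pooled_estimate(studies):
    """Fixed-effect inverse-variance pooling of (effect, standard_error) pairs.

    More precise studies (smaller standard errors) receive more weight.
    """
    weights = [1 / se ** 2 for _, se in studies]
    total_weight = sum(weights)
    pooled = sum(w * effect for w, (effect, _) in zip(weights, studies)) / total_weight
    pooled_se = (1 / total_weight) ** 0.5
    return pooled, pooled_se


# Hypothetical studies: (estimated effect, standard error)
studies = [(0.20, 0.10), (0.40, 0.10), (0.05, 0.25)]
effect, se = pooled_estimate(studies)
print(f"pooled effect = {effect:.3f}, standard error = {se:.3f}")
```

The pooled standard error is smaller than that of any single study, which is one reason a synthesis deserves more confidence than a lone headline result.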
Similarly, we should be wary of ‘eminence-based’ decision-making when it replaces ‘evidence-based’ decision-making.
We will always need to rely on experts because we don’t have time to develop the same level of knowledge and expertise.
But the best experts will be able to share and summarise the evidence – or even better, the evidence synthesis – that their advice is based on.
Third, especially in government, we need to nurture all elements of our national ‘evidence infrastructure’ (Global Commission on Evidence to Address Societal Challenges 2022, Section 4.14).
The ‘research system’, which includes universities and research funding bodies like the Australian Research Council, is a central part of this infrastructure.
But the Global Commission urges us to pay attention to other elements of the evidence infrastructure.
The ‘evidence-support system’ refers to the capability, processes and units that support evidence use.
In Australia, this includes the requirements for evidence and evaluation set out in the Budget Process Operational Rules, which our Government has made public for the first time.
The ‘evidence-implementation system’ refers to organisations engaged in evidence synthesis and preparation of guidelines based on those syntheses.
It also refers to the teams inside and outside government that are thinking about how to implement evidence.
The Education Endowment Foundation
So what does best practice use of evidence look like and can it work in the real world of policy making and implementation?
The Education Endowment Foundation was established by the UK Government to generate, share and use high quality evidence to inform teachers, school principals and policy makers about what works in improving academic outcomes for disadvantaged students.
In effect, the Foundation straddles several features of the ‘evidence infrastructure’, since it is involved in:
- the ‘research system’ – for example, by commissioning new research, and
- the ‘evidence-implementation system’ – by preparing evidence syntheses and guidelines.
Robust evaluations, such as randomised trials, are a hallmark of the Foundation’s work.
In fact, it is one of the leading bodies commissioning randomised trials in education globally, delivering about 19 per cent of all known education trials in the past 10 years (Edovald and Nevill 2021).
Furthermore, the Foundation has commissioned some of the largest education trials.
Sample size matters for generating more precise results. It also allows researchers to conduct analyses for different sub-groups to determine not only what works but also for whom.
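To illustrate why sample size drives precision, the sketch below computes the standard error of a difference in means under simplifying assumptions: students are randomised individually, the two groups are the same size, and both share the same outcome standard deviation. The function name is hypothetical. School-based trials usually randomise whole schools, which calls for cluster adjustments that this sketch ignores.

```python
import math


def standard_error_of_difference(sd, n_per_group):
    """Standard error of a difference in means for two equal-sized groups with a common SD."""
    return sd * math.sqrt(2 / n_per_group)


# Outcomes measured in standard deviation units, so sd = 1.0.
for n in (50, 200, 800):
    se = standard_error_of_difference(sd=1.0, n_per_group=n)
    print(f"n per group = {n:4d}: standard error = {se:.3f}, 95% margin = {1.96 * se:.3f}")
```

Under these assumptions the standard error falls from 0.200 to 0.100 to 0.050: quadrupling the sample only halves the uncertainty. Splitting a trial into sub-groups shrinks the sample in each one, which is why only large trials can say much about what works for whom.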
Over the years, the Foundation has built a knowledge base from which we can learn how to generate and use evidence.
One of the best features of the Education Endowment Foundation is that it makes research accessible.
Teachers and decision makers in the education community have classes to teach and schools to run. They don’t have time to sift through mountains of studies.
To help them compare and select programs, the Foundation uses a consistent approach and process to deliver and report evaluation results.
The Foundation’s Dashboard uses a padlock rating system to help teachers interpret complex trial results and understand the level of confidence they can have in the evaluation results.
An evaluation receiving five padlocks is the equivalent of fine dining at a hatted restaurant. It means you can trust the evidence is of the highest quality.
By contrast, a rating of zero padlocks would be a hit-or-miss pizza joint – in other words, the study adds little to the evidence base (and might cause indigestion).
When educators ask for support for a new program, cost is likely to be the first thing that comes up in that conversation.
The Dashboard rates the cost effectiveness of interventions. For example, the University of Oxford evaluated a program called 1stClass@number – a numeracy intervention program used by over 4,000 schools (Education Endowment Foundation 2018).
The researchers found pupils in the program made, on average, two additional months’ progress in maths. The evaluation was rated four padlocks – a high security rating. And the intervention was well rated on implementation costs at £77 per student.
The Breakfast Club
The Foundation lists more than 200 projects, including the Institute for Fiscal Studies’ evaluation of the Magic Breakfast Project in 2017.
The Magic Breakfast Project is a free, universal, before-school breakfast club for Year 2 to 6 pupils in the UK aimed at improving academic outcomes by increasing the number of children who ate a healthy breakfast (Education Endowment Foundation 2017).
Controlling for variables including age, school level and socio-economic status, the trial suggests that Year 2 children made two months’ additional progress, but that Year 6 children saw no benefit. This evidence achieved only a ‘two padlock’ security rating.
Why two padlocks? The study was originally designed as a randomised trial (which might have earned it a five padlock rating), but there was an error in the randomisation procedure. Consequently, it was re-analysed using regression analysis, which meant the evaluation was not as reliable.
The main impact evaluation was accompanied by a process evaluation, comprising a survey of teachers, and extensive interviews with parents, teachers, children and delivery staff. These were used to understand barriers to take‑up, and features of effective delivery of the breakfast clubs.
What does it tell us? It's not just eating breakfast that delivers improvements but attending the club. This could be due to the content of the breakfast itself or to other social or educational benefits of the club.
Teachers reported in a survey that student behaviour improved in breakfast clubs. This is interesting because breakfast clubs may improve outcomes for children who do not even attend breakfast club by improving classroom environments.
A basic implementation challenge for the program was how to increase take-up of the breakfast provision. One approach was to promote it to parents and encourage all children to attend while sensitively targeting pupils most likely to benefit.
The promising results from the Magic Breakfast trial led to the expansion of breakfast clubs to 1,775 schools in a National School Breakfast Program, addressing the concerns of the original evaluation by delivering several models.
This included a traditional sit-down breakfast club, a healthy ‘grab and go’ breakfast in the playground or school entrance, and breakfast in the form of a ‘soft start’ where classrooms opened early for breakfast (Education Endowment Foundation 2021).
Evidence for Learning’s evaluations
Coin tossing and education evaluations are also occurring in Australia.
Established in 2015, the Australian spinoff of the Education Endowment Foundation is the Evidence for Learning initiative.
Like its UK counterpart, Evidence for Learning aims to help make ‘great practice become common practice’ by improving ‘the quality, availability and use of evidence in education’ (E4L 2023).
In doing so, Evidence for Learning has published the results and summaries of randomised trials evaluating the effectiveness of several education programs in Australian schools.
The independent evaluation reports are valuable because they provide key considerations for teachers, school leaders, program developers and decision makers at the system level.
In one example, the Teachers and Teaching Research Centre at the University of Newcastle evaluated the QuickSmart Numeracy program in 2019 (E4L 2019).
Teachers delivered the QuickSmart tutoring program over 90 sessions across 30 weeks to Year 4 and Year 8 students to develop their fluency in basic maths operations.
The researchers assessed more than 280 students from 23 schools in the tutoring program against a control group of students who continued with their regular maths classes.
There was strong evidence that QuickSmart improved primary school students’ interest and confidence in maths but it didn’t have an additional impact on maths achievement. This evaluation has a high-evidence rating of four padlocks.
This is an important finding that reflects an issue we often see in evaluation – while interest and confidence increased, these changes did not translate to an improvement in the main outcome of interest.
The researchers found schools had trouble timetabling the sessions. None of the secondary students completed 90 per cent or more of the program.
Consequently, the apparent failure of QuickSmart to improve maths achievement may be due to implementation challenges and program design.
The next step for policymakers is to decide whether it’s worth investing in achieving better content design and attendance rates, or whether they should cut their losses and look elsewhere.
I’ll come back to the issue of implementation later.
The next Australian example is ‘one of the largest randomised controlled trials in education in Australia’ (E4L 2018).
In September 2018, the Australian Council for Educational Research evaluated the Thinking Maths program – a program for maths teachers to better engage with middle-school students.
The program evaluation involved more than 150 schools in South Australia who were randomly assigned to an intervention group or a control group.
Importantly, no one missed out on the program. The first group started the program in term 1, while the second group acted as a control group and started in term 4. That’s a common way of conducting a randomised trial, because it ensures that the control group eventually gets the intervention. If programs have to be rolled out progressively, why not randomise and learn from the rollout?
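A staggered rollout of this kind can be randomised with a few lines of code. The sketch below is an illustrative assumption, not the procedure used in the Thinking Maths evaluation: the `randomise_rollout` helper and the school names are hypothetical.

```python
import random


def randomise_rollout(schools, seed=None):
    """Shuffle schools and split them into an early (term 1) and a late (term 4) start group."""
    rng = random.Random(seed)  # record the seed so the allocation can be audited
    shuffled = list(schools)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"term_1": shuffled[:half], "term_4": shuffled[half:]}


# Hypothetical usage with ten placeholder schools.
schools = [f"School {i}" for i in range(1, 11)]
rollout = randomise_rollout(schools, seed=2018)
print("Start in term 1:", rollout["term_1"])
print("Start in term 4 (control until then):", rollout["term_4"])
```

Every school appears in exactly one group, so all of them receive the program. Outcomes measured before term 4 let the late starters serve as the comparison group.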
Thinking Maths required maths teachers to attend 30 hours of face-to-face professional learning to make their maths teaching more engaging.
The evaluation found the Thinking Maths program had a small positive effect on students, mainly on the primary school cohort.
However, researchers said the largest statistically significant effect was on the teachers including their content knowledge, professional identity and self-efficacy.
These evaluations are a positive start but there’s more work to do.
Collectively, such evaluations can provide a fuller understanding of what works across a range of programs and practices, in the context of Australia’s schools.
Teachers can weigh up information about effectiveness, cost and the quality of the evidence to make informed decisions.
And the evaluations offer valuable lessons to improve how we produce and share better evidence.
Effective educational interventions depend not just on having a great idea, but on implementing it effectively.
Nowhere is this better illustrated than in a new research paper by economists Noam Angrist and Rachael Meager. They explore a policy known as ‘targeted instruction’, which involves regrouping students by learning proficiency rather than by grade.
In low and middle income countries, this approach uses customized and engaging teaching and learning that is targeted to the learning level of the child. The World Bank, government aid agencies and non-governmental donors are actively engaged in scaling up targeted instruction programs such as India’s ‘Teaching at the Right Level’ program.
Yet Angrist and Meager (2023) show that while the impact of targeted instruction programs is positive, the effects vary markedly – from 0.07 to 0.78 standard deviations. That’s more than a tenfold difference in effectiveness. Using a new randomised trial in Botswana, they show that when the program is implemented with high fidelity, it can deliver average gains of 0.8 standard deviations. Improving implementation is as important as having a good program in the first place.
While this finding comes from a developing country context, the broad lesson applies to Australia. It’s not enough to have good interventions – we also need to ensure that they’re implemented effectively. If we want to change children’s lives for the better, we need to sweat the details.
Australian Centre for Evaluation
Many of the problems governments face in public policy are difficult. If it were easy to close life expectancy gaps, educational gaps or employment gaps, then past generations would have done it already. The fact these challenges persist means good intentions aren’t enough.
So there is an opportunity for governments to become better consumers of evidence.
We can raise the bar by making sure claims about a policy or program’s effectiveness are based on quality evidence.
By tossing a coin and establishing the ‘counterfactual’ – that is, establishing what would have happened if the policy wasn’t implemented – the Australian Public Service can also help produce and share evidence about what works and what doesn’t work.
We’ve established the Australian Centre for Evaluation within Treasury to provide leadership and to make rigorous evaluation a normal part of developing policies and programs.
Over the years, the Australian Centre for Evaluation will improve the volume, quality, and impact of evaluations across the Australian Public Service.
As well as championing randomised trials and other high‑quality impact evaluations, the Centre will partner with government agencies to initiate a small number of high‑quality impact evaluations each year.
I can announce today that the ACE has entered into its first partnership agreement to support high-quality impact evaluations. This partnership will be with the Department of Employment and Workplace Relations and will use randomised trials to evaluate different features of online employment services. All planned trials will be subject to ethics review, consistent with the National Statement on Ethical Conduct in Human Research. We will provide further details on these trials when they are further advanced, but the intention is that they will commence in 2024.
In our favourite TV cop shows, the crucial evidence is guarded under lock and key. But the policy evidence locker should be just the opposite.
Evidence should be for everyone. It’s not about convicting a wrongdoer, it’s about providing lessons to avoid repeating – often costly – policy mistakes.
The Evidence Commission’s report goes to the value of focusing on evidence synthesis rather than single studies or sources. In the detective world, this is akin to not only having fingerprints but also CCTV footage and reliable witnesses to corroborate the evidence.
After working on the Evidence Commission report, I’m even more passionate about the need for randomised policy evaluations.
We can learn from the Education Endowment Foundation model.
It shows us that it’s possible to find interventions that are both low-cost and proven to be effective. It shows the club is just as important as the breakfast.
But most of all, the Foundation shows that it’s possible to build a knowledge base from which we can generate and use evidence to improve outcomes. Likewise, the Australian offshoot, Evidence for Learning, is following in the Foundation’s footsteps.
All levels of government have a role in evaluating their policies and closing the know-do gap. We’ve established the Australian Centre for Evaluation to lead the way in building a better feedback loop and help build a culture of continuous improvement within the Australian Public Service.
We can improve our policy interventions and address our most difficult challenges, one coin toss at a time. But the coin toss is only the start of the evidence building process. We also need to bank evidence, share it and implement it.
* My thanks to officials in the Australian Treasury for assistance in drafting these remarks.
Angrist, Noam and Rachael Meager. 2023. ‘Implementation Matters: Generalizing Treatment Effects in Education’, Working Paper.
Edovald, T, and Nevill, C 2021 ‘Working Out What Works: The Case of the Education Endowment Foundation in England’ ECNU Review of Education, 4(1), 46–64.
Education Endowment Foundation 2018 1stClass@number [University of Oxford, evaluation completed July 2018]
Education Endowment Foundation 2017 Magic Breakfast: a free, universal, before-school breakfast club [Summary of an evaluation completed by The Institute for Fiscal Studies in 2017].
Education Endowment Foundation 2021 National School Breakfast Programme [Summary of an evaluation completed by The Behavioural Insights Team in 2021].
Evidence for Learning (E4L) 2023 Who are we and what we do.
Evidence for Learning (E4L) 2019 QuickSmart Numeracy [Teachers and Teaching Research Centre at the University of Newcastle evaluation of QuickSmart Numeracy, evaluation completed in April 2019].
Evidence for Learning (E4L) 2018 Thinking Maths [Australian Council for Educational Research, evaluation completed September 2018].
Global Commission on Evidence to Address Societal Challenges 2022 The Evidence Commission report: A wake-up call and path forward for decisionmakers, evidence intermediaries, and impact-oriented evidence producers. Hamilton, Canada. McMaster Health Forum.