Address to Oxford University, England
Thank you to each of you – randomistas and non-randomistas alike – for taking the time to join us today. I am grateful to my friend and co-author, the prodigiously productive Philip Clarke, for making today’s talk happen, and to our four institutional hosts: Oxford Population Health’s REAL Supply and Demand Units, the Oxford Health Economics Research Centre, and the Oxford Centre for Health Economics.
This is the first talk I’ve given at Oxford since the passing of my extraordinary co-author Tony Atkinson on New Year’s Day 2017. Alongside many of you at Nuffield and the broader Oxford community, I was one of those whose work was shaped by Tony’s ideas and ideals. His smiling photo hangs on the wall behind my desk – a reminder that the best academics aren’t just brilliant and brave, but gentle and generous too.
Let’s start with a story.
In 1747, 31-year-old Scottish naval surgeon James Lind set about determining the most effective treatment for scurvy, a disease that was killing thousands of sailors around the world. Selecting 12 sailors suffering from scurvy, Lind divided them into six pairs. Each pair received a different treatment: cider; sulphuric acid; vinegar; seawater; a concoction of nutmeg, garlic and mustard; and two oranges and a lemon. In less than a week, the pair who had received oranges and lemons were back on active duty, while the others languished. Given that sulphuric acid was the British Navy’s main treatment for scurvy, this was a crucial finding.
The trial provided robust evidence for the powers of citrus because it created a credible counterfactual. The sailors didn’t choose their treatments, nor were they assigned based on the severity of their ailment. Instead, they were randomly allocated, making it likely that differences in their recovery were due to the treatment rather than other characteristics.
Lind’s randomised trial, one of the first in history, has attained legendary status. Yet because 1747 was so long ago, it is easy to imagine that the methods he used are no longer applicable. After all, Lind’s research was conducted at a time before electricity, cars and trains, an era when slavery was rampant and education was reserved for the elite. Surely, some argue, ideas from such an age have been superseded today.
In place of randomised trials, some put their faith in ‘big data’. Between large-scale surveys and extensive administrative datasets, the world is awash in data as never before. Each day, hundreds of exabytes of data are produced. Big data has improved the accuracy of weather forecasts, permitted researchers to study social interactions across racial and ethnic lines, enabled the analysis of income mobility at a fine geographic scale and much more.
Yet a clue to the value of randomised trials comes from the behaviour of the biggest big data company of them all, Google. Since its founding in 1998, Google has conducted thousands of randomised trials to refine its products. The company regularly conducts randomised trials (often dubbed A/B testing) to see how users prefer search results to be displayed, as measured by click-through rates. The company uses randomised trials to determine which features should be added to Google Maps, Google Docs and Gmail, balancing functionality against complexity. It runs randomised trials of its ad auctions, the way privacy settings are displayed, and recommendation algorithms. Among its employees, Google has conducted randomised trials to determine the optimal length of meetings, the impact of remote work, employee wellness programs and productivity tools.
Why would Google conduct randomised trials rather than using big data? Because it is keen to uncover causal effects. To see this, suppose that the company instead decided to determine the impact of product tweaks by looking at patterns in the data. For example, it could offer a new function in Google Sheets, and compare the productivity of users who took it up with the productivity of users who did not take it up. Such an analysis might also hold constant other observed factors about the two groups of users, such as how often they use the product.
The problem with such an analysis is that what isn’t observed can have a major impact on productivity. If users who like new functions are increasing their productivity at a more rapid rate, then this will bias the estimate upwards. Conversely, if users who like new functions are procrastinating, it will bias the estimate downwards. Google doesn’t know the true answer, so it opts for a randomised trial. In conducting its randomised trials, big data is a massive asset for Google. But big data doesn’t preclude the need to do randomised trials.
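To make the selection problem concrete, here is a minimal simulation sketch. Everything in it is an assumption for illustration, not a description of Google's systems: the unobserved trait `keenness`, the effect sizes, and the function name `compare_designs` are all hypothetical. Users who are keener are both more productive and more likely to take up the new function, so comparing adopters with non-adopters overstates the benefit, while a coin-flip assignment recovers something close to the true effect.

```python
import random
from statistics import mean


def compare_designs(n_users=20000, true_effect=1.0, seed=0):
    """Return (observational_estimate, randomised_estimate) of a feature's effect.

    All quantities are simulated; 'keenness' is a hypothetical unobserved trait.
    """
    rng = random.Random(seed)
    obs = {True: [], False: []}  # outcomes grouped by whether the user CHOSE the feature
    rct = {True: [], False: []}  # outcomes grouped by coin-flip ASSIGNMENT
    for _ in range(n_users):
        # Unobserved trait that raises productivity and the appetite for new features.
        keenness = rng.gauss(0, 1)
        baseline = 2.0 * keenness + rng.gauss(0, 1)

        # Observational world: keener users are more likely to adopt.
        chose = keenness + rng.gauss(0, 1) > 0
        obs[chose].append(baseline + (true_effect if chose else 0.0))

        # Randomised world: adoption is decided by a fair coin, unrelated to keenness.
        assigned = rng.random() < 0.5
        rct[assigned].append(baseline + (true_effect if assigned else 0.0))

    observational = mean(obs[True]) - mean(obs[False])
    randomised = mean(rct[True]) - mean(rct[False])
    return observational, randomised


if __name__ == "__main__":
    observational, randomised = compare_designs()
    print(f"true effect: 1.0, observational: {observational:.2f}, randomised: {randomised:.2f}")
```

In this sketch the observational comparison lands well above the true effect, and adding more simulated users does not fix it: a bigger sample shrinks the random error but leaves the systematic error untouched.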
Another example arises in heart health (Collins et al 2020). Randomised trials have demonstrated a strongly beneficial effect of statins on reducing cardiovascular mortality. Yet when they analysed a database covering the entire Danish population, researchers found that the chance of death from cardiovascular causes was one quarter higher among those who took statins than among those who did not. The explanation is straightforward: people who were prescribed statins were at elevated risk of having a heart attack. Yet even when researchers made statistical adjustments, using all the variables available in the database, they were unable to reproduce the well-known finding that statins have a beneficial effect on cardiovascular mortality.
Analysis of the Danish database also suggested that the relative risk of cancer was 15 percent lower among patients who took statins, an effect that remained statistically significant even after controlling for other factors about the patients. Yet this result is at odds with the evidence from randomised trials. A meta-analysis of randomised trials, covering more than 10,000 cases of cancer, found no effects of statins on the incidence of cancer, nor on deaths from cancer. On average, these randomised trials covered a five-year period; longer than in the non-randomised database analysis.
The observational data was doubly wrong. Observational data failed to replicate the well-known finding that statins improve heart health. And observational data wrongly suggested that statins reduce the risk of cancer. Only randomised trials, which are not biased by selection effects, provided the correct answer.
A similar issue arose with estimating the health impact of hormone replacement therapy for postmenopausal women. In 1976, the Nurses’ Health Study began tracking over 100,000 registered nurses. The study found that women who chose to use hormone replacement therapy halved their risk of heart disease (Stampfer et al 1985). By the late 1990s, around two-fifths of postmenopausal women in the United States were using hormone replacement therapy – mostly to reduce the risk of heart disease. However, no randomised trial had evaluated the impact of hormone replacement therapy.
Then the National Institutes of Health funded two randomised trials, comparing hormone replacement therapy against a placebo (Manson et al 2024). The trials, which began in 1993, did not support menopausal hormone replacement therapy to prevent coronary heart disease. Indeed, one of the randomised trials was stopped early because the data and safety monitoring board concluded that there was some evidence of harm. With the health of millions of women at stake, the early observational data had presented an inaccurate picture of the impact of hormone replacement therapy. The fact that the observational studies had a larger sample size than the randomised trials did not help. Lacking evidence from randomised trials, millions of women took a treatment that had an adverse impact on their health. Only randomised trials uncovered the truth.
Researcher Rory Collins and his coauthors refer to this as the ‘magic of randomisation’ (Collins et al 2020). Large datasets are a valuable complement to randomised trials. But big data is not a substitute for randomisation.
In a joint statement last year, the European Society of Cardiology, American Heart Association, American College of Cardiology, and the World Heart Federation (Bowman et al 2023) concluded that ‘The widespread availability of large-scale, population-wide, real world data is increasingly being promoted as a way of bypassing the challenges of conducting randomized trials. Yet, despite the small random errors around the estimates of the effects of an intervention that can be yielded by analyses of such large datasets, non-randomized observational analyses of the effects of an intervention should not be relied on as a substitute, due to their potential for systematic error.’ They call for measures to ensure that randomised trials are ‘fit for the twenty-first century’, addressing issues such as rising cost and complexity.
More evidence on the limitations of observational data comes from studies that look at the effect on health of what we eat and drink. Take one of the more controversial debates – over the health effect of alcohol.
In many studies, researchers using observational data had found that moderate alcohol drinkers tended to be healthier than non-drinkers or heavy drinkers. As bigger datasets emerged, they only confirmed the earlier studies – moderate drinkers lived longest. This led many doctors to advise their patients that a drink a day might be good for your health. Some former teetotallers took up drinking to get the health benefits. After all, who doesn’t want to live a few more healthy months or years?
Alas, a meta-analysis published last year concludes that this was a selection effect (Zhao et al 2023). In some studies, the population of non-drinkers included former alcoholics who had gone sober. The researchers also noted that when compared with non-drinkers, light drinkers are healthier on many dimensions, including weight, exercise and diet.
We don’t have large-scale randomised trials of moderate alcohol consumption, but we do have another form of random variation, arising from the fact that a portion of the population have genes that make them unable to tolerate alcohol. Studies that use these random differences in genetic predisposition to alcohol find no evidence that light drinking is good for your health (Biddinger et al 2022). A daily chardonnay isn’t as bad as a daily cigar, but modest doses of alcohol won’t extend your life. Abstainers who were persuaded to take up light drinking by the flawed observational studies were damaging their health.
The problem extends to just about every study you’ve ever read that compares outcomes for people who choose to consume one kind of food or beverage with those who make different consumption choices. Health writers Peter Attia and Bill Gifford point out that ‘our food choices and eating habits are unfathomably complex’, so observational studies are almost always ‘hopelessly confounded’ (Attia and Gifford 2023, p300).
A better approach is that adopted by the US National Institutes of Health, which is conducting randomised nutrition studies. These require volunteers to live in a dormitory-style setting, where their diets are randomly changed from week to week. Nutritional randomised trials are costlier than nutritional epidemiology, but they have one big advantage: we can trust the findings. They inform us about causal impacts, not mere correlations.
In policymaking, randomised trials have been deployed in unexpected places. Randomised trials of policing strategies have shown that hot spots policing reduces crime (Sherman and Weisburd 1995). A randomised trial of incarceration policies found that releasing prisoners six months early did not raise recidivism (Berecochea and Jaman 1981). A randomised trial found that when people in India were given a financial incentive to get their licence earlier, they were more likely to bribe the tester (Bertrand et al 2007). A randomised trial in Mexico found that road upgrades boost property prices and reduce poverty (Gonzalez-Navarro and Quintana-Domeque 2016). A randomised trial with airline pilots found that providing feedback on fuel use led captains to be more economical, saving the airline a million litres of fuel (Gosnell et al 2016). Economists are also integrating the findings from randomised trials into macroeconomic models (Buera, Kaboski and Townsend 2023), and using field experiments to carefully test economic theories (Banerjee 2020).
Yet by comparison with health, the uptake of randomised trials in social sciences remains modest. Last month, the Global Evidence Report: A Blueprint for Better International Collaboration on Evidence, authored by David Halpern and Deelan Maru, reported on the volume of randomised trials in health and the social sciences over time. From the 1990s to the 2020s, the number of randomised trials in health has exploded from 10,000 to almost 250,000. Yet over the same period, the number of randomised trials in the social sciences has risen from a few thousand to fewer than 20,000. For every randomised trial in the social sciences, there are around ten randomised trials in health. This is all the more startling given the breadth of the social sciences, covering education, crime, employment, homelessness and political engagement. In budgetary terms, governments spend much more on those areas than on health alone. Yet in terms of randomised trials, health remains far ahead.
In Australia, a study from the think tank CEDA examined a sample of 20 Australian Government programs conducted between 2015 and 2022 (Winzar et al. 2023). The programs had a total expenditure of over A$200 billion. CEDA found that 95 per cent were not properly evaluated. CEDA’s analysis of state and territory government evaluations reported similar results. As the CEDA researchers note, ‘The problems with evaluation start from the outset of program and policy design’. Across the board, CEDA estimates that fewer than 1.5 per cent of Australian government evaluations use a randomised design (Winzar et al. 2023, 44).
The relatively small number of randomised trials of social programs is particularly troubling given what the evidence tells us about the programs that are rigorously evaluated. In health, only one in ten drugs that look promising in the laboratory make it through Phase I, II and III clinical trials and onto the market (Hay et al 2014). In education, an analysis of randomised trials commissioned by the US Department of Education’s Institute of Education Sciences found that only one in ten produced positive effects (Coalition for Evidence-Based Policy 2013). And remember all those trials that Google is doing? Google estimates that just one in five of their randomised trials help them improve the product (Thomke 2013).
This suggests that the best approach to policymaking is what US President Franklin D. Roosevelt once called ‘bold, persistent experimentation’ (Roosevelt 1932). If many promising policies do not work as well as intended, then rigorous evaluation is essential to building a cycle of continuous improvement. Rigorous evaluation guarantees that government policies in a decade’s time will be more effective than they are today. A failure to evaluate runs the risk that we will unwittingly repeat our mistakes. Evaluation puts us in a virtuous feedback loop. Without it, we can end up in a doom loop.
How can we encourage more rigorous evaluation? There are five approaches that can promote more high-quality evaluations, especially randomised trials.
- Curiosity.
Employees quickly come to understand the culture of an organisation. Some entities encourage questioning, while others favour conformity. Fostering a culture of ‘why’ doesn’t come naturally for many managers. Questions can be perceived as time-consuming or distracting. Yet when managers make clear that they value new insights, they give permission for everyone in the organisation to question accepted wisdom and gather better evidence.
David Cowan, who has run multiple randomised trials of policing programs in Victoria, notes that managers can enable randomised trials by simply asking questions such as ‘do we know if it works?’ and ‘how could we find out?’. Strong organisations encourage the philosophy of what the UK Behavioural Insights Team famously dubbed ‘Test-Learn-Adapt’ (Haynes et al 2012).
- Simplicity.
Some of the most famous randomised trials are among the most complex. The Perry Preschool Project, the RAND Health Insurance Experiments and the Moving to Opportunity experiment cost millions of dollars and took many years. These trials have had a major impact on public policy, but a side-effect is that they have left some people with the mistaken impression that randomised trials must always be costly and time-consuming. Yet many experiments can be quick and simple.
Government officials charged with sending out letters, emails or text messages should have the functionality to send two versions, so they can continuously improve the language and messaging of their correspondence. This kind of A/B testing has been standard for market research companies for decades, yet remains rare in the public sector.
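As a rough illustration of how little machinery such a test needs, the sketch below randomly splits a mailing list between two letter versions and then compares response rates with a standard two-proportion z-test. The function names and the response counts in the usage example are hypothetical, and the test assumes samples large enough for the normal approximation to be reasonable.

```python
import math
import random


def assign_versions(recipients, seed=0):
    """Randomly split recipients into two equal-sized groups (version A, version B)."""
    rng = random.Random(seed)
    shuffled = list(recipients)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def two_proportion_z(success_a, n_a, success_b, n_b):
    """Return (difference in response rates, z statistic, two-sided p-value)."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Standard normal CDF written with math.erf, so no external libraries are needed.
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    return p_b - p_a, z, 2 * (1 - cdf)


if __name__ == "__main__":
    group_a, group_b = assign_versions(range(2000))
    # Hypothetical results: 200 of 1000 respond to letter A, 250 of 1000 to letter B.
    diff, z, p_value = two_proportion_z(200, len(group_a), 250, len(group_b))
    print(f"difference: {diff:.3f}, z: {z:.2f}, p-value: {p_value:.4f}")
```

The important step is the shuffle: because chance alone decides who receives which letter, any sizeable difference in response rates can be attributed to the wording rather than to the recipients.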
Another initiative is grant rounds to fund low-cost randomised trials. The Paul Ramsay Foundation, Australia’s largest charitable foundation, just issued a call for proposals for seven projects of up to A$300,000 to be randomly evaluated. A similar approach has been adopted for the past decade by the Laura and John Arnold Foundation, who have funded a swath of low-cost trials across social policy.[1]
- Ethical.
Subjecting randomised trials to appropriate ethical scrutiny isn’t just the right thing to do; it’s also important for creating an environment in which further trials can be conducted. Ethical scrutiny – commonly known in the United States as an Institutional Review Board process – ensures that the interests of vulnerable people are taken into account, and that the trial can be expected to improve overall wellbeing.
The ethical review process can also encourage researchers to think creatively about how to generate random variation in instances where they have strong prior beliefs that a program will be effective. For example, suppose a program thought to have a positive effect was being rolled out nationwide over a two-year period. A randomised trial might be conducted by randomising the regions that receive the program first. This is the approach taken by Karthik Muralidharan, Paul Niehaus and Sandip Sukhtankar (2023), who worked with the government of the Indian state of Jharkhand to randomise the order in which it introduced biometrically linked identity numbers across 132 sub-districts, covering 15 million people. This so-called ‘stepped wedge’ design produces rigorous causal impacts, while ensuring that everyone ends up with access to the program.
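The allocation step in a phased rollout of this kind can be sketched in a few lines. This is an illustration only, not the procedure used in the Jharkhand study: the function name, the region labels and the choice of three waves are assumptions. Regions are shuffled and dealt into waves, so chance alone determines who goes first, and every region is scheduled to receive the program.

```python
import random
from collections import Counter


def stepped_wedge_schedule(regions, n_waves, seed=0):
    """Randomly assign each region to a rollout wave (0 = first to receive the program)."""
    rng = random.Random(seed)
    order = list(regions)
    rng.shuffle(order)
    # Deal the shuffled regions into waves like cards, keeping wave sizes balanced.
    return {region: position % n_waves for position, region in enumerate(order)}


if __name__ == "__main__":
    # Hypothetical labels for 132 sub-districts, rolled out in three waves.
    sub_districts = [f"sub-district-{i}" for i in range(132)]
    schedule = stepped_wedge_schedule(sub_districts, n_waves=3)
    print(Counter(schedule.values()))
```

While the rollout is under way, regions in later waves serve as the comparison group for regions in earlier waves; once the final wave is complete, no region has missed out.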
Another approach is to evaluate the impact of a universal program by advertising it to a randomly selected group of eligible people, thereby inducing differences in take-up. Such an approach, known as an ‘encouragement design’, was used by Amy Finkelstein and Matthew Notowidigdo (2019), who promoted the Supplemental Nutrition Assistance Program to a group of people who were eligible but not enrolled. By inducing variation in take-up, they were able to show that the program had a positive causal impact on recipients’ health and income. From an ethical perspective, it is worth noting that the control group were not denied access to the program; they simply failed to receive the promotional materials.
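One common way to turn an encouragement design into an estimate of the program's effect is the Wald (instrumental variables) ratio: the difference in average outcomes between the encouraged and non-encouraged groups, divided by the difference in their take-up rates. The sketch below uses that standard estimator with made-up numbers; it is not a reconstruction of the analysis in the study cited above, and it rests on the assumption that the promotion affects outcomes only by changing enrolment.

```python
from statistics import mean


def wald_estimate(encouraged, not_encouraged):
    """Estimate the program's effect on those induced to enrol by the encouragement.

    Each argument is a list of (enrolled, outcome) pairs, where enrolled is 0 or 1.
    """
    # Effect of the encouragement on outcomes (the intention-to-treat effect).
    outcome_gap = mean(y for _, y in encouraged) - mean(y for _, y in not_encouraged)
    # Effect of the encouragement on take-up (the first stage).
    take_up_gap = mean(d for d, _ in encouraged) - mean(d for d, _ in not_encouraged)
    if take_up_gap == 0:
        raise ValueError("encouragement did not change take-up; effect is not identified")
    return outcome_gap / take_up_gap


if __name__ == "__main__":
    # Hypothetical data: enrolment adds 2 to an outcome that is otherwise 10.
    encouraged = [(1, 12), (1, 12), (0, 10), (0, 10)]      # half enrol
    not_encouraged = [(1, 12), (0, 10), (0, 10), (0, 10)]  # a quarter enrol
    print(wald_estimate(encouraged, not_encouraged))  # prints 2.0
```

In this toy example the encouragement lifts take-up by 25 percentage points and the average outcome by 0.5, which implies an effect of 2 for those whose enrolment was prompted by the promotion.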
- Institutional.
Having bodies that promote high-quality evaluation can help to provide toolkits, seminars and nudges to inform and entice decision-makers towards rigorous evaluation approaches. In the UK, the Behavioural Insights Team, the What Works Centres, the Evaluation Taskforce and the Magenta Book (HM Treasury 2020) have provided a powerful impetus towards improving the quality and quantity of rigorous evaluations. Last month, we saw new support for living evidence synthesis from the Wellcome Trust and Economic and Social Research Council, and a new collaboration between JBI, Cochrane and Campbell to build a truly global evidence ecosystem.
Last year, our government established the Australian Centre for Evaluation. Located within the Australian Treasury, the centre has a budget of around A$2 million per year, and a staff of around a dozen people. Its mandate is to ‘put evaluation evidence at the heart of policy design and decision-making’. The main goal of the centre is to work collaboratively with government departments to conduct rigorous evaluations, especially randomised trials. Several trials are presently underway, including experiments to improve the quality of employment services. The Australian Centre for Evaluation has also trained hundreds of public servants in evaluation and is presently preparing its first report on the state of evaluation across the public service.
Beyond centres, other institutional features that can help embed best practice evaluation include rules around the evidence that accompanies cabinet submissions and expectations about evaluation for programs with sunset clauses. Another way that evaluation can be made more routine is if it becomes part of the expected toolkit for public sector managers. Just imagine how much the volume of randomised policy trials might grow if public servants knew that they would eventually be asked in a promotion interview ‘tell me about a randomised trial you have conducted’.
- International.
A few years ago, when researching my book Randomistas (Leigh 2018), I met with a kidney health researcher whose work involved running large-scale randomised trials. He told me that he no longer worked on single-country trials. Multi-country trials, he told me, provided an inbuilt replication function, and greater assurance that interventions worked across people of different ethnicities.
Perhaps this approach has something to teach those of us seeking to better understand what works in policy. In their Global Evidence Report: A Blueprint for Better International Collaboration on Evidence, David Halpern and Deelan Maru advocate for countries to collaborate on Living Evidence Reviews – research syntheses on key topics such as homelessness, job training or policing. They argue that such a collaboration could begin with the UK, US, Canada and Australia. It would enable sharing of what works and what doesn’t, as well as a recognition of the evidence gaps. The report also discusses other collaboration opportunities, such as a shared evaluation fund and international public service professional networks.
Other international institutions might also be mobilised. In 2022, the Global Commission on Evidence report, for which David Halpern and I both served as commissioners, argued that the World Bank should devote a future World Development Report to how best to produce, share and use evidence (Global Commission on Evidence to Address Societal Challenges 2022). Other international institutions, including the OECD, could play a valuable role in encouraging more rigorous evaluation, and pointing out what economists have long known: that a well-conducted randomised trial that produces clear evidence can have a higher benefit-cost ratio than almost anything else we do in public policy.
So those are my five approaches to expanding randomised trials. Encourage curiosity in yourself and those you lead. Seek simple trials, especially in the beginning. Ensure experiments are ethically grounded. Foster institutions that push people towards more rigorous evaluation. And collaborate internationally to share best practice and identify evidence gaps.
James Lind’s scurvy trial was pathbreaking. Alas, his writeup left something to be desired. Six years after his experiment, Lind published the 456-page tome A Treatise of the Scurvy. His experimental results were spot-on, but Lind’s theoretical explanations for why citrus worked were hocus-pocus. The treatise was largely ignored.
Then, in the 1790s, a disciple of Lind, surgeon Gilbert Blane, was able to persuade senior naval officials that oranges and lemons could prevent scurvy. In 1795 – almost half a century after Lind’s findings – lemon juice was issued on demand; by 1799 it became part of the standard provisions. By the early 1800s British naval sailors were consuming 200,000 litres of lemon juice annually.
The British may have been slow to adopt Lind’s findings, but they were faster at curing scurvy than their main naval opponents. An end to scurvy was a key reason why the British, under the command of Admiral Lord Nelson, were able to maintain a sea blockade of France and ultimately win the 1805 Battle of Trafalgar against a larger force of scurvy-ridden French and Spanish ships. Nelson had clever tactics, but it helped that he didn’t have to fight while scurvy ravaged his crew.
So when you next find yourself looking up at Nelson’s Column in Trafalgar Square, spare a thought for James Lind, who showed us that a curious mind, conducting a simple randomised trial, really can change the course of history.
References
Attia P and Gifford B (2023) Outlive: The Science and Art of Longevity, Harmony, New York.
Banerjee, A.V., 2020. Field experiments and the practice of economics. American Economic Review, 110(7), pp.1937-1951.
Berecochea, John E and Dorothy R. Jaman, Time Served in Prison and Parole Outcome: An Experimental Study: Report, No. 2. Research Division, California Department of Corrections, 1981.
Bertrand, Marianne, Simeon Djankov, Rema Hanna & Sendhil Mullainathan, ‘Obtaining a driver’s license in India: An experimental approach to studying corruption’ Quarterly Journal of Economics, vol. 122, no. 4, 2007, pp. 1639–76.
Biddinger K, Emdin C, Haas M, Wang M, Hindy G, Ellinor P, Kathiresan S, Khera A, Aragam K (2022) ‘Association of Habitual Alcohol Intake With Risk of Cardiovascular Disease’ JAMA Network Open, 5(3), e223849-e223849.
Bowman, L., Weidinger, F., Albert, M.A., Fry, E.T., Pinto, F.J. and Clinical Trial Expert Group and ESC Patient Forum, 2023. Randomized Trials Fit for the 21st Century: A Joint Opinion From the European Society of Cardiology, American Heart Association, American College of Cardiology, and the World Heart Federation. Journal of the American College of Cardiology, 81(12), pp.1205-1210.
Buera, F.J., Kaboski, J.P. and Townsend, R.M., 2023. From micro to macro development. Journal of Economic Literature, 61(2), pp.471-503.
Coalition for Evidence-Based Policy, (2013) ‘Randomized controlled trials commissioned by the Institute of Education Sciences since 2002: How many found positive versus weak or no effects’, July.
Collins, R., Bowman, L., Landray, M. and Peto, R., 2020. The magic of randomization versus the myth of real-world evidence. New England Journal of Medicine, 382(7), pp.674-678.
Finkelstein, A. and Notowidigdo, M.J., 2019. Take-up and targeting: Experimental evidence from SNAP. Quarterly Journal of Economics, 134(3), pp.1505-1556.
Global Commission on Evidence to Address Societal Challenges. 2022. The Evidence Commission report: A wake-up call and path forward for decisionmakers, evidence intermediaries, and impact-oriented evidence producers. Hamilton: McMaster Health Forum.
Gonzalez-Navarro, Marco and Climent Quintana-Domeque, ‘Paving streets for the poor: Experimental analysis of infrastructure effects’, Review of Economics and Statistics, vol. 98, no. 2, 2016, pp. 254–67.
Gosnell, Greer K., John A. List and Robert Metcalfe, 2016, ‘A new approach to an age-old problem: Solving externalities by incenting workers directly’, NBER Working Paper No. 22316, Cambridge, MA: NBER.
Halpern, D. and Maru, D. 2024. Global Evidence Report: A Blueprint for Better International Collaboration on Evidence. Behavioural Insights Team, NESTA and ESRC, London.
Hay, M et al, 2014, ‘Clinical Development Success Rates for Investigational Drugs’, Nature Biotechnology, 32(1): 40-51.
Haynes, Laura, Owain Service, Ben Goldacre, and David Torgerson. 2012. Test, Learn, Adapt: Developing Public Policy with Randomised Controlled Trials, Behavioural Insights Team, Cabinet Office.
HM Treasury, 2020, Magenta Book: Central Government guidance on evaluation, HM Treasury, London.
Leigh, A. 2018. Randomistas: How Radical Researchers Changed Our World. Black Inc, Melbourne.
Manson, J.E., Crandall, C.J., Rossouw, J.E., Chlebowski, R.T., Anderson, G.L., Stefanick, M.L., Aragaki, A.K., Cauley, J.A., Wells, G.L., LaCroix, A.Z. and Thomson, C.A., 2024. The Women’s Health Initiative Randomized Trials and Clinical Practice: A Review. JAMA, 331(20), pp.1748-1760.
Muralidharan, K., Niehaus, P. and Sukhtankar, S., 2023. Identity verification standards in welfare programs: Experimental evidence from India. Review of Economics and Statistics, pp.1-46.
Roosevelt, F.D. 1932. ‘The New Deal’, Oglethorpe University Address, 22 May.
Sherman, L. W. and Weisburd, D. 1995. General deterrent effects of police patrol in crime hot spots: A randomized controlled trial. Justice Quarterly, vol. 12, pp. 625-648.
Stampfer, M.J., Willett, W.C., Colditz, G.A., Rosner, B., Speizer, F.E. and Hennekens, C.H., 1985. A prospective study of postmenopausal estrogen therapy and coronary heart disease. New England Journal of Medicine, 313(17), pp.1044-1049.
Thomke, S, 2013, ‘Unlocking Innovation Through Business Experimentation’, European Business Review, 10 March.
Winzar C, Tofts‑Len S and Corpuz E (2023) Disrupting Disadvantage 3: Finding What Works, CEDA, Melbourne.
Zhao J, Stockwell T, Naimi T, Churchill S, Clay J, Sherk A (2023) ‘Association Between Daily Alcohol Intake and Risk of All-Cause Mortality: A Systematic Review and Meta-analyses’ JAMA Network Open, 6(3), e236185-e236185.
[1] Low-cost trials need not be low-impact trials. Jon Baron, president of the Coalition for Evidence Based Policy, notes that most of the low-cost randomised trials that the Arnold Foundation funds are of more substantial interventions, and often include multi-year follow-up. Baron gives the example of a college advising program that cost US$4000 per student, had a large multi-site sample (2400 students) and a six-year follow-up. Yet the research component of the study (not including the cost of program delivery, which would otherwise have been delivered in a non-randomised way) cost just US$200,000. The low cost was achieved by using a randomised lottery to allocate program slots (since the program was oversubscribed) and measuring all outcomes with administrative data such as college enrolment and completion.