25 March 2026

AI and the future of policy evaluation

At PUBLIC, we care deeply about evaluating the impact of technology projects, products and services. We don’t think this should be treated as an afterthought: it should be a key part of the delivery of any digital project, from the beginning.

Two years ago, PUBLIC published our first guidebook on evaluating digital projects in the public sector. We covered the fundamentals, such as Theories of Change, method selection, and how to evaluate technology tools in fast-changing environments while complying with the Treasury’s guidance (the Green and Magenta Books). We gave the example of a fictional AI consultation tool - ‘CivicAI’ - to bring our approach to life. Shortly afterwards, i.AI announced that they had been building a similar tool - ‘Consult’.

We then published a second edition, going deeper into economic evaluation methods. This guide focused on translating traditional economic frameworks into practical tools for digital and data teams. Alongside these guides, we have now conducted independent evaluations of more than 20 UK Government digital projects, working with the Evaluation Task Force, MHCLG, DfT, MOJ, Companies House, and others.

We are now working on our third edition. This will explore the role of AI in monitoring and evaluation, and how evaluation teams can safely and practically start using these tools.

How can AI support evaluation?

We think that there is a significant gap between what is now technically possible with AI in evaluation and what is actually happening. A recent OECD report on AI adoption in government found that, of 200 AI use cases across member states, evaluation accounted for the fewest, with just five.

To support the effective evaluation of policy and public sector projects, there is an opportunity for evaluation practitioners to harness the productivity gains of AI whilst building the guardrails necessary to manage it safely and securely.

The most important distinction for evaluators using AI is between a standard large language model (LLM) and an AI agent. An LLM, the kind of AI behind ChatGPT or Claude, is well suited to one-off tasks: “summarise this document”, “help draft this section”, “explain this method”. An agent goes further. It can execute multi-step workflows autonomously: reading files, writing and running code, checking its own outputs, iterating. 
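
To make the distinction concrete, here is a minimal, illustrative sketch in Python. Nothing in it is a real API - call_llm is a hypothetical stand-in for any LLM service, returning a canned reply so the example runs as written - but it captures the structural difference: a single round trip versus a plan-act-check loop.

```python
# Illustrative sketch only: call_llm() is a hypothetical stand-in for any
# LLM API, returning a canned reply so this example runs as written.

def call_llm(prompt: str) -> str:
    """Stand-in for a real LLM call: one prompt in, one response out."""
    return "FINISH: illustrative canned response"

# A standard LLM interaction is a single round trip:
summary = call_llm("Summarise this evaluation report: ...")

# An agent wraps the model in a loop: it chooses an action, uses tools
# (reading files, running code), observes the result, and iterates.
def run_agent(task: str, tools: dict, max_steps: int = 10) -> str:
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        action = call_llm("\n".join(history) + "\nNext action?")
        if action.startswith("FINISH"):
            return action  # the agent decides its own stopping point
        tool_name, _, tool_input = action.partition(":")
        result = tools[tool_name.strip()](tool_input.strip())
        history.append(f"Did: {action}\nSaw: {result}")  # feed results back
    return "Stopped: step budget exhausted"

print(run_agent("Clean the survey data and run the regression", tools={}))
```

The loop is where both the power and the risk live: every extra autonomous step compounds the usefulness, and also the chance of an unchecked error.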

For M&E, which is inherently a multi-step process, agentic capabilities open up new and powerful possibilities. AI offers immense potential to make evaluations quicker and more robust, but we must remain highly discerning about what we automate and what we keep as human tasks.

Where are we already seeing the benefits of AI? 

Across the M&E lifecycle, we are already seeing real, deployable applications of AI tools.

  • AI-assisted evidence synthesis is probably the most mature area. Tools can now search, screen, and summarise bodies of literature at a scale that would take human teams weeks. For evaluation teams scoping a new programme area, or exploring what adjacent fields can tell them about their topic, this is genuinely useful today (a minimal sketch of the screening step follows this list).
A recent example is InsightAgent, a multi-agent framework designed for complex systematic reviews. Researchers demonstrated that the tool could partition a large body of literature, read and synthesise findings, and draft a rigorous review in just 1.5 hours - a process that traditionally takes months to complete manually. Researchers could also visually monitor the AI’s reading trajectory, adjust its inclusion criteria, and verify its sources in real time.

  • AI-led qualitative interviews - including voice - have been shown to generate substantially richer responses than conventional open text fields. For public sector evaluations, the possibility of running qualitative research at a fraction of the cost is a meaningful shift. These approaches are also well suited to contexts with multiple layers of governance - such as evaluation framework development and qualitative evaluations of ‘unmonetisable’ outcomes, as set out in the Green Book.
For example, PUBLIC recently used Salomo to conduct user research for a major public sector project. Traditionally, gathering and synthesising user research at this scale would take a team of researchers many months. By leveraging Salomo’s agentic capabilities, a team of just two researchers was able to process, code, and extract insights from 100 interviews in less than a week.

  • Getting to concrete outputs and models more quickly. Analysis and reporting workflows are starting to allow evaluators to go from a research question to a documented, reproducible output - with code, findings, and visualisations - in a fraction of the time previously required.
For example, AI Scientist-V2 is a system capable of automating the scientific research lifecycle. Given a high-level prompt, the agent autonomously formulates hypotheses, writes and debugs experiment code, visualises data, and drafts a complete manuscript in under 15 hours. It also recently produced a research paper that passed double-blind peer review.
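
To ground the first of these examples, here is a minimal sketch of what an LLM-assisted screening step in evidence synthesis can look like. It is an illustration, not the InsightAgent pipeline: call_llm is a hypothetical stand-in, and the criteria and abstracts are invented.

```python
# Illustrative sketch of LLM-assisted screening for a literature review.
# call_llm() is a hypothetical stand-in returning a canned reply; in
# practice a human would also double-screen a sample of the decisions.

INCLUSION_CRITERIA = (
    "Include a study only if it (1) evaluates a digital public service, "
    "(2) reports outcome data, and (3) was published after 2018."
)

def call_llm(prompt: str) -> str:
    return "EXCLUDE: no outcome data reported"  # canned reply for illustration

def screen_abstract(abstract: str) -> tuple[bool, str]:
    """Ask the model for an include/exclude decision plus a stated reason."""
    reply = call_llm(
        f"{INCLUSION_CRITERIA}\nAbstract:\n{abstract}\n"
        "Answer 'INCLUDE: reason' or 'EXCLUDE: reason'."
    )
    decision, _, reason = reply.partition(":")
    return decision.strip().upper() == "INCLUDE", reason.strip()

abstracts = ["...abstract 1...", "...abstract 2..."]
decisions = [screen_abstract(a) for a in abstracts]
# Keep the stated reasons: they are what a human reviewer audits later.
```

The design choice that matters is the stated reason: forcing the model to justify each decision gives human reviewers something auditable, which is what separates a usable screening tool from a black box.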

While public sector policy evaluation has its own unique complexities and stakeholder dynamics, the implication is clear: these tools can handle the heavy mechanical execution - running the econometrics, generating charts, and drafting technical annexes - freeing up evaluators to focus on the harder interpretive questions and policy implications.

Finally, two recent projects give a sense of what the cutting edge looks like. The Autonomous Policy Evaluation project at the University of Zurich is building systems that generate research questions, design evaluations, and run replications - not as a curiosity, but as a serious research programme. And Claude Blattman, built by Chicago political economist Chris Blattman, is a practical guide that helps researchers, including those with no technical background, start using AI agents to improve and accelerate their work. Both projects are noteworthy not because they represent where most evaluation teams are, but because they show where the frontier already is.

How should we think about the risks of AI?

Conversations about AI often focus on its current risks and limitations - hallucinations, data privacy, and the risk of errors going undetected. However, there is a less-discussed risk that deserves more attention, particularly for evaluation teams: the quiet erosion of human judgement through over-delegation. This is a risk we’re keen to dive into, and one we want to help evaluation teams design against.

For example, a randomised experiment by researchers at Anthropic found that software engineers who used AI assistance scored significantly lower on skill assessments than those who did not. This was not because the AI made them worse, but because certain patterns of use prevented them from developing the judgement they needed to catch errors in AI outputs.

Unlike structured operational tasks, where a failure is evident, the risks of deskilling in evaluation are often invisible to a non-practitioner. A subtly flawed estimate or poor qualitative coding by an AI agent will still produce a convincing, authoritative-looking report. Complex impact evaluations or financial models can be built with subtle, hard-to-spot errors, especially when overseen by non-technical teams.

We want to support evaluation teams through these challenges: for example, by using multi-agent review and flagging systems, designing effective ‘spot-check’ approaches, and building methodologies that ensure an appropriate level of human sampling in the research. While the evaluation and research sector gets to grips with AI, we will help produce step-by-step frameworks for teams to use AI tools safely, ensuring high standards of research output.
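
As one illustration of what a spot-check methodology could look like, the sketch below sizes a human review sample using a standard binomial zero-failure bound. It assumes errors are independent and the sample is drawn at random - real designs would refine this, for instance by stratifying across topics or oversampling high-stakes outputs.

```python
import math
import random

def spot_check_sample_size(max_error_rate: float, confidence: float = 0.95) -> int:
    """Smallest n such that finding zero errors in a random sample of n
    AI-produced outputs supports the claim that the true error rate is
    below max_error_rate, at the stated confidence (binomial bound)."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - max_error_rate))

# Example: to be 95% confident that fewer than 5% of 1,000 AI-coded
# interview excerpts are mis-coded, a human needs to re-check 59 of them.
n = spot_check_sample_size(max_error_rate=0.05)
excerpts = [f"excerpt_{i}" for i in range(1000)]
to_review = random.sample(excerpts, n)  # random, never cherry-picked
print(f"Review {n} items; if all are correct, sign off at 95% confidence.")
```

If errors do turn up in the sample, the right response is not to quietly fix them but to widen the sample and investigate the failure mode.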

Our next evaluation guidebook

Our first two guidebooks have proven very popular with evaluation teams, digital teams, suppliers and others who care about digital impact. They have been cited as best practice in ITTs and other government documents.

Our third guidebook will tackle the latest landmark challenge for the sector: exploiting the clear technical benefits of AI to drive higher-quality, more efficient and more adaptive work, while avoiding the risks posed by increased AI use - especially the erosion of human judgement in a field where judgement is everything.

If you are working in evaluation, and want to talk through how any of this applies to your work, we would welcome the conversation. We’re also keen to hear from anyone who might want to feed into this edition of the guide, to help shape our thinking. Get in touch at johnny@public.io.

This post was written by PUBLIC's specialist evaluation team. For more on our evaluation work, see our guidebooks on evaluating digital projects at evaluation.public.io.

Johnny Hugill

Co-Managing Director

Angelo Leone

Associate

Lucas Pollock Muguruza

Associate
