25 March 2026

AI and the future of policy evaluation

At PUBLIC, we care deeply about evaluating the impact of technology projects, products and services. We don’t think this should be treated as an afterthought: it should be a key part of the delivery of any digital project, from the beginning.

Two years ago, PUBLIC published our first guidebook on evaluating digital projects in the public sector. We covered the fundamentals, such as Theories of Change, method selection, and how to evaluate technology tools in fast-changing environments while complying with the Treasury’s guidance (the Green and Magenta Books). We gave the example of a fictional AI consultation tool - ‘CivicAI’ - to bring our approach to life. Shortly afterwards, i.AI announced that they had been building a similar tool - ‘Consult’.

We then published a second edition, going deeper into economic evaluation methods. This guide focused on translating traditional economic frameworks into practical tools for digital and data teams. Alongside these guides, we have now conducted independent evaluations of more than 20 UK Government digital projects, working with the Evaluation Task Force, MHCLG, DfT, MOJ, Companies House, and others.

We are now working on our third edition. This will explore the role of AI in monitoring and evaluation, and how evaluation teams can safely and practically start using these tools.

How can AI support evaluation?

We think that there is a significant gap between what is now technically possible with AI in evaluation and what is actually happening. A recent OECD report on AI adoption in government found that, of 200 AI use cases across member states, evaluation accounted for the fewest, with just five.

To support the effective evaluation of policy and public sector projects, there is an opportunity for evaluation practitioners to harness the productivity gains of AI whilst building the guardrails necessary to manage it safely and securely.

The most important distinction for evaluators using AI is between a standard large language model (LLM) and an AI agent. An LLM, the kind of AI behind ChatGPT or Claude, is well suited to one-off tasks: “summarise this document”, “help draft this section”, “explain this method”. An agent goes further. It can execute multi-step workflows autonomously: reading files, writing and running code, checking its own outputs, iterating. 
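
To make the distinction concrete, here is a minimal, illustrative sketch in Python. Nothing in it is a real API - call_llm is a hypothetical stand-in for any LLM service, returning a canned reply so the example runs as written - but it captures the structural difference: a single round trip versus a plan-act-check loop.

```python
# Illustrative sketch only: call_llm() is a hypothetical stand-in for any
# LLM API, returning a canned reply so this example runs as written.

def call_llm(prompt: str) -> str:
    """Stand-in for a real LLM call: one prompt in, one response out."""
    return "FINISH: illustrative canned response"

# A standard LLM interaction is a single round trip:
summary = call_llm("Summarise this evaluation report: ...")

# An agent wraps the model in a loop: it chooses an action, uses tools
# (reading files, running code), observes the result, and iterates.
def run_agent(task: str, tools: dict, max_steps: int = 10) -> str:
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        action = call_llm("\n".join(history) + "\nNext action?")
        if action.startswith("FINISH"):
            return action  # the agent decides its own stopping point
        tool_name, _, tool_input = action.partition(":")
        result = tools[tool_name.strip()](tool_input.strip())
        history.append(f"Did: {action}\nSaw: {result}")  # feed results back
    return "Stopped: step budget exhausted"

print(run_agent("Clean the survey data and run the regression", tools={}))
```

The loop is where both the power and the risk live: every extra autonomous step compounds the usefulness, and also the chance of an unchecked error.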

For M&E, which is inherently a multi-step process, agentic capabilities open up new and powerful possibilities. AI offers immense potential to make evaluations quicker and more robust, but we must remain highly discerning about what we automate and what we keep as human tasks.

Where are we already seeing the benefits of AI? 

Across the M&E lifecycle, we are already seeing real, deployable applications of AI tools.

  • AI-assisted evidence synthesis is probably the most mature area. Tools can now search, screen, and summarise bodies of literature at a scale that would take human teams weeks. For evaluation teams scoping a new programme area, or exploring what adjacent fields can tell them about their topic, this is genuinely useful today (a minimal sketch of the screening step follows this list).
A recent example is InsightAgent, a multi-agent framework designed for complex systematic reviews. Researchers demonstrated that the tool could partition a large body of literature, read and synthesise findings, and draft a rigorous review in just 1.5 hours - a process that traditionally takes months to complete manually. Researchers could also visually monitor the AI’s reading trajectory, adjust its inclusion criteria, and verify its sources in real time.

  • AI-led qualitative interviews - including voice - have been shown to generate substantially richer responses than conventional open text fields. For public sector evaluations, the possibility of running qualitative research at a fraction of the cost is a meaningful shift. These approaches are also well suited to contexts with multiple layers of governance - such as evaluation framework development and qualitative evaluations of ‘unmonetisable’ outcomes, as set out in the Green Book.
For example, PUBLIC recently used Salomo to conduct user research for a major public sector project. Traditionally, gathering and synthesising user research at this scale would take a team of researchers many months. By leveraging Salomo’s agentic capabilities, a team of just two researchers was able to process, code, and extract insights from 100 interviews in less than a week.

  • Getting to concrete outputs and models more quickly. Analysis and reporting workflows are starting to allow evaluators to go from a research question to a documented, reproducible output - with code, findings, and visualisations - in a fraction of the time previously required.
For example, AI Scientist-V2 is a system capable of automating the scientific research lifecycle. Given a high-level prompt, the agent autonomously formulates hypotheses, writes and debugs experiment code, visualises data, and drafts a complete manuscript in under 15 hours. It also recently produced a research paper that passed double-blind peer review.
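
To ground the first of these examples, here is a minimal sketch of what an LLM-assisted screening step in evidence synthesis can look like. It is an illustration, not the InsightAgent pipeline: call_llm is a hypothetical stand-in, and the criteria and abstracts are invented.

```python
# Illustrative sketch of LLM-assisted screening for a literature review.
# call_llm() is a hypothetical stand-in returning a canned reply; in
# practice a human would also double-screen a sample of the decisions.

INCLUSION_CRITERIA = (
    "Include a study only if it (1) evaluates a digital public service, "
    "(2) reports outcome data, and (3) was published after 2018."
)

def call_llm(prompt: str) -> str:
    return "EXCLUDE: no outcome data reported"  # canned reply for illustration

def screen_abstract(abstract: str) -> tuple[bool, str]:
    """Ask the model for an include/exclude decision plus a stated reason."""
    reply = call_llm(
        f"{INCLUSION_CRITERIA}\nAbstract:\n{abstract}\n"
        "Answer 'INCLUDE: reason' or 'EXCLUDE: reason'."
    )
    decision, _, reason = reply.partition(":")
    return decision.strip().upper() == "INCLUDE", reason.strip()

abstracts = ["...abstract 1...", "...abstract 2..."]
decisions = [screen_abstract(a) for a in abstracts]
# Keep the stated reasons: they are what a human reviewer audits later.
```

The design choice that matters is the stated reason: forcing the model to justify each decision gives human reviewers something auditable, which is what separates a usable screening tool from a black box.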

While public sector policy evaluation has its own unique complexities and stakeholder dynamics, the implication is clear: these tools can handle the heavy mechanical execution - running the econometrics, generating charts, and drafting technical annexes - freeing up evaluators to focus on the harder interpretive questions and policy implications.

Finally, two recent projects give a sense of what the cutting edge looks like. The Autonomous Policy Evaluation project at the University of Zurich is building systems that generate research questions, design evaluations, and run replications - not as a curiosity, but as a serious research programme. And Claude Blattman, built by Chicago political economist Chris Blattman, is a practical guide that helps researchers, including those with no technical background, start using AI agents to improve and accelerate their work. Both projects are noteworthy not because they represent where most evaluation teams are, but because they show where the frontier already is.

How should we think about the risks of AI?

Conversations about AI often focus on its current risks and limitations - hallucinations, data privacy, and the risk of errors going undetected. However, there is a less-discussed risk that deserves more attention, particularly for evaluation teams: the quiet erosion of human judgement through over-delegation. This is a risk we’re keen to dive into, and one we want to help evaluation teams design against.

For example, a randomised experiment by researchers at Anthropic found that software engineers who used AI assistance scored significantly lower on skill assessments than those who did not. This was not because the AI made them worse, but because certain patterns of use prevented them from developing the judgement they needed to catch errors in AI outputs.

Unlike structured operational tasks, where a failure is evident, the risks of deskilling in evaluation are often invisible to a non-practitioner. A subtly flawed estimate or poor qualitative coding by an AI agent will still produce a convincing, authoritative-looking report. Complex impact evaluations or financial models can be built with subtle, hard-to-spot errors, especially when overseen by non-technical teams.

We want to support evaluation teams through these challenges: for example, by using multi-agent review and flagging systems, designing effective ‘spot-check’ approaches, and building methodologies that ensure an appropriate level of human sampling in the research. While the evaluation and research sector gets to grips with AI, we will help produce step-by-step frameworks for teams to use AI tools safely, ensuring high standards of research output.
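
As one illustration of what a spot-check methodology could look like, the sketch below sizes a human review sample using a standard binomial zero-failure bound. It assumes errors are independent and the sample is drawn at random - real designs would refine this, for instance by stratifying across topics or oversampling high-stakes outputs.

```python
import math
import random

def spot_check_sample_size(max_error_rate: float, confidence: float = 0.95) -> int:
    """Smallest n such that finding zero errors in a random sample of n
    AI-produced outputs supports the claim that the true error rate is
    below max_error_rate, at the stated confidence (binomial bound)."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - max_error_rate))

# Example: to be 95% confident that fewer than 5% of 1,000 AI-coded
# interview excerpts are mis-coded, a human needs to re-check 59 of them.
n = spot_check_sample_size(max_error_rate=0.05)
excerpts = [f"excerpt_{i}" for i in range(1000)]
to_review = random.sample(excerpts, n)  # random, never cherry-picked
print(f"Review {n} items; if all are correct, sign off at 95% confidence.")
```

If errors do turn up in the sample, the right response is not to quietly fix them but to widen the sample and investigate the failure mode.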

Our next evaluation guidebook

Our first two guidebooks have proven very popular with evaluation teams, digital teams, suppliers and others who care about digital impact. They have been cited as best practice in ITTs and other government documents.

Our third guidebook will tackle the latest landmark challenge for the sector: exploiting the clear technical benefits of AI to drive higher-quality, more efficient and more adaptive work, while avoiding the risks posed by increased AI use - especially the erosion of human judgement in a field where judgement is everything.

If you are working in evaluation, and want to talk through how any of this applies to your work, we would welcome the conversation. We’re also keen to hear from anyone who might want to feed into this edition of the guide, to help shape our thinking. Get in touch at johnny@public.io.

This post was written by PUBLIC's specialist evaluation team. For more on our evaluation work, see our guidebooks on evaluating digital projects at evaluation.public.io.

Johnny Hugill

Co-Managing Director

Angelo Leone

Associate

Lucas Pollock Muguruza

Associate
