Department for Transport shows how its AI system avoids bias
The UK Department for Transport (DfT) has worked with Google Cloud and the Alan Turing Institute to build the Consultation Analysis Tool (CAT) to analyse citizen feedback from public consultations.
A report published in December 2025 by the Alan Turing Institute notes that the project is part of DfT’s goal to use artificial intelligence (AI) tools to deliver greater efficiency in the department. The CAT tool provides thematic analysis of public consultation feedback, where free text from citizen submissions is mapped onto particular themes using large language models (LLMs).
The report’s authors point out that although it is relatively easy to use LLMs to conduct thematic analysis, “designing systems that align with human preferences, have an appropriate level of human oversight, and have a robust performance evaluation framework is more complex”.
Among the areas the team focused on is demographic bias. The report states that while CAT does not explicitly use demographic variables in any of the LLM prompts, “an LLM may perform worse on responses that are written in poor English or use socio-culturally specific language such as verbosity or slang”.
Given that citizens self-select to participate in public consultations, the report’s authors said: “We decided it was particularly important to invest scarce human resources into assuring the accuracy and quality of the theme generation step.”
They said that having a human in the loop ensures potential AI errors or misinterpretations are identified, and keeps human judgment central to understanding public input. “Our approach formally integrates human oversight in the theme review step and at the analysis and report-writing stage, where users interrogate the CAT-enabled analysis and select representative quotations,” they added.
The CAT uses an LLM pipeline to map each individual response provided in a public consultation to a human-validated theme. The mapping process uses what is known as a majority-vote system, where different LLMs are asked to classify a given response in the public consultation submission to a theme. The theme is only assigned to a response if a majority of LLMs agree on the same classification. This is often referred to as LLM-as-a-judge. According to the report’s authors, the process creates a comprehensive mapping between responses and themes.
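The majority-vote idea can be illustrated with a minimal Python sketch. It is not the CAT's actual implementation: the function name `majority_vote_theme`, the theme labels and the keyword-based stand-in classifiers are all hypothetical, and a real pipeline would call LLMs where the stand-ins are used here.

```python
from collections import Counter
from typing import Callable, Optional, Sequence

# Hypothetical type: a classifier takes response text and returns one theme label.
Classifier = Callable[[str], str]


def majority_vote_theme(response: str, classifiers: Sequence[Classifier]) -> Optional[str]:
    """Return the theme that a strict majority of classifiers agree on, else None."""
    votes = Counter(clf(response) for clf in classifiers)
    if not votes:
        return None  # no classifiers supplied, so nothing to vote on
    theme, count = votes.most_common(1)[0]
    # Assign the theme only when more than half of the classifiers chose it.
    return theme if count > len(classifiers) / 2 else None


# Stand-in classifiers (simple keyword rules) used in place of real LLM calls.
def clf_a(text: str) -> str:
    return "road safety" if "speed" in text.lower() else "other"


def clf_b(text: str) -> str:
    lowered = text.lower()
    return "road safety" if "safety" in lowered or "speed" in lowered else "other"


def clf_c(text: str) -> str:
    return "cost" if "expensive" in text.lower() else "road safety"


print(majority_vote_theme("Speed limits near schools are too high", [clf_a, clf_b, clf_c]))
```

In this sketch a response with no majority theme returns `None`, leaving it unassigned rather than forcing a label.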
While the report states that the CAT was systematically less accurate at mapping themes to responses for specific demographic groups, it also noted that the CAT’s design includes several safeguards to mitigate bias, including the exclusion of demographic variables from prompts and the human-in-the-loop review of all CAT-generated themes.
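A per-group accuracy comparison of this kind could be sketched as follows. This is an illustrative assumption, not the evaluation method used in the report: the function `accuracy_by_group`, the group names and the sample records are hypothetical, and demographic labels appear only in the evaluation data, never in a prompt.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical record: (demographic group, theme assigned by the tool, theme assigned by a human).
Record = Tuple[str, str, str]


def accuracy_by_group(records: Iterable[Record]) -> Dict[str, float]:
    """Share of responses per group where the tool's theme matches the human label."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for group, predicted, human in records:
        total[group] += 1
        correct[group] += int(predicted == human)
    return {group: correct[group] / total[group] for group in total}


# Made-up sample data for illustration only.
sample = [
    ("group_a", "cost", "cost"),
    ("group_a", "safety", "safety"),
    ("group_b", "cost", "safety"),
    ("group_b", "safety", "safety"),
]

print(accuracy_by_group(sample))
```

A persistent gap between groups in such a table would be the kind of signal that prompts closer human review.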
The report’s authors said: “The human-in-the-loop theme review process ensures that the probability of extracting all ‘true’ significant themes across the dataset approaches 100% with human review, which is how the CAT is used in practice.”
CAT is built on Google’s Vertex AI platform and uses Gemini models. According to DfT, it is capable of identifying and categorising themes from public feedback in just a few hours – a process that typically used to take months. It has already been used to support the analysis of public responses to the Integrated National Transport Strategy and improve driving test booking rules.

