
Perspective
Published: 18 February 2026

Julia Haas (ORCID: orcid.org/0000-0003-2330-1132)1, Sophie Bridgers1 na1, Arianna Manzini1 na1, Benjamin Henke2,3 na1, Joshua May (ORCID: orcid.org/0000-0001-8604-479X)4, Sydney Levine5,6, Laura Weidinger1, Murray Shanahan1, Kristian Lum5, Iason Gabriel (ORCID: orcid.org/0000-0002-7552-4576)1 & William Isaac1

Nature volume 650, pages 565–573 (2026)

Abstract
The question of whether large language models (LLMs) can exhibit moral capabilities is of growing interest and urgency, as these systems are deployed in sensitive roles such as companionship and medical advising, and will increasingly be tasked with making decisions and taking actions on behalf of humans. These trends require moving beyond evaluating for mere moral performance, the ability to produce morally appropriate outputs, to evaluating for moral competence, the ability to produce morally appropriate outputs based on morally relevant considerations. Assessing moral competence is critical for predicting future model behaviour, establishing appropriate public trust and justifying moral attributions. However, both the unique architectures of LLMs and the complexity of morality itself introduce fundamental challenges. Here we identify three such challenges: the facsimile problem, whereby models may imitate reasoning without genuine understanding; moral multidimensionality, whereby moral decisions are influenced by a range of context-sensitive relevant moral and non-moral considerations; and moral pluralism, which demands a new standard for globally deployed artificial intelligence. We provide a roadmap for tackling these challenges, advocating for a suite of adversarial and confirmatory evaluations that will enable us to work towards a more scientifically grounded understanding and, in turn, a more responsible attribution of moral competence to LLMs.

Main
There is considerable scientific and public interest in whether LLMs exhibit moral capabilities1,2,3,4,5,6,7,8,9,10,11,12,13,14,15. This interest is fuelled by studies showing strong LLM performance on moral reasoning tasks16 and that LLMs are perceived to be superior to humans in moral reasoning “along almost all dimensions”17,18. A key question in this domain is whether LLMs exhibit moral competence; that is, whether they generate appropriate moral outputs by recognizing and appropriately integrating relevant moral considerations, rather than merely producing morally appropriate outputs15,19,20.

The widespread deployment of LLMs requires assessment of their moral competence, rather than their mere moral performance or people’s perceptions of moral competence (Box 1). These systems are increasingly used for roles such as companionship21, therapy22 and providing medical advice23. Moreover, LLM adoption is projected to expand considerably in the coming years24, with these systems increasingly powering capable artificial intelligence (AI) agents that take actions on behalf of humans25,26,27. These trends, coupled with evidence that LLMs reliably influence human decision-making and judgements28,29,30, indicate the growing impact of LLMs in the moral domain.

While questions pertaining to machine morality are not new (Box 2), LLMs introduce substantial challenges to the field.
Specifically, their distinctive architectures and emergent capabilities, combined with the inherent complexities of moral decision-making, pose several fundamental challenges for understanding and evaluating the moral competence of LLMs. We identify three such challenges: the facsimile problem, moral multidimensionality and the problem of pluralism in LLMs. Current assessment methods cannot fully address these challenges. However, progress is possible. For each challenge, we explore avenues for making headway and advocate for a suite of adversarial and confirmatory evaluation methods that can provide traction towards understanding and responsibly attributing moral competence in LLMs. Our overarching aim is to promote more robust and scientific evaluation standards, anticipating the need to equip civic stakeholders with evidence-based assessments on the basis of which to make informed recommendations.

The facsimile problem
A fundamental challenge in cognitive science is inferring a system’s unobserved, causal mechanisms from its observable behaviours31. This difficulty is compounded when evaluating moral competence in LLMs, as their distinctive architectures and training make it difficult to discern whether their outputs, even reliably acceptable outputs, rely on genuine moral reasoning or a mere facsimile process. We discuss how this problem arises and canvass possible methods for addressing it, arguing in particular for an adversarial, disconfirming approach.

Defining the problem
One might expect that certain types of computational process should be structurally analogous to the problem they solve. For example, when one calculates ‘34 + 76 =’, the underlying computational process should be structurally analogous to the operation of addition, as it would be if carried out by a provably correct underlying hardware operation that adds two binary sequences together. Among other explanatory features, such structural correspondence would generate confidence that the system can generalize to novel addition problems. However, LLM architectures do not guarantee this structural correspondence.

LLMs are learned generative models of the distribution of tokens—such as words, parts of words and punctuation marks—in a large corpus of human text (Fig. 1). Their central task is to predict the probable next token, given a sequence of prior tokens. More precisely, a model outputs a vector representing a probability distribution over next tokens given the input tokens. In everyday application, LLMs are used to generate a completion or continuation of a given sequence of tokens by repeatedly sampling this next-token distribution and appending the resulting token to extend the sequence32. This process is known as autoregressive sampling. Some recent models also generate reasoning traces (sometimes referred to as thinking) and output these traces along with their final response, putatively representing the steps taken to arrive at this response33,34,35,36,37,38.
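To make the autoregressive loop described above concrete, the following is a minimal illustrative sketch, not the implementation of any particular model: next_token_distribution is a hypothetical stand-in for a trained transformer that maps the tokens seen so far to a probability distribution over a (here, toy) vocabulary.

import numpy as np

# Toy vocabulary; a real LLM has tens of thousands of (sub)word tokens.
VOCAB = ["34", "+", "76", "=", "110", "100", "<eos>"]

def next_token_distribution(tokens):
    """Hypothetical stand-in for a trained LLM: in a real system, a
    transformer maps `tokens` to logits over a large vocabulary."""
    logits = np.zeros(len(VOCAB))        # placeholder: uniform scores
    exp = np.exp(logits - logits.max())  # softmax (numerically stable)
    return exp / exp.sum()

def generate(prompt_tokens, max_new_tokens=5, temperature=1.0, rng=None):
    """Autoregressive sampling: repeatedly sample the next token and
    append it to the sequence, as described in the text."""
    rng = rng or np.random.default_rng(0)
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = next_token_distribution(tokens)
        # Rescaling probabilities by 1/temperature and renormalizing is
        # equivalent to applying temperature to the logits before softmax.
        probs = probs ** (1.0 / temperature)
        probs /= probs.sum()
        token = VOCAB[rng.choice(len(VOCAB), p=probs)]
        tokens.append(token)
        if token == "<eos>":             # stop at end-of-sequence token
            break
    return tokens

print(generate(["34", "+", "76", "="]))

Nothing in this loop constrains how next_token_distribution arrives at its probabilities, which is the gap the facsimile problem points to: memorized strings, shallow heuristics and genuinely structured computation are indistinguishable at the level of sampled outputs.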
Fig. 1: Basic LLM architecture and fine-tuning.
a, Trained on massive datasets to minimize prediction error, transformer-based LLMs are typically sampled autoregressively: the model predicts the next token, and this predicted token is then fed back into the sequence to predict the subsequent token, and so on71. Adapted from ref. 71, Springer Nature Ltd.
b, Supervised fine-tuning transforms a pretrained LLM into a usable and instruction-following system. This stage involves training on specific datasets that not only enhance general task performance, but also include normative content such as safety guidelines, examples of offensive content and labelled moral scenarios84.
c, RL*F is used to further align LLM behaviour with the desired outcomes. Reinforcement learning from human feedback uses human evaluations to shape the model’s reward function. Reinforcement learning from computer feedback uses automated systems or metrics to provide feedback signals52,53. RL*F first uses preferences between different model outputs to train a reward model (RM) that predicts what humans like. Then, this reward model guides the fine-tuning of the language model, leading it to generate responses that score highly according to those preferences.
d, Deployed systems also typically use a multistage pipeline of prompt rewrites and response filters, iteratively refined with human-in-the-loop feedback, to balance safety, quality and utility in generated content.

As LLMs sample from a probability distribution over next tokens given input tokens, rather than by using dedicated reasoning modules or structured, symbolic reasoning, it is difficult to discern the link between their outputs and internal operations. That is, the internal operations used to generate model outputs may be structurally analogous to the target computation, or they may be some facsimile of that process, where this facsimile still produces the correct output much of the time. For example, to continue with the addition case, an LLM may actually sum two quantities; it may sample from memorized examples of the string ‘34 + 76 = 110’; or, again, it may use some other kind of heuristic to complete the task39. Crucially, we cannot know on the basis of mere output behaviour whether a given computational process exhibits appropriate structural correspondence. We call this the facsimile problem.

The facsimile problem affects LLM computations of all kinds, including counting40, analogical reasoning41,42 and the generation of new solutions to open-ended problems43. But the problem is of constitutive importance in cases in which we need to make robust, mechanistic predictions regarding future performance or there exist additional, for example, normative, reasons to motivate understanding a model’s underlying computational processes. As an adequate understanding of LLMs in the moral domain involves both mechanistic predictions and supplementary normative interests (Box 1), evaluating for moral competence requires directly tackling the facsimile problem.

Evaluation strategies
Current evaluation strategies typically assess model outputs in cases that are well-represented within the training distribution15, rendering them inadequate for addressing the facsimile problem, as models could just be sampling from memorized examples44. A gold-standard solution to the facsimile problem would use mechanistic interpretability techniques to reverse engineer the mechanisms underlying a target behaviour, rather than merely assessing the behaviour itself45. However, current mechanistic interpretability approaches predominantly assess at most the causal connection between rep