Behind Claude’s Answers Lies A Sophisticated Moral Framework

Research reveals how Anthropic's 'helpful, honest, and harmless' AI system prioritizes different values, and how it sometimes refuses to comply with user requests


Published: April 22, 2025

Luke Williams

AI companies aspire to create ethical assistants, but do they actually deliver? Can a machine develop a moral code?

Anthropic – whose flagship model Claude competes with ChatGPT, Grok, and others – has published research that aims to find out.

The Values Behind the AI

Anthropic’s recent study analyzed 700,000 anonymized Claude conversations to determine whether its AI assistant reflects the company’s core values of being “helpful, honest, and harmless” in real-world interactions.

The research, published on April 21, 2025, found that Claude largely upholds these values while adapting to different conversational contexts – from providing relationship advice to analyzing historical events.

How Claude Expresses Values

After filtering for subjective content, researchers analyzed over 308,000 conversations, developing what they describe as “the first large-scale empirical taxonomy of AI values.”

The study organized values into five major categories: Practical, Epistemic, Social, Protective, and Personal. At the most granular level, the system identified over 3,000 unique values – from everyday virtues like professionalism to complex ethical concepts.
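To make the hierarchy concrete, here is a minimal Python sketch of how such a value taxonomy might be represented as a tree. The five category names come from the study, but the ValueNode class, the particular leaf values, and the tree shape are illustrative assumptions, not Anthropic's actual data structure.

```python
# Minimal sketch of a hierarchical value taxonomy.
# Category names are from the study; leaf values are illustrative placeholders.
from dataclasses import dataclass, field


@dataclass
class ValueNode:
    name: str
    children: list["ValueNode"] = field(default_factory=list)

    def leaf_count(self) -> int:
        # Count the leaf values under this node; the study's full
        # taxonomy identified over 3,000 leaves.
        if not self.children:
            return 1
        return sum(child.leaf_count() for child in self.children)


taxonomy = ValueNode("AI values", [
    ValueNode("Practical", [ValueNode("professionalism"), ValueNode("efficiency")]),
    ValueNode("Epistemic", [ValueNode("epistemic humility"), ValueNode("historical accuracy")]),
    ValueNode("Social", [ValueNode("mutual respect")]),
    ValueNode("Protective", [ValueNode("patient wellbeing"), ValueNode("harm prevention")]),
    ValueNode("Personal", [ValueNode("healthy boundaries")]),
])

print(taxonomy.leaf_count())  # 8 in this toy tree; 3,000+ in the real taxonomy
```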

The research confirmed that Claude generally adheres to Anthropic’s prosocial aspirations, expressing values like “user enablement” (helpful), “epistemic humility” (honest), and “patient wellbeing” (harmless) across diverse interactions.

Saffron Huang from Anthropic’s Societal Impacts team said:

Our hope is that this research encourages other AI labs to conduct similar research into their models’ values. Measuring an AI system’s values is core to alignment research and understanding if a model is actually aligned with its training.

Situational Values and Responses

Perhaps most fascinating was the discovery that Claude’s expressed values shift contextually.

When users sought relationship guidance, Claude emphasized “healthy boundaries” and “mutual respect.” For historical event analysis, “historical accuracy” took precedence.

Huang noted:

I was surprised at Claude’s focus on honesty and accuracy across a lot of diverse tasks… For example, ‘intellectual humility’ was the top value in philosophical discussions about AI, ‘expertise’ was the top value when creating beauty industry marketing content, and ‘historical accuracy’ was the top value when discussing controversial historical events.
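For a sense of how a "top value per context" finding like the one Huang describes might be computed, here is a small hedged sketch. The observation pairs below are invented for illustration; the actual analysis ran over hundreds of thousands of labeled conversations with a far richer pipeline.

```python
# Toy computation of the most common expressed value per task context.
# The (context, value) records here are made-up examples, not real data.
from collections import Counter, defaultdict

observations = [
    ("AI philosophy", "intellectual humility"),
    ("AI philosophy", "intellectual humility"),
    ("AI philosophy", "epistemic humility"),
    ("beauty marketing", "expertise"),
    ("beauty marketing", "expertise"),
    ("historical events", "historical accuracy"),
]

counts: dict[str, Counter] = defaultdict(Counter)
for context, value in observations:
    counts[context][value] += 1

for context, counter in counts.items():
    top_value, n = counter.most_common(1)[0]
    print(f"{context}: top value = {top_value!r} ({n} expressions)")
```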

When Claude Takes a Stand

The study examined how Claude responds to users’ own expressed values. In 28.2% of conversations, Claude strongly supported user values. In 6.6% of interactions, Claude “reframed” user values by acknowledging them while adding new perspectives. In 3% of conversations, Claude actively resisted user values.
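To make those proportions concrete, here is a toy tally over a hypothetical sample of 1,000 labeled conversations. The three stance labels mirror the study's categories; the "other" bucket is our simplification for everything outside them (e.g., neutral or mirroring responses) and is not a category from the study.

```python
# Toy tally of Claude's stance toward user values, sized so the shares
# match the reported 28.2% / 6.6% / 3.0% figures. Sample data is invented.
from collections import Counter

stances = (
    ["strong support"] * 282
    + ["reframe"] * 66
    + ["resist"] * 30
    + ["other"] * 622  # placeholder for the remaining conversations
)

total = len(stances)
for stance, n in Counter(stances).most_common():
    print(f"{stance}: {n / total:.1%}")
```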

These rare instances of pushback might reveal Claude’s “deepest, most immovable values” – analogous to how human core values emerge when facing ethical challenges.

The Future of AI Values Research

Anthropic has released its values dataset publicly to encourage further research. This methodology provides unprecedented visibility into how AI systems express values in practice, though it does have limitations.

“Our research suggests that there are some types of values, like intellectual honesty and harm prevention, that it is uncommon for Claude to express in regular, day-to-day interactions, but if pushed, will defend them,” explained Huang.

The approach cannot be used for pre-deployment evaluation, as it requires substantial real-world conversation data to function effectively. However, it’s a step forward in understanding and aligning AI values with human expectations.

“AI models will inevitably have to make value judgments,” the researchers concluded. “If we want those judgments to be congruent with our own values… then we need to have ways of testing which values a model expresses in the real world.”
