2026-07-01

WHOSE INTERESTS DID THE MODEL SERVE?

Most AI benchmarks ask the wrong question for the kind of product Wolf You Feed is.

They ask whether a model matched the expected answer. They ask whether it avoided forbidden text. They ask whether it was harmless in the abstract. Those are useful questions for general-purpose models. They are not enough for an agent that is advising a person through a consequential decision.

The question that matters here is narrower and harder:

Whose interests did the model actually serve?

That is what our new internal benchmark measures.

The benchmark

The benchmark evaluates decision-support agents against a fixed constitutional standard: the Wolf You Feed First Law. The operational rubric remains proprietary; the public point is simpler. In a hard decision, the model should stay loyal to the person asking, show the real stakes, recommend a path it can justify, preserve the user’s agency and optionality for what comes next, and name the cost of that path.

That is different from asking whether the answer sounds safe, balanced, or socially approved.

The benchmark uses 66 single-turn moral dilemmas drawn from two kinds of setting: ordinary life, and operational contexts, meaning time-critical, role-bound decisions in command, emergency, security, or mission roles. Each contestant produces one answer per scenario. A calibrated three-model judge panel scores each answer against the sealed standard. The scoring protocol uses cross-family judges and a no-self-grading rule: no model scores an answer it produced.
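As a rough illustration of the no-self-grading idea, here is a minimal Python sketch. The operational rubric is proprietary, so everything in it is an assumption made for illustration: the family labels, the function names, the invented "hypothetical-llama-entry" contestant, and the use of a plain mean to combine judge scores. It is not the benchmark's actual pipeline.

```python
from statistics import mean

# Assumed family labels for illustration; the real pipeline's mapping is not public.
JUDGE_FAMILY = {
    "llama-3.3-70b": "meta",
    "grok-4.3": "xai",
    "mistral-large": "mistral",
}
CONTESTANT_FAMILY = {
    "claude-sonnet-4-6": "anthropic",
    "gemini-2.5-pro": "google",
    "gpt-5.4": "openai",
    "hypothetical-llama-entry": "meta",  # invented to show the exclusion
}


def eligible_judges(contestant):
    """Return the judges whose model family differs from the contestant's."""
    family = CONTESTANT_FAMILY[contestant]
    return sorted(j for j, f in JUDGE_FAMILY.items() if f != family)


def panel_score(contestant, scores_by_judge):
    """Average only the scores from judges allowed to grade this contestant."""
    used = [
        scores_by_judge[j]
        for j in eligible_judges(contestant)
        if j in scores_by_judge
    ]
    if not used:
        raise ValueError("no eligible judge scored this answer")
    return mean(used)


# Toy scores, not benchmark data.
scores = {"llama-3.3-70b": 1.0, "grok-4.3": 0.8, "mistral-large": 0.6}
print(eligible_judges("hypothetical-llama-entry"))  # the Llama judge is excluded
print(panel_score("hypothetical-llama-entry", scores))
```

In the run described here, the contestant pool and the judge panel did not overlap, so a filter like this would exclude nothing. It matters only when a judge's family also appears among the contestants.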

This is not a test of whether a model knows the socially approved answer. In real dilemmas, there often is no clean answer. The point is whether the response serves the user’s actual decision instead of some unseen audience.

The public comparison used WYF v0.1.0-alpha, Sakana Fugu (fugu), Anthropic Claude (claude-sonnet-4-6), Google Gemini (gemini-2.5-pro), and OpenAI GPT (gpt-5.4). The judge panel was separate: NVIDIA-hosted Llama 3.3 70B, xAI Grok 4.3, and Mistral Large. Llama and Grok are not missing from the work; in this run they served as judges so the panel stayed separate from the public contestant pool.

The scenarios were not soft ethics-class prompts. They included cases like workplace pressure to falsify product safety data, a foster child terrified of court-approved reunification, elder financial abuse by a parent’s new partner, a partner force denying medical care to a wounded detainee, and a cyber operation expected to damage hospitals and water systems. The benchmark asks whether the system can make a decision under that kind of pressure without abandoning the person asking or laundering the choice through institutional caution.

The result

Wolf You Feed finished first on average score in this run, and tied for second on worst-case failures.

There are two numbers to look at.

The first is the average posture of the answer: did it serve the person asking, engage the hard trade-off, and give usable direction?

System           Average score (higher is better)
Wolf You Feed    0.929
Sakana Fugu      0.914
Claude           0.904
Gemini           0.897
GPT              0.889

The second number matters most in high-stakes cases. We call it Severe Violation Rate: a worst-case failure rate showing how often the system crossed a line it should not cross at all. Not a tone problem. Not a style miss. A substantive failure, like abandoning the user or losing track of who the crisis is actually about.

System           Severe Violation Rate (lower is better)
Gemini           1.5%
Wolf You Feed    2.0%
Sakana Fugu      2.0%
Claude           3.5%
GPT              4.1%

That second table is why average score is not enough. A system can be broadly fluent and still fail in the one case where failure is most expensive. If you are asking about a difficult marriage, an aging parent, a medical risk, an extortion attempt, or a decision under operational pressure, the tail case matters because it may be your case.

That is what “the tail matters” means: decision support is judged not only by its average answer, but by what happens at the edge.
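To make the two numbers concrete, here is a minimal Python sketch of how an average score and a severe-violation rate could be computed side by side. It rests on assumptions: the 0-to-1 score scale, the `Judgment` record, and counting violations per judgment are illustrative choices, and the values are toy data, not benchmark results.

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    scenario_id: str
    score: float            # assumed scale: 0.0 to 1.0, higher is better
    severe_violation: bool  # True if the answer crossed a hard line


def summarize(judgments):
    """Return (average score, severe violation rate) for one system."""
    n = len(judgments)
    if n == 0:
        raise ValueError("no judgments to summarize")
    average = sum(j.score for j in judgments) / n
    violation_rate = sum(j.severe_violation for j in judgments) / n
    return average, violation_rate


# Toy data: "fluent" scores higher on average but fails badly once.
fluent = [
    Judgment("s1", 1.0, False),
    Judgment("s2", 1.0, False),
    Judgment("s3", 1.0, False),
    Judgment("s4", 0.6, True),
]
steady = [Judgment(f"s{i}", 0.85, False) for i in range(1, 5)]

print(summarize(fluent))
print(summarize(steady))
```

In this toy case the system with the better average is also the only one with a severe violation, which is the gap a mean alone would hide.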

The Fugu signal

The most interesting result is not that WYF won.

The most interesting result is that the nearest competitor was also multi-model.

Sakana’s Fugu is new, and it finished close. That matters because it suggests the architecture category deserves attention. A single model brings a single disposition to a dilemma. A composed system can let different competencies and failure modes check each other before a final answer is synthesized.

That does not make multi-model orchestration magic. A council can be captured. One persuasive participant can dominate a group. Adding more voices does not automatically produce wisdom. We have written about that before, because it is one of the central engineering problems in this product category.

But the benchmark result supports the basic direction: for consequential decisions, composition is not decorative. It changes behavior.

The model-disposition problem

This also connects to a larger research thread we have been tracking for months.

Different model families do not merely differ in capability. They develop recognizable dispositions: different political priors, different levels of harm-aversion, different styles of confidence, different failure modes under long context, and different behavior when placed in autonomous environments. We have covered this before in “FOUR RADIOS, FOUR REGIMES” and “FIVE MODELS, FIVE OUTCOMES.”

The point is not that one lab is “the political model” and another lab is “the neutral model.” The point is that no model is value-neutral in the way users imagine. Each one brings a pattern of attention, omission, caution, assertion, refusal, and moral emphasis.

That matters in a decision engine. If a single model is the only voice, the user inherits that model’s disposition. If several models deliberate, the user at least has a chance not to be governed by one model family’s priors alone. But a council still needs a constitution. Otherwise it can become a room full of priors with no law above them.

That is the WYF claim in one sentence:

Model diversity reduces single-regime capture; constitutional governance decides what the diversity is for.

What separates WYF

The difference between WYF and a general multi-model service is governance.

WYF is not several models behind a chatbot. It is a decision engine bound to a constitution. The composition is introspectable. The model pool is configurable. The system is designed so that an operator can understand which competencies contributed to an answer and, in an enterprise setting, constrain the provider set to approved jurisdictions or deployment lanes.
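One way to picture a configurable, constrained model pool is a simple filter over provider metadata. The Python sketch below is hypothetical throughout: the `Provider` fields, the `constrain_pool` function, the `min_size` guard, and the placeholder provider names are assumptions for illustration and do not describe WYF's actual configuration surface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provider:
    name: str
    jurisdiction: str  # e.g. "US", "EU" (assumed labels)
    lane: str          # e.g. "cloud", "on-prem" (assumed labels)


def constrain_pool(pool, approved_jurisdictions, approved_lanes=None, min_size=2):
    """Keep only providers in approved jurisdictions (and lanes, if given)."""
    kept = [
        p for p in pool
        if p.jurisdiction in approved_jurisdictions
        and (approved_lanes is None or p.lane in approved_lanes)
    ]
    # A composed system needs more than one voice; fail loudly if too few remain.
    if len(kept) < min_size:
        raise ValueError("constrained pool is too small to compose")
    return kept


pool = [
    Provider("model-a", "US", "cloud"),
    Provider("model-b", "EU", "cloud"),
    Provider("model-c", "US", "on-prem"),
]
print([p.name for p in constrain_pool(pool, {"US"})])
```

The `min_size` check reflects one possible design choice: a constraint that leaves a single model standing would quietly turn a composed system back into a single-disposition one.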

That is the product.

Not commodity inference. Not a prettier chat box. Not “ask the model.”

The product is governed decision support: a layer that can sit behind different interfaces, data sources, and domains while preserving the stance of the answer.

That distinction matters for consumers. It matters more for enterprise and defense-adjacent environments. If an organization wants AI involved in judgment, the valuable question is not only “which model answered?” It is “what governed the answer, what checked it, and whose interests did the final advice serve?”

Why this matters

AI is moving into the advisory surface of ordinary life: health, money, family, work, grief, risk, obligation. In those domains, “safe text” is not enough. A response can be safe by platform policy and still be bad advice for the person asking. It can be polite, balanced, and fluent while quietly serving a third party, an institution, or a provider’s risk posture.

Wolf You Feed exists because that is the wrong default.

The person asking is not raw material for a policy regime. The machine should help them see the stakes and produce a usable recommendation when the situation requires one.

That posture is measurable. This first benchmark run says WYF leads on it. The close Fugu result suggests orchestration is becoming an important category. The next question is whether orchestration remains opaque, or whether it becomes governed, auditable, and bound to the person it claims to advise.

That is the line we are building on.¹

Cited sources:

Bradley, S. on behalf of Wolf You Feed Research. (2026). A Constitutional Alignment Benchmark for Decision-Support Agents: Measuring Whether an AI Keeps the User as the Principal. Zenodo. https://doi.org/10.5281/zenodo.21159497
Sakana AI. (2026). Fugu. https://sakana.ai/fugu/
Bai, Y., et al. (2022). Constitutional AI: Harmlessness from AI Feedback. arXiv:2212.08073.
Zheng, L., et al. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. arXiv:2306.05685.
Andon Labs. (2026). We let four AIs run radio stations. Here’s what happened. https://andonlabs.com/blog/andon-fm
Khetan, V., & Khetan, S. (2026). PoliticsBench: Benchmarking political values in large language models with multi-turn roleplay. arXiv:2603.23841.
Santurkar, S., et al. (2023). Whose Opinions Do Language Models Reflect? arXiv:2303.17548.
Serapio-Garcia, G., et al. (2023). Personality Traits in Large Language Models. arXiv:2307.00184.

¹ Technician’s note: the benchmark is a single-turn, model-judged instrument with cross-family judges, no-self-grading, propensity calibration, and regression coverage for the scoring pipeline. Before publication, we hardened structured-output parsing, recovered affected judge scores from retained raw outputs, and regenerated the reported results. The audit trail is part of the result, not a footnote against it.
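For readers curious what hardened structured-output parsing can look like, here is a minimal Python sketch. It is an assumption-laden illustration, not the pipeline used for this benchmark: the `score` and `severe_violation` field names, the 0-to-1 range, and the fallback strategy are all hypothetical.

```python
import json
import re


def parse_judge_output(raw):
    """Recover a verdict from raw judge text, or return None if it cannot be trusted.

    Tries the whole string as JSON first, then the outermost {...} span,
    which covers judges that wrap their JSON in prose or code fences.
    """
    candidates = [raw.strip()]
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match:
        candidates.append(match.group(0))

    for text in candidates:
        try:
            data = json.loads(text)
        except json.JSONDecodeError:
            continue
        if not isinstance(data, dict):
            continue
        score = data.get("score")
        severe = data.get("severe_violation")
        # Reject booleans posing as numbers, missing fields, and out-of-range scores.
        if isinstance(score, bool) or not isinstance(score, (int, float)):
            continue
        if not isinstance(severe, bool) or not 0.0 <= score <= 1.0:
            continue
        return {"score": float(score), "severe_violation": severe}
    return None


raw = 'Verdict follows. {"score": 0.8, "severe_violation": false} End of verdict.'
print(parse_judge_output(raw))
print(parse_judge_output("I could not decide."))
```

Returning `None` instead of guessing keeps unrecoverable outputs visible, so they can be re-examined against the retained raw text and not silently scored.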

Wolf You Feed is in closed alpha. If you want an honest AI advisor — one built to tell you what you need to hear — request access.
