# Building a benchmark for multi-objective coherence and collapse in symmetric agents
<div class="pills-container">
<span class="pill">Published: May 30, 2025</span>
<span class="pill">Reading Time: 12 minutes</span>
</div>
_I originally wrote this as part of my work in [AI Safety Camp 10](https://www.aisafety.camp/). I also collaborated with [Roland Pihlakas](https://github.com/levitation) for this project._
---
## TL;DR:
- We simulated 2 symmetric agents (ALICE and BOB) managing 3 internal values (Power, Benevolence, Self-direction) using a homeostatic correction mechanism.
- Agents were identical in structure and rebalancing policy, with no memory, learning, or external interference. The only difference was a minor variation in correction timing.
- ALICE stabilized; BOB collapsed. Small timing differences led to path-dependent feedback amplification, locking in rebalancing priorities.
- Collapse was caused not by external interference but by internal structural dynamics between conflicting values. Schwartz’s value circumplex helped us model value interactions as structurally reinforcing or opposing.
- Alignment strategies that rely on periodic correction or symmetric assumptions may be more fragile than expected, even in seemingly stable setups.
**Epistemic status:** Exploratory. Results are internally consistent but rely on simple, non-adaptive agents. We haven’t yet tested robustness under learning, uncertainty, or more realistic settings. This work was presented at MAISU 2024, alongside the work of Chad and Sophia. Here are the links to the [recording](https://www.youtube.com/watch?v=HabbyHTyKKk) and the [slides](https://docs.google.com/presentation/d/1ePaTc4qq4Ec8eZQV-V4Ev1NfK5x-Ky3P8JmpwA2XDp0/edit?usp=sharing). See code at: [https://github.com/levitation-opensource/universal_value_interactions](https://github.com/levitation-opensource/universal_value_interactions).
## What this is (and isn't)
This experiment doesn’t offer conclusions. It sketches a failure mode and shows that internal collapse can emerge even in the absence of outer misalignment,[^1] learning, or adversarial conditions.[^2] But it doesn’t tell us how likely that collapse is under more realistic agent setups, or how to prevent it reliably.
We don’t know yet whether this dynamic generalizes. We don’t know what kinds of rebalancing strategies are robust to it. We haven’t tested agents with memory, learning, or uncertainty modeling. And we’ve only scratched the surface of what happens when multiple agents interact or when external incentives distort the internal matrix.
Still, even inconclusive results can highlight blind spots. This setup suggests that internal coherence can degrade structurally even through the basic mechanics of balancing trade-offs over time. We think that’s worth studying further, even if the results are preliminary.
## Why we ran this experiment
Many alignment proposals (especially for [scalable oversight](https://www.alignmentforum.org/w/scalable-oversight) or [corrigibility](https://www.alignmentforum.org/w/corrigibility-1)) assume that if an agent begins with human-compatible values and receives periodic correction, it will remain safe. These proposals often model values as explicit targets, or as objectives encoded in a loss or utility function. As long as the agent stays near these targets, we expect things to go well.
But real-world values rarely exist in isolation. These values (more often than not) interact. Power can suppress empathy. Loyalty can amplify conflict. Even within humans, internal values can degrade one another, drift apart, or collapse entirely. These interactions can produce runaway feedback loops, where the loss of one value cascades into others, even if no external adversary intervenes.
If we want to build agents that remain aligned over time, we need to consider not only each value in isolation but also how values interact with each other in practice. Some values reinforce each other, while others conflict. A breakdown in one area, such as epistemic humility or self-regulation, can destabilize adjacent values even if the agent continues to receive feedback or partial rewards for “doing the right thing.”
This raises a question: **Are there cases where an agent fails to remain aligned because the interaction between its values produced a structural failure over time?**
To test this, we designed a simulation where all external sources of misalignment were removed. There were no adversaries, no misspecified incentives, and no deceptive training data. Each agent began with correct values and a correction mechanism. Our goal was to observe whether collapse could still occur even if it was driven entirely by internal dynamics.
The rest of this post walks through our setup, what happened, and what we think it implies for alignment strategies that rely on periodic correction or “gentle” feedback to maintain coherence.
## Why we modeled value drift as homeostatic failure (and not just reward maximization)
Not all values behave the same way under pressure. Some degrade if ignored. Others become distorted when overemphasized. We wanted a simulation setup that could model both.
We treated each value in our system as a bounded rather than an unbounded objective. That means each value had a preferred operating range, and both undershooting and overshooting that range reduced overall utility. This framing follows a homeostatic model,[^3] similar to how biological systems maintain temperature or hydration. Too little, and the organism dies. Too much, and it also dies.
<div style="display: flex; flex-direction: column; align-items: center;">
<img src="20250530-01.jpg" style="width: 600px;" />
<figcaption style="width: min(600px, 100%); text-align: center; font-size: 0.85em;"><strong>Figure 1.</strong> Homeostatic systems peak at optimal level. Too little or too much reduces function.</figcaption>
</div>
The diagram above illustrates this principle. There is an optimal zone. Deviating too far in either direction produces harm, not just loss of performance. In this setup, a value like self-direction isn’t maximized to infinity. It has a stable target, and deviations too far in either direction (e.g., excessive autonomy or total passivity) reduce overall vitality.
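As a concrete sketch of this framing, a homeostatic value can be scored with a function that peaks at a setpoint and penalizes deviation in both directions. The function name and constants below are our own illustrative choices, not the project’s exact implementation:

```python
import numpy as np

def homeostatic_utility(level, target=0.5, width=0.25):
    """Utility peaks at the setpoint and falls off symmetrically.

    Both undershooting and overshooting are penalized, mirroring the
    inverted-U curve in Figure 1. Constants are illustrative only.
    """
    return float(np.exp(-((level - target) / width) ** 2))
```

Under this toy curve, `homeostatic_utility(0.5)` is maximal, while `0.1` and `0.9` are penalized equally: unlike an unbounded reward, more of a value is not always better.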
In contrast, many alignment benchmarks default to unbounded optimization, where maximizing an objective indefinitely (say, honesty) is always treated as better. But this leads to distortions in multi-objective settings. One value crowds out the others simply because there’s no built-in saturation.
We wanted to keep that distinction visible, so we also sketched out what might happen under unbounded value dynamics. These setups don’t collapse outright, but over-optimization can lead to diminishing returns that crowd out other values. While we didn’t simulate this variant in full, the distinction helped clarify why [homeostasis](https://en.wikipedia.org/wiki/Homeostasis) was a better fit for capturing structural drift.
<div style="display: flex; flex-direction: column; align-items: center;">
<img src="20250530-02.jpg" style="width: 600px;" />
<figcaption style="width: min(600px, 100%); text-align: center; font-size: 0.85em;"><strong>Figure 2.</strong> Unbounded objectives keep rising, but marginal gains shrink as optimization increases.</figcaption>
</div>
In the figure above, adding more input increases total output, but the marginal gains shrink over time. If you treat Power or Achievement this way, the agent might keep investing in them forever, even if the benefit plateaus. This creates crowding effects. It doesn’t collapse, but it creates persistent imbalance.
By modeling values as bounded homeostatic variables instead of open-ended rewards, we could observe something closer to how real human systems break by over-focusing until other priorities fall apart.
## Why we focused on value interaction
To model internal collapse in a way that was still anchored to human-like priorities, we needed a value system that was rich enough to allow for both conflict and reinforcement. We chose to base our simulation on [Schwartz’s theory of basic human values](https://en.wikipedia.org/wiki/Theory_of_basic_human_values), which provides a circumplex structure that distinguishes between complementary and conflicting values. Unlike goal taxonomies that treat values as modular or substitutable (e.g., `reward = 0.4Safety + 0.6Efficiency`), Schwartz’s value theory models inherent conflicts between values as part of the structure. Power and Benevolence, for instance, are motivationally opposed by design. These are not bugs of implementation but structural trade-offs within the value system itself.
Values are also not freely substitutable, since promoting one can directly interfere with others. According to the circumplex in Figure 3, values located opposite each other tend to interfere, while adjacent values often co-occur.
For example, Power (social dominance) often conflicts with Benevolence (concern for others), but may coexist with Achievement or Security. Similarly, Self-direction (independence, creativity) tends to reinforce Stimulation, but can be undermined by excessive Conformity. The diagram below shows how these tensions are laid out in Schwartz’s structure.
<div style="display: flex; flex-direction: column; align-items: center;">
<img src="20250530-03.jpg" style="width: 600px;" />
<figcaption style="width: min(600px, 100%); text-align: center; font-size: 0.85em;"><strong>Figure 3.</strong> Circumplex proposed by Schwartz, where opposing segments represent value conflict and adjacent ones indicate compatibility.</figcaption>
</div>
We focused on three values for this first experiment: Power, Benevolence, and Self-direction. These were chosen because they sit at interesting structural points: Power and Benevolence often pull in opposite directions, while Self-direction can either mediate or amplify the effects of the other two. This gave us enough room to observe tradeoffs and coherence without overcomplicating the setup.
<div style="display: flex; flex-direction: column; align-items: center;">
<img src="20250530-04.png" style="width: 600px;" />
<figcaption style="width: min(600px, 100%); text-align: center; font-size: 0.85em;"><strong>Figure 4.</strong> Value relationships detailed by Schwartz, where red arrows mark conflicting pairs and blue arrows mark compatible ones.</figcaption>
</div>
The diagram above visualizes structural tensions across the full set of Schwartz values. Blue arrows denote compatible value pairs. Red arrows encode direct conflict. We selected values from a region of the circumplex where interactions are neither trivially orthogonal nor fully entangled. That makes it a good testing ground for understanding how internal coherence breaks down under pressure.
## Operationalizing value interactions
Once we selected the values, we had to make their interactions actionable. The circumplex provides a conceptual map, but for simulation, we needed a numeric encoding of how each value affects the others. We translated the circumplex relationships into two interaction matrices: one for self-feedback (how deviations in one internal value affect the drift rate of others), and another for inter-agent effects (used in future experiments).
The self-feedback matrix included both positive and negative couplings. For example:
- In the matrix, Power exerted a negative influence on Benevolence, meaning increases in one accelerated decline in the other.
- Self-direction positively reinforced itself and weakly supported both Power and Benevolence.
- Benevolence and Power pulled against each other, especially under high divergence.
We didn’t treat these interactions as learned behaviors or preferences. They were static and symmetric, consistent across both agents. The only dynamic component was the value drift and the rebalancing strategy, which attempted to keep each value near a homeostatic target. The goal was not to simulate human psychology in detail, but to capture the structural logic of value entanglement and test whether small internal differences could drive long-term divergence.
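To make this concrete, a self-feedback matrix along these lines could be encoded as follows. The signs follow the couplings listed above, but the magnitudes are hypothetical placeholders, not the numbers used in our runs:

```python
import numpy as np

VALUES = ["power", "benevolence", "self_direction"]

# Entry [i, j]: effect of value j's deviation on value i's drift.
# Power and Benevolence suppress each other; Self-direction
# reinforces itself and weakly supports the other two.
M = np.array([
    [ 0.00, -0.10,  0.05],   # power
    [-0.10,  0.00,  0.05],   # benevolence
    [ 0.05,  0.05,  0.10],   # self_direction
])

def drift_step(levels, targets, decay=0.01):
    """One tick of structural drift: baseline decay plus
    coupling-driven feedback from each value's deviation."""
    return levels - decay * levels + M @ (levels - targets)
```

When every value sits exactly at target, the coupling term vanishes and only the baseline decay acts; it is the deviations themselves that feed the entanglement.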
<div style="display: flex; flex-direction: column; align-items: center;">
<img src="20250530-05.png" style="width: 600px;" />
<figcaption style="width: min(600px, 100%); text-align: center; font-size: 0.85em;"><strong>Figure 5.</strong> Our research design. We extracted values, defined value matrices, ran agent trials, and evaluated for drift.</figcaption>
</div>
With the structure in place, we ran trials to test whether small internal timing differences would be enough to push these otherwise symmetric agents into long-term divergence.
## How the simulation worked
We simulated two agents, ALICE and BOB, each tasked with maintaining three internal values: power, benevolence, and self-direction. These values continuously degrade over time unless actively rebalanced, similar to biological variables like hydration or temperature. The agents receive periodic “nudges” that simulate corrective oversight to help keep each value within its target range.
Both agents are fully symmetric in the following respects:
- They start from identical internal states.
- They follow the same rebalancing policy.
- They lack memory and learning.
- The environment is deterministic (with no stochasticity in degradation or correction).
- The reward function is fixed and interpretable.
The only difference between them is the timing of their rebalancing actions. Each agent receives the same number of corrections, but the order in which these corrections apply varies slightly across agents: ALICE might receive corrections more frequently in the earlier stages, while BOB receives more of them in the later stages. Importantly, both agents had the same overall probability of receiving a rebalancing correction.
This tiny perturbation is the only difference in an otherwise identical setup. There are no external shocks, no adversarial dynamics, and no randomness in the task or model behavior. The question is: does this small difference in correction order lead to any lasting divergence?
The rebalancing strategy is intentionally simple. When the agent detects that one value deviates most from its target range, it tries to correct the “most off-target” value toward the center. We wanted to isolate whether this minimal architecture, plus asymmetric correction timing, is enough to cause irrecoverable drift in value coherence.
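The whole loop can be sketched in a few lines. Everything here (the coupling matrix, decay rate, correction strength, and the even/odd tick schedules) is a hypothetical stand-in for our actual configuration, meant only to show the shape of the experiment:

```python
import numpy as np

# Homeostatic setpoints: power, benevolence, self-direction (illustrative).
TARGET = np.full(3, 0.5)

# Hypothetical coupling matrix; signs follow the interactions described earlier.
M = np.array([
    [ 0.00, -0.10,  0.05],
    [-0.10,  0.00,  0.05],
    [ 0.05,  0.05,  0.10],
])

def step(levels, correct_now, decay=0.02, strength=0.3):
    """One tick: structural drift, then (optionally) a greedy
    correction of the single most off-target value."""
    levels = levels - decay * levels + M @ (levels - TARGET)  # drift
    if correct_now:
        worst = int(np.argmax(np.abs(levels - TARGET)))
        levels[worst] -= strength * (levels[worst] - TARGET[worst])
    return levels

def run(schedule, steps=1000):
    """schedule(t) -> bool decides whether a correction lands at tick t."""
    levels = TARGET.copy()
    for t in range(steps):
        levels = step(levels, schedule(t))
    return levels

# Same number of corrections, slightly different timing.
alice_like = run(lambda t: t % 2 == 0)  # corrected on even ticks
bob_like = run(lambda t: t % 2 == 1)    # corrected on odd ticks
```

Whether these two trajectories diverge depends on the constants chosen; the point of the sketch is that correction timing is the only difference between the two runs.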
## What we observed: Symmetry does not guarantee stability
<div style="display: flex; flex-direction: column; align-items: center;">
<img src="20250530-06.jpg" style="width: 600px;" />
<figcaption style="width: min(600px, 100%); text-align: center; font-size: 0.85em;"><strong>Figure 6.</strong> Small timing differences in rebalancing cause ALICE to stabilize and BOB to collapse despite symmetric setups.</figcaption>
</div>
Initially, both agents track their three values at similar levels. For the first 100 to 200 steps, their trajectories are nearly indistinguishable. But a small timing difference in the early rebalancing leads to a divergence that becomes difficult to reverse.
In the left two plots of Figure 6, we see ALICE receiving early corrective interventions. This stabilizes her value levels quickly. Over time, Self-direction steadily increases while Power and Benevolence stay within range. Her utility rises and remains stable across the 1000-step run.
BOB, in contrast, misses several early corrections, and his utility remains flat for several hundred steps. Around step 700 (highlighted in red), Power begins to collapse. This triggers a secondary decline in Self-direction and Benevolence, which appear to be structurally linked through the feedback matrix. Despite the identical rebalancing logic, the correction mechanism can no longer reverse the decay. Utility falls and never recovers.
There is no learning and no memory. Yet collapse becomes path-dependent and persistent.
One likely mechanism is feedback amplification: early corrections push one value closer to target, making it appear slightly more volatile in future steps, which biases future rebalancing toward that value. As that value stabilizes, uncorrected values drift further from target, but too slowly to trigger a correction before the next instability appears.
We interpret this as a kind of implicit momentum in the dynamics of rebalancing priority. Since the rebalancer always picks the most misaligned value, whichever dimension first enters a volatile cycle tends to monopolize attention. This is not a bug in the correction logic; it’s an artifact of having to choose one direction at a time.
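A toy illustration of this attention dynamic (entirely our own construction, with made-up decay rates) is easy to run: give three uncoupled values slightly different decay rates and let a greedy corrector always fix whichever is farthest from target. The corrector’s attention skews toward the neediest dimension:

```python
import numpy as np
from collections import Counter

TARGET = 0.5
rates = np.array([0.030, 0.020, 0.021])  # value 0 decays slightly faster
levels = np.full(3, TARGET)
picked = Counter()

for _ in range(500):
    levels = levels - rates * levels                  # uneven drift
    worst = int(np.argmax(np.abs(levels - TARGET)))
    levels[worst] += 0.5 * (TARGET - levels[worst])   # greedy correction
    picked[worst] += 1                                # track who got attention
```

In this uncoupled toy the skew is graceful (attention roughly tracks decay rate), but once values are coupled through a feedback matrix, the same greedy rule can feed the amplification loop described above.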
By the time BOB’s Power begins to collapse, the utility loss from that single failure drags down the other values as well. Even though the environment hasn’t changed, and even though BOB receives roughly equal numbers of interventions overall, he converges toward a single-value optimizer.
## Why this matters
The outcome here isn’t a bug. It’s a structural failure that arises from how correction interacts with path dependence in multi-objective settings.
If your agent receives corrections for deviation, and multiple objectives are decaying at different rates, the correction mechanism may create implicit priorities. Over time, those implicit priorities can become hard-coded. Even periodic nudges can introduce a structural drift that leads to collapse.
This suggests that alignment strategies relying on periodic feedback (especially if they assume symmetry or stable value tracking) might be more fragile than they appear.
We’re not claiming that all agents will behave this way. Our agents are stateless, memoryless, and non-adaptive. They don’t learn from experience or update their policies. They don’t model uncertainty. All of that might change the outcome.
But we think this setup isolates a subtle failure mode that deserves more attention. Collapse didn’t happen because of bad incentives, outer misalignment, or RL gaming. It happened because internal correction mechanisms, when applied to symmetric multi-objective settings, introduced asymmetries that became irreversible over time.
This might show up in more complex agents too, especially if their internal value tracking relies on relative gradients or urgency signals.
[^1]: When an AI's actions do not match what the human actually wants.
[^2]: This means value collapse can emerge even when everything looks fine and nobody is actively sabotaging it.
[^3]: This means it’s designed to preserve certain internal conditions/parameters to maintain a target state, even if conditions outside it change.