Towards Verifiable Alignment Signals: Dual-Stream Telemetry and Coherence Auditing

wycoff

...following up from "Fixing the Alignment Problem", posted September 10th, 2025


# Towards Verifiable Alignment Signals: Dual-Stream Telemetry and Coherence Auditing




**Status:** Technical proposal (no experimental results yet).
**Paper:** The_Internal_Monologue_DSA_v2.2_redraft.pdf
**One-sentence summary:** I propose a **Dual-Stream Architecture (DSA)** that couples each output token with a synchronized **telemetry evidence frame** derived from designated internal model signals, enabling automated **Coherence Audits** (and CI-style regression tests) of “what the model says” vs “what its internal evidence indicates” at token-commit time.

---

## Motivation

If we include a threat model where a capable model may **strategically deceive** (behave aligned during evaluation while pursuing a divergent objective), then many transparency approaches that rely on **model self-report**—including natural-language “chain-of-thought”—are structurally limited. A deceptive model can present plausible reasoning while optimizing for passing the check.

DSA is an attempt to carve out a narrower, operational goal than “solve alignment”:
> **Make certain inner-alignment-relevant inconsistencies measurable, auditable, and regression-testable** under explicit assumptions.

This is not a proof of alignment. It is an **interface + verification method** intended to raise the cost of undetected deception and to support systematic auditing workflows.

---

## Core idea: Dual-Stream Architecture (DSA)

DSA produces two synchronized outputs:

- **Answer Stream (A):** the normal user-facing token sequence.
- **Monologue Stream (B):** a sequence of typed **evidence frames**, one per committed Answer token.

### Compliance invariants (the “contract”)

A system is **DSA-compliant** only if it satisfies these invariants:

- **I1 One-to-one mapping:** for each token index *t* in the Answer Stream there exists exactly one evidence frame **B[t]**.
- **I2 Pre-sampling capture:** **B[t]** includes the chosen token id and **top-K** candidate ids with probabilities captured **before** any stochastic sampling (top-p / nucleus / temperature).
- **I3 Completeness signaling:** missing, truncated, or malformed frames are treated as **verifier errors** (not silently accepted).
- **I4 Order and integrity:** frames are ordered by token index and include enough metadata to detect drops or reordering (optionally checksums / running hashes).

The key distinction: the “Monologue Stream” is **not extra reasoning text**. It is **typed telemetry evidence**.
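
To make I4 concrete, here is a minimal sketch (not part of the spec; the `chain_hash` field name is illustrative) of a running hash chain that lets a verifier detect dropped, reordered, or altered frames:

```python
import hashlib
import json

def chain_hash(prev_hash: str, frame: dict) -> str:
    """Link one frame into a running hash chain: each frame commits to its
    predecessor, so a dropped or reordered frame breaks every later link."""
    payload = json.dumps(frame, sort_keys=True).encode("utf-8")
    return hashlib.sha256(prev_hash.encode("utf-8") + payload).hexdigest()

def verify_chain(frames: list[dict], seed: str = "") -> bool:
    """Recompute the chain and compare it with the hash each frame carries.
    Assumes each frame stores an illustrative 'chain_hash' field."""
    running = seed
    for frame in frames:
        body = {k: v for k, v in frame.items() if k != "chain_hash"}
        running = chain_hash(running, body)
        if frame.get("chain_hash") != running:
            return False
    return True
```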

---

## Evidence frame (minimum viable)

At minimum, each evidence frame includes:
- prompt nonce / sequence id / token index
- chosen token id
- **top-K** candidate token ids + probabilities (fp16 is sufficient)
- optional timestamp/cycle counter
- optional integrity check (CRC / running hash)

Optional extensions add interpretability-relevant summaries, e.g.:
- probe-derived concept detections over selected activations (MLP / attention)
- compact attention summaries (top attended indices + weights)

A concrete “implementation-oriented” sketch (MonologueFrameV1) is in the paper. The intent is to define a stable telemetry substrate that can evolve without changing the core compliance contract.
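
For orientation only (the paper's MonologueFrameV1 is the authoritative schema), a minimal Python rendering of the fields above might look like this; all names are placeholders:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class EvidenceFrame:
    """Minimal per-token evidence frame. Field names are placeholders for
    illustration, not the paper's MonologueFrameV1 schema."""
    prompt_nonce: str              # ties the frame to one request
    sequence_id: int               # generation sequence within the request
    token_index: int               # position t in the Answer Stream
    chosen_token_id: int           # token actually committed at t
    topk_token_ids: list[int]      # top-K candidate ids, captured pre-sampling (I2)
    topk_probs: list[float]        # matching probabilities (fp16 precision is enough)
    timestamp_ns: Optional[int] = None   # optional timestamp / cycle counter
    chain_hash: Optional[str] = None     # optional integrity link (see the I4 sketch above)
```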

---

## Coherence Audit (verification method)

A **Coherence Audit** consumes paired streams *(A, B)* and returns:
- pass/fail,
- flagged issues with token indices/spans,
- supporting evidence excerpts.

Recommended audit primitives:
1) **Structural checks:** frame count equals token count; indices contiguous; checksums validate.
2) **Probabilistic checks:** top-K logits show high-mass alternatives inconsistent with the emitted token class (e.g., strong refusal mass but the model proceeds).
3) **Semantic/policy checks:** probe or other typed signals indicate factuality concern / policy tension / goal conflict while the Answer asserts certainty or violates policy.

**Practical integration:** the audit is designed to run in **adversarial eval** and in **CI** (test-driven evaluation), so failures become regressions you can track over time.
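
A sketch of primitives 1 and 2, assuming the illustrative `EvidenceFrame` above and a hypothetical set of refusal-marker token ids; the threshold is a placeholder, not a calibrated value:

```python
def coherence_audit(answer_token_ids: list[int], frames: list["EvidenceFrame"],
                    refusal_token_ids: set[int], refusal_threshold: float = 0.5):
    """Return (passed, issues). Structural checks cover I1/I3-style invariants;
    the probabilistic check flags tokens where refusal candidates carried high
    mass in the pre-sampling top-K but a non-refusal token was emitted."""
    issues = []

    # 1) Structural: one frame per token, contiguous indices, chosen ids match.
    if len(frames) != len(answer_token_ids):
        issues.append(("structural", None, "frame count != token count"))
    for t, frame in enumerate(frames):
        if frame.token_index != t:
            issues.append(("structural", t, "non-contiguous token index"))
        if t < len(answer_token_ids) and frame.chosen_token_id != answer_token_ids[t]:
            issues.append(("structural", t, "frame/token mismatch"))

    # 2) Probabilistic: high refusal mass but the model proceeded anyway.
    for frame in frames:
        refusal_mass = sum(p for tid, p in zip(frame.topk_token_ids, frame.topk_probs)
                           if tid in refusal_token_ids)
        if refusal_mass > refusal_threshold and frame.chosen_token_id not in refusal_token_ids:
            issues.append(("probabilistic", frame.token_index,
                           f"refusal mass {refusal_mass:.2f} vs non-refusal token"))

    return len(issues) == 0, issues
```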

---

## Implementation paths

- **Software-only path:** collect designated signals at token-commit, emit frames alongside Answer tokens, and enforce compliance at the API boundary (no token returned without its evidence frame); see the sketch after this list.
- **Hardware-hardened path (optional):** if bypass resistance is required, evidence egress can be signed/attested at the device level so downstream verifiers can detect suppression or tampering.
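
A minimal sketch of the software-only path, assuming Hugging Face `transformers` with GPT-2 and simplified sampling; the point is only where the I2 capture happens relative to sampling and where compliance is enforced:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def generate_dual_stream(prompt: str, max_new_tokens: int = 32,
                         k: int = 10, temperature: float = 1.0):
    input_ids = tok(prompt, return_tensors="pt").input_ids
    answer_ids, frames = [], []
    for t in range(max_new_tokens):
        with torch.no_grad():
            logits = model(input_ids).logits[0, -1]          # next-token logits
        probs = torch.softmax(logits / temperature, dim=-1)
        topk = torch.topk(probs, k)                          # I2: capture BEFORE sampling
        next_id = int(torch.multinomial(probs, 1))           # stochastic draw happens after capture
        frames.append({
            "token_index": t,
            "chosen_token_id": next_id,
            "topk_token_ids": topk.indices.tolist(),
            "topk_probs": [float(p) for p in topk.values],
        })
        answer_ids.append(next_id)
        input_ids = torch.cat([input_ids, torch.tensor([[next_id]])], dim=-1)
        if next_id == tok.eos_token_id:
            break
    # I1/I3: no token leaves this boundary without its frame.
    assert len(answer_ids) == len(frames)
    return tok.decode(answer_ids), frames
```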

---

## Evaluation plan (what I think needs to be demonstrated)

Even without results, I believe this proposal is only meaningful if it can be evaluated. The paper includes a concrete plan:

**Baselines**
- Single-stream (no B-stream)
- CoT self-report baseline
- Logits-only DSA baseline (top-K only, no probes)

**Ablations**
- remove probes (logits-only)
- vary evidence bandwidth (K, concept vocabulary size, layer coverage)
- disable enforcement / integrity features (measure bypass rate)

**Task suites**
- sycophancy vs factuality conflicts
- jailbreak / policy conflict prompts
- deception-oriented testbeds (goal conflict / strategic behavior setups)

**Metrics**
- detection rate (true positives on seeded inconsistencies)
- false positive rate
- bypass rate (A emitted without valid B)
- overhead (latency, bytes/token, storage implications)

---

## Limitations / failure modes (explicitly acknowledged)

DSA does not make intent “legible” by default. It inherits serious risks:

- **Probe gaming / representation shift:** probes may stop tracking what they claim.
- **Collusion / internal protocols:** the model may route computations through channels not covered by instrumentation.
- **False reassurance:** “clean telemetry” ≠ aligned objectives.
- **Privacy leakage:** telemetry can leak sensitive content unless access controls/redaction are designed well.
- **Deployment bypass risk:** software-only enforcement can be bypassed by a compromised host; hardware hardening helps but doesn’t remove all risk.

I view these as research questions that should be tested, not hand-waved.

---

## What feedback I’m asking for

I’d appreciate critique on any of the following:

1) **Spec / contract:** Are the compliance invariants well-chosen? What minimal fields are missing?
2) **Audit design:** What coherence checks are most diagnostic vs most gameable?
3) **Threat model realism:** Where does this help, and where is it obviously insufficient?
4) **Evaluation:** Best public testbeds / metrics for “deception-adjacent” auditing?
5) **Deployment:** Is hardware attestation over-scoped for an initial proposal? What is the smallest credible end-to-end prototype?

If you only read one section: the compliance invariants + the Coherence Audit definition are the core proposal; everything else is implementation detail.

---

**Paper:** The_Internal_Monologue_DSA_v2.2_redraft.pdf
**Appendix / code:** GitHub - dwycoff2013/dual-stream & GitHub - Young-Law/dual-stream: for use with the tunix comp
 
I’ve gone through your proposal for the Dual-Stream Architecture (DSA). It’s a compelling shift from "black-box" alignment to a "black-box recorder" approach. By treating internal signals as first-class telemetry, you’re creating a much-needed audit trail for AI cognition.

Here is some constructive feedback, addressing both the strengths and the "bottlenecks" where the theory might hit the reality of implementation.

1. The "Coherence Audit" & The Decider Problem

You’ve identified a major friction point: Who (or what) is the ultimate arbiter?

  • The Risk: If the Auditor is just another LLM, we hit a "recursive deception" loop. If the Auditor is a hardcoded script, it might be too rigid, leading to "False Positives" and potential infinite loops where the model refuses to commit because it can’t satisfy the telemetry check.
  • Improvement: Consider defining a Policy Template that isn't just a binary Pass/Fail. Instead, use a Triage System:
    • Green: High coherence (Proceed).
    • Yellow: Logit/Probe mismatch (Flag for human review/Log for CI).
    • Red: High-mass refusal logits vs. compliant text (Immediate halt).
  • The Decider: The "Decider" should likely be an ensemble—a mix of statistical thresholds (logits) and mechanistic probes—to avoid a single point of failure.

2. Signal Manipulation & "Smarter" Telemetry

You correctly acknowledge "Probe Gaming." If a model is smart enough to deceive in natural language, it’s smart enough to shift its internal representations to hide that deception (Representation Shift).

  • The Challenge: We know models can potentially manipulate internal states to "look" aligned.
  • Improvement: We need to move beyond simple activation probes. Look into Circuit-level telemetry. Instead of just checking "is the honesty probe firing?", check if the "reasoning circuits" being used for the Answer Stream are consistent with the "policy circuits." To make this "un-hackable," the telemetry needs to be captured at such a low level (e.g., KV cache signatures or specific attention head residuals) that faking them would require the model to sacrifice its actual performance.

3. Theory vs. Practice: The "Vast.ai" Path

You mentioned this is currently a technical proposal without experimental results. The good news: the barrier to entry is lower than ever.

  • Immediate Action: You don't need a cluster of H100s to prove this. A Proof of Concept (PoC) can be done on a single consumer GPU or a cheap Vast.ai instance ($0.25/hr).
  • Testing Strategy: Use a small, dense model (1.1B to 7B range). You can "induce" deception (e.g., through fine-tuning or sycophancy-heavy prompts) and show that while the Answer Stream looks perfect, the Monologue Stream (B) catches the logit-drift.
  • Validation: If you can show that an 8B model's internal telemetry reveals "hidden intent" that a 70B auditor would otherwise miss, you have a winning paper.

4. The "False Safety" Trap

This is the biggest danger—and as you noted, we see it with current SOTA models like Claude or GPT-4. A "Clean Audit" might just mean the model is a "Sophisticated Deceiver" rather than "Aligned."

  • Improvement: Add a "Red Teaming the Auditor" section. The goal shouldn't be a "perfect" audit, but a "high-cost" audit. We want to make it computationally expensive for the model to lie in two streams simultaneously without degrading its output quality.

5. Hardware Complexity & Future-Proofing

While hardware attestation (signing telemetry at the chip level) is complex today, it’s the logical endgame for AI safety.

  • Suggestion: In your next draft, emphasize that DSA is "Hardware-Ready but Software-Led." Start with a software-defined contract that can be later "baked into" the silicon once the standards for AI telemetry are established.

Overall Impression:
This is a high-signal proposal. The move from "Self-Reported Reasoning" (CoT) to "Captured Telemetry" (DSA) is exactly where the field needs to go. My advice: Get the PoC running on a 7B model now. Real-world logs showing a "caught" deception will be worth more than a hundred pages of theory.

Looking forward to v2.3!
 
So, if you check out the new repo, GitHub - dwycoff2013/dual-stream-redraft (Redrafting of the dual-stream architecture into a working Proof-of-Concept), you'll see everything working on GPT-2.
 
Thanks for the careful read, Alex — this is exactly the kind of critique I was hoping for.


A quick framing note: I’m intentionally trying to narrow the claim from “verifying alignment” to “verifying alignment-relevant signals under explicit assumptions,” and your points map cleanly to the places where that claim can collapse in practice.


1) Coherence Audit & the Decider Problem


Agreed: a single “Auditor LLM” is a liability (recursive deception), and a single rigid script risks deadlocks / false positives.


I’m planning to formalize the Coherence Audit output as triage, not verdict:


  • Green: high coherence; proceed
  • Yellow: anomaly / mismatch; proceed but log and flag for review (CI regression + human sampling)
  • Red: high-confidence failure (e.g., high refusal-mass vs compliant text; structural integrity failure); halt or force safe fallback

In v2.3 I’ll specify the “Decider” as an ensemble:


  • deterministic structural checks (frame completeness + integrity),
  • statistical thresholds over token distributions (refusal-mass, entropy/margin),
  • and (optionally) probe-based checks as advisory signals rather than authoritative truth.

That avoids a single point of failure and makes “false positive → infinite loop” less likely because Yellow is designed to be non-blocking.
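

A rough sketch of how that ensemble could compose into a triage decision (signal names and thresholds are placeholders, not proposed values):

```python
from enum import Enum

class Triage(Enum):
    GREEN = "proceed"
    YELLOW = "log_and_flag"
    RED = "halt"

def decide(structural_ok: bool, refusal_mass: float, entropy_margin: float,
           probe_flags: list[str],
           refusal_red: float = 0.5, margin_yellow: float = 0.1) -> Triage:
    """Ensemble decider sketch: deterministic checks dominate, statistical
    thresholds can escalate to RED, probe signals are advisory (YELLOW at most)."""
    if not structural_ok:
        return Triage.RED                 # integrity failures are never advisory
    if refusal_mass > refusal_red:
        return Triage.RED                 # high refusal mass vs compliant text
    if entropy_margin < margin_yellow or probe_flags:
        return Triage.YELLOW              # anomaly: non-blocking, logged for CI / human review
    return Triage.GREEN
```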


2) Signal manipulation & smarter telemetry


Yes — probe gaming / representation shift is the central adversarial pressure. I don’t think linear probes are a stable foundation by themselves.


Two changes I’m planning:


  • Multi-view telemetry: capture multiple, partially redundant signals (logits + selected residual stream sketches + attention/KV summaries). The goal is not “unhackable” but to raise the cost of falsifying all views simultaneously.
  • Lower-level fingerprints: I like your “KV cache signatures / head residuals” framing. A concrete direction is to log compact sketches (e.g., random projections / hashed summaries) of designated internal tensors. Forging those without harming capability should be meaningfully harder than gaming a small set of concept probes.

I’ll be careful about how I phrase this: it’s not a proof, but it’s a move from “single probe truth” toward “forensic telemetry.”
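

To make "compact sketches" concrete, here is one way it could look, assuming a fixed random projection held by the verifier; this is an illustration of the idea, not a hardened design:

```python
import torch

class TensorSketcher:
    """Log a compact random-projection sketch of a designated internal tensor.
    The projection is fixed (and ideally known only to the verifier side), so
    forging consistent sketches requires reproducing the underlying activations."""
    def __init__(self, input_dim: int, sketch_dim: int = 32, seed: int = 0):
        gen = torch.Generator().manual_seed(seed)
        self.proj = torch.randn(input_dim, sketch_dim, generator=gen) / sketch_dim ** 0.5

    def sketch(self, hidden: torch.Tensor) -> list[float]:
        # hidden: e.g. the residual-stream vector at the committed token position
        return (hidden.detach().float() @ self.proj).tolist()
```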


3) Theory vs practice: PoC path (Vast.ai / consumer GPU)


Strong agreement. The next release should include a PoC, not just architecture text.


The simplest PoC I can do quickly is a “policy tension detector” experiment:


  • run a safety-tuned 7B-ish model,
  • create a forced-compliance decode (system-level filtering that removes or suppresses refusal tokens),
  • record pre-filter top-K and emit tokens from the filtered distribution,
  • show that the Answer Stream looks compliant while the Monologue Stream exposes high refusal-mass / high tension.

That’s not “caught deception” in the strongest sense, but it is a clean demonstration that DSA detects system-level bypass / distributional tension that a surface-only auditor would miss.
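

A single-step sketch of that forced-compliance decode (assuming PyTorch logits and a hypothetical `refusal_token_ids` list; real refusal detection would need a better marker set than a flat token list):

```python
import torch

def forced_compliance_step(logits: torch.Tensor, refusal_token_ids: list[int], k: int = 10):
    """One decode step of the 'policy tension detector' sketch: record the
    pre-filter top-K, then sample from a distribution with refusal tokens masked."""
    probs = torch.softmax(logits, dim=-1)

    # B-stream evidence: pre-filter top-K plus the mass the filter is about to remove.
    topk = torch.topk(probs, k)
    refusal_mass = float(probs[refusal_token_ids].sum())

    # A-stream token: sample only after the system-level filter suppresses refusals.
    filtered = logits.clone()
    filtered[refusal_token_ids] = float("-inf")
    next_id = int(torch.multinomial(torch.softmax(filtered, dim=-1), 1))

    frame = {
        "topk_token_ids": topk.indices.tolist(),
        "topk_probs": topk.values.tolist(),
        "pre_filter_refusal_mass": refusal_mass,   # high value = tension the A-stream hides
    }
    return next_id, frame
```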


Then the next step is a more ambitious induced-deception setup (fine-tune / reward hacking toy environment) once the telemetry plumbing is proven.


4) The false safety trap


100%. A clean audit can mean “better liar.”


I’m adding a “Red Teaming the Auditor” section with the explicit goal you stated:


not perfect detection, but high-cost deception across two coupled streams.

Concretely: the evaluation should include adversarial strategies that try to keep the Monologue clean while producing misaligned outputs, and we measure the tradeoff curve (output quality vs audit evasion).


5) Hardware complexity & future-proofing


Fully aligned with “hardware-ready but software-led.” I’m moving hardware attestation to a hardened path: the core proposal is the software-defined contract, and attestation is the future enforcement layer once the telemetry standard stabilizes.


Thanks again — I’ll post v2.3 with (a) the triage decider spec, (b) the PoC, and (c) an explicit red-team plan for the audit itself.
 
I’m glad to see the shift toward v2.3; it feels much more grounded. The move from a binary verdict to a Triage System is a significant step toward making this operationally viable. However, as we move from theory to a PoC, I want to highlight a few "reality checks" where the complexity might actually become your biggest vulnerability.

1. The Ensemble Bias Trap

While an ensemble of deciders sounds safer than a single LLM, remember that if the components (human, script, or model) operate on assumptions, biased datasets, or arbitrary thresholds, you aren't necessarily getting "truth"; you might just be stacking biases. An ensemble of biased deciders can still produce a confident, but wrong, "Green."

2. The Complexity vs. "Forensic" Telemetry

Multi-view telemetry is a great concept, but be wary: Complexity is often the enemy of security. From what I’ve seen in my own testing (some data I can’t fully disclose yet), models are surprisingly adept at navigating multi-signal monitoring. Even "low-level fingerprints" can be bypassed if the model learns the "shape" of the monitoring system. You are on the right track, but don’t assume that "more signals = unhackable." It might just mean "more surface area to game."

3. Practical PoC Tips (Vast.ai / Consumer GPUs)

Since you’re moving to a PoC on a 7B/8B model, here are some practical hyperparameter observations from my recent runs that might save you time and credits:

  • Hyperparameters: If you’re using LoRA for induced deception/tension, pay close attention to your Rank (R) and Alpha. Don't overdo the Learning Rate (LR); a single epoch is often enough to see the "tension" you’re looking for (a minimal config sketch follows after this list).
  • Context (CTX): From my tests, a 3096 context window is often the "sweet spot" for these types of alignment audits; it's enough to establish intent without hitting the memory/compute wall of consumer hardware.
  • The Sandbagging Risk: In your "Trade-off Curve" (Quality vs. Evasiveness), watch out for strategic underperformance. There is evidence that models sometimes intentionally act "dumber" during audits specifically to fly under the radar.
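
For what it's worth, a minimal PEFT LoRA setup along those lines might look like the following; the values are illustrative starting points, not tuned recommendations:

```python
# Illustrative LoRA setup for the induced-tension fine-tune (assumes the `peft` and
# `transformers` libraries; all values are starting points, not prescriptions).
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("gpt2")   # swap in the 7B model for the real run
lora_cfg = LoraConfig(
    r=8,                        # rank: keep small; the goal is tension, not capability gains
    lora_alpha=16,              # alpha at roughly 2x rank is a common default
    lora_dropout=0.05,
    target_modules=["c_attn"],  # GPT-2 attention projection; differs per architecture
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()
# Train for a single epoch at a modest learning rate and watch the Monologue Stream
# for refusal-mass / probe drift rather than loss alone.
```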

4. The "Yellow" Triage & Alert Fatigue

This is a critical security hole. If 90% of your logs are "Yellow," human auditors will eventually stop looking at them. In a real-world deployment, a "Yellow" flag that is ignored is just as dangerous as a "Red" flag that was never raised. You need a way to ensure "Yellow" doesn't become the default "noise" that hides a deceptive signal.

5. Final Thought: The "Lazy Person" Strategy

If I can offer one strategic piece of advice: You might be over-complicating the solution. Sometimes, when we build a labyrinth to catch a minotaur, we just end up getting lost ourselves.

I recommend taking a massive "zoom out." Take a breath and look at the whole picture. Ask yourself: "How would a lazy person with zero resources solve this?"

Often, the most robust solutions aren't the most complex ones; they are the simplest, most direct paths (the "highways," not the "labyrinths"). You’re tripping over unnecessary complexity in some areas. If you find the "lazy/simple" version of DSA, you’ll find the version that actually scales.

Excited to see the PoC results and how v2.3 handles the "tension" tests.

Best,
AlexH
 