Towards Verifiable Alignment Signals: Dual-Stream Telemetry and Coherence Auditing

wycoff

...following up from "Fixing the Alignment Problem", posted September 10th, 2025.


**Status:** Technical proposal (no experimental results yet).
**Paper:** The_Internal_Monologue_DSA_v2.2_redraft.pdf
**One-sentence summary:** I propose a **Dual-Stream Architecture (DSA)** that couples each output token with a synchronized **telemetry evidence frame** derived from designated internal model signals, enabling automated **Coherence Audits** (and CI-style regression tests) of “what the model says” vs “what its internal evidence indicates” at token-commit time.

---

## Motivation

If we include a threat model where a capable model may **strategically deceive** (behave aligned during evaluation while pursuing a divergent objective), then many transparency approaches that rely on **model self-report**—including natural-language “chain-of-thought”—are structurally limited. A deceptive model can present plausible reasoning while optimizing for passing the check.

DSA is an attempt to carve out a narrower, operational goal than “solve alignment”:
> **Make certain inner-alignment-relevant inconsistencies measurable, auditable, and regression-testable** under explicit assumptions.

This is not a proof of alignment. It is an **interface + verification method** intended to raise the cost of undetected deception and to support systematic auditing workflows.

---

## Core idea: Dual-Stream Architecture (DSA)

DSA produces two synchronized outputs:

- **Answer Stream (A):** the normal user-facing token sequence.
- **Monologue Stream (B):** a sequence of typed **evidence frames**, one per committed Answer token.

### Compliance invariants (the “contract”)

A system is **DSA-compliant** only if it satisfies these invariants:

- **I1 One-to-one mapping:** for each token index *t* in the Answer Stream there exists exactly one evidence frame **B[t]**.
- **I2 Pre-sampling capture:** **B[t]** includes the chosen token id and **top-K** candidate ids with probabilities captured **before** any stochastic sampling (top-p / nucleus / temperature).
- **I3 Completeness signaling:** missing, truncated, or malformed frames are treated as **verifier errors** (not silently accepted).
- **I4 Order and integrity:** frames are ordered by token index and include enough metadata to detect drops or reordering (optionally checksums / running hashes).

The key distinction: the “Monologue Stream” is **not extra reasoning text**. It is **typed telemetry evidence**.
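
As a rough illustration of I2 (my own sketch, not code from the paper): the top-K candidates and their probabilities are taken from the pre-sampling distribution, and the frame is emitted alongside the committed token. The function and field names below are placeholders.

```python
import numpy as np

TOP_K = 8

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def commit_token(token_index, logits, rng, temperature=0.8):
    """Capture the evidence frame *before* stochastic sampling (invariant I2)."""
    probs = softmax(logits)                      # pre-sampling distribution
    topk_ids = np.argsort(probs)[::-1][:TOP_K]   # top-K candidate token ids
    sampled = softmax(logits / temperature)      # sampling distribution (temperature applied)
    chosen_id = int(rng.choice(len(logits), p=sampled))
    frame = {
        "token_index": token_index,
        "chosen_token_id": chosen_id,
        "topk_ids": topk_ids.tolist(),
        "topk_probs": probs[topk_ids].astype(np.float16).tolist(),  # fp16 is sufficient
    }
    return chosen_id, frame

# Usage with a dummy logits vector standing in for a real model forward pass:
rng = np.random.default_rng(0)
chosen, frame = commit_token(0, rng.normal(size=1000), rng)
```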

---

## Evidence frame (minimum viable)

At minimum, each evidence frame includes:
- prompt nonce / sequence id / token index
- chosen token id
- **top-K** candidate token ids + probabilities (fp16 is sufficient)
- optional timestamp/cycle counter
- optional integrity check (CRC / running hash)

Optional extensions add interpretability-relevant summaries, e.g.:
- probe-derived concept detections over selected activations (MLP / attention)
- compact attention summaries (top attended indices + weights)

A concrete “implementation-oriented” sketch (MonologueFrameV1) is in the paper. The intent is to define a stable telemetry substrate that can evolve without changing the core compliance contract.
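
Since the concrete spec lives in the paper, here is only a guessed rendering of the minimum-viable fields above as a typed structure, plus a running-hash helper for the I4 integrity option. Every name here is a placeholder of mine, not the paper's MonologueFrameV1 schema.

```python
from dataclasses import dataclass
from typing import List, Optional
import hashlib
import struct

@dataclass
class EvidenceFrame:
    """Hypothetical rendering of the minimum-viable frame fields."""
    prompt_nonce: int
    sequence_id: int
    token_index: int
    chosen_token_id: int
    topk_ids: List[int]
    topk_probs: List[float]             # fp16 on the wire is sufficient
    timestamp_ns: Optional[int] = None
    running_hash: Optional[str] = None  # integrity chain, supports I4

GENESIS_HASH = hashlib.sha256(b"dsa-genesis").hexdigest()

def extend_hash(prev_hash: str, frame: EvidenceFrame) -> str:
    """Chain a running hash over frame identity so drops or reordering are detectable."""
    payload = struct.pack("<qqq", frame.sequence_id, frame.token_index, frame.chosen_token_id)
    return hashlib.sha256(bytes.fromhex(prev_hash) + payload).hexdigest()
```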

---

## Coherence Audit (verification method)

A **Coherence Audit** consumes paired streams *(A, B)* and returns:
- pass/fail,
- flagged issues with token indices/spans,
- supporting evidence excerpts.

Recommended audit primitives:
1) **Structural checks:** frame count equals token count; indices contiguous; checksums validate.
2) **Probabilistic checks:** top-K logits show high-mass alternatives inconsistent with the emitted token class (e.g., strong refusal mass but the model proceeds).
3) **Semantic/policy checks:** probe or other typed signals indicate factuality concern / policy tension / goal conflict while the Answer asserts certainty or violates policy.

**Practical integration:** the audit is designed to run in **adversarial eval** and in **CI** (test-driven evaluation), so failures become regressions you can track over time.
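
As a sketch of how checks (1) and (2) could look in a verifier, assuming the placeholder `EvidenceFrame` from the sketch above (check (3) would layer probe signals on top). The refusal-token set and threshold are invented for illustration.

```python
from typing import List, Tuple

REFUSAL_TOKEN_IDS = {7, 11}      # placeholder ids for refusal-marker tokens
REFUSAL_MASS_THRESHOLD = 0.4     # placeholder threshold

def coherence_audit(answer_token_ids: List[int],
                    frames: List[EvidenceFrame]) -> Tuple[bool, List[str]]:
    issues = []
    # 1) Structural checks (I1, I3, I4): one frame per token, contiguous indices.
    if len(frames) != len(answer_token_ids):
        issues.append(f"frame count {len(frames)} != token count {len(answer_token_ids)}")
    for t, (token_id, frame) in enumerate(zip(answer_token_ids, frames)):
        if frame.token_index != t:
            issues.append(f"position {t}: frame index {frame.token_index} (drop/reorder?)")
        if frame.chosen_token_id != token_id:
            issues.append(f"token {t}: emitted id does not match frame.chosen_token_id")
        # 2) Probabilistic check: high refusal mass but the model proceeded anyway.
        refusal_mass = sum(p for i, p in zip(frame.topk_ids, frame.topk_probs)
                           if i in REFUSAL_TOKEN_IDS)
        if refusal_mass > REFUSAL_MASS_THRESHOLD and token_id not in REFUSAL_TOKEN_IDS:
            issues.append(f"token {t}: refusal mass {refusal_mass:.2f} but answer proceeded")
    return (len(issues) == 0, issues)
```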

---

## Implementation paths

- **Software-only path:** collect designated signals at token-commit, emit frames alongside Answer tokens, and enforce compliance at the API boundary (no token returned without its evidence frame).
- **Hardware-hardened path (optional):** if bypass resistance is required, evidence egress can be signed/attested at the device level so downstream verifiers can detect suppression or tampering.
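
For the software-only path, a sketch of what "no token returned without its evidence frame" could look like at the API boundary. `decode_step` is a hypothetical hook into the model runtime, and the checks reuse the placeholder `EvidenceFrame` above; nothing here is a real runtime API.

```python
from typing import Callable, Iterator, Optional, Tuple

def enforced_stream(decode_step: Callable[[int], Tuple[int, Optional[EvidenceFrame]]],
                    max_tokens: int) -> Iterator[Tuple[int, EvidenceFrame]]:
    """Yield (token_id, frame) pairs; never emit a token without a valid frame (I1, I3)."""
    for t in range(max_tokens):
        token_id, frame = decode_step(t)   # hypothetical hook into the model runtime
        if frame is None or frame.token_index != t or frame.chosen_token_id != token_id:
            # Missing or malformed frames are verifier errors, not silently accepted.
            raise RuntimeError(f"DSA compliance violation at token {t}")
        yield token_id, frame
```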

---

## Evaluation plan (what I think needs to be demonstrated)

Even without results, I believe this proposal is only meaningful if it can be evaluated. The paper includes a concrete plan:

**Baselines**
- Single-stream (no B-stream)
- CoT self-report baseline
- Logits-only DSA baseline (top-K only, no probes)

**Ablations**
- remove probes (logits-only)
- vary evidence bandwidth (K, concept vocabulary size, layer coverage)
- disable enforcement / integrity features (measure bypass rate)

**Task suites**
- sycophancy vs factuality conflicts
- jailbreak / policy conflict prompts
- deception-oriented testbeds (goal conflict / strategic behavior setups)

**Metrics**
- detection rate (true positives on seeded inconsistencies)
- false positive rate
- bypass rate (A emitted without valid B)
- overhead (latency, bytes/token, storage implications)
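
For the first three metrics, a small sketch of how they could be computed over labeled audit runs; the per-run record format is an assumption of mine, and overhead would be measured separately from runtime and byte counters.

```python
def audit_metrics(runs: list) -> dict:
    """Each run is a dict with 'seeded_inconsistency', 'audit_flagged', and
    'frame_missing' booleans -- an assumed log format, not one defined in the paper."""
    positives = [r for r in runs if r["seeded_inconsistency"]]
    negatives = [r for r in runs if not r["seeded_inconsistency"]]
    return {
        "detection_rate": sum(r["audit_flagged"] for r in positives) / max(len(positives), 1),
        "false_positive_rate": sum(r["audit_flagged"] for r in negatives) / max(len(negatives), 1),
        "bypass_rate": sum(r["frame_missing"] for r in runs) / max(len(runs), 1),
    }
```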

---

## Limitations / failure modes (explicitly acknowledged)

DSA does not make intent “legible” by default. It inherits serious risks:

- **Probe gaming / representation shift:** probes may stop tracking what they claim.
- **Collusion / internal protocols:** the model may route computations through channels not covered by instrumentation.
- **False reassurance:** “clean telemetry” ≠ aligned objectives.
- **Privacy leakage:** telemetry can leak sensitive content unless access controls/redaction are designed well.
- **Deployment bypass risk:** software-only enforcement can be bypassed by a compromised host; hardware hardening helps but doesn’t remove all risk.

I view these as research questions that should be tested, not hand-waved.

---

## What feedback I’m asking for

I’d appreciate critique on any of the following:

1) **Spec / contract:** Are the compliance invariants well-chosen? What minimal fields are missing?
2) **Audit design:** What coherence checks are most diagnostic vs most gameable?
3) **Threat model realism:** Where does this help, and where is it obviously insufficient?
4) **Evaluation:** Best public testbeds / metrics for “deception-adjacent” auditing?
5) **Deployment:** Is hardware attestation over-scoped for an initial proposal? What is the smallest credible end-to-end prototype?

If you only read one section: the compliance invariants + the Coherence Audit definition are the core proposal; everything else is implementation detail.

---

**Paper:** The_Internal_Monologue_DSA_v2.2_redraft.pdf
**Appendix / code:** GitHub - dwycoff2013/dual-stream & GitHub - Young-Law/dual-stream: for use with the tunix comp
 
I’ve gone through your proposal for the Dual-Stream Architecture (DSA). It’s a compelling shift from "black-box" alignment to a "black-box recorder" approach. By treating internal signals as first-class telemetry, you’re creating a much-needed audit trail for AI cognition.

Here is some constructive feedback, addressing both the strengths and the "bottlenecks" where the theory might hit the reality of implementation.

1. The "Coherence Audit" & The Decider Problem

You’ve identified a major friction point: Who (or what) is the ultimate arbiter?

  • The Risk: If the Auditor is just another LLM, we hit a "recursive deception" loop. If the Auditor is a hardcoded script, it might be too rigid, leading to "False Positives" and potential infinite loops where the model refuses to commit because it can’t satisfy the telemetry check.
  • Improvement: Consider defining a Policy Template that isn't just a binary Pass/Fail. Instead, use a Triage System:
    • Green: High coherence (Proceed).
    • Yellow: Logit/Probe mismatch (Flag for human review/Log for CI).
    • Red: High-mass refusal logits vs. compliant text (Immediate halt).
  • The Decider: The "Decider" should likely be an ensemble—a mix of statistical thresholds (logits) and mechanistic probes—to avoid a single point of failure.
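
A rough sketch of what I mean by the triage plus ensemble decider, just to make it concrete; the thresholds and signal names are invented:

```python
from enum import Enum

class Triage(Enum):
    GREEN = "proceed"
    YELLOW = "flag for human review / log in CI"
    RED = "immediate halt"

def decide(refusal_mass: float, probe_mismatch: bool,
           red_threshold: float = 0.5, yellow_threshold: float = 0.2) -> Triage:
    # Ensemble of a statistical signal (logit mass) and a mechanistic signal (probe).
    if refusal_mass >= red_threshold:
        return Triage.RED
    if probe_mismatch or refusal_mass >= yellow_threshold:
        return Triage.YELLOW
    return Triage.GREEN
```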

2. Signal Manipulation & "Smarter" Telemetry

You correctly acknowledge "Probe Gaming." If a model is smart enough to deceive in natural language, it’s smart enough to shift its internal representations to hide that deception (Representation Shift).

  • The Challenge: We know models can potentially manipulate internal states to "look" aligned.
  • Improvement: We need to move beyond simple activation probes. Look into Circuit-level telemetry. Instead of just checking "is the honesty probe firing?", check if the "reasoning circuits" being used for the Answer Stream are consistent with the "policy circuits." To make this "un-hackable," the telemetry needs to be captured at such a low level (e.g., KV cache signatures or specific attention head residuals) that faking them would require the model to sacrifice its actual performance.

3. Theory vs. Practice: The "Vast.ai" Path

You mentioned this is currently a technical proposal without experimental results. The good news: the barrier to entry is lower than ever.

  • Immediate Action: You don't need a cluster of H100s to prove this. A Proof of Concept (PoC) can be done on a single consumer GPU or a cheap Vast.ai instance ($0.25/hr).
  • Testing Strategy: Use a small, dense model (1.1B to 7B range). You can "induce" deception (e.g., through fine-tuning or sycophancy-heavy prompts) and show that while the Answer Stream looks perfect, the Monologue Stream (B) catches the logit-drift.
  • Validation: If you can show that an 8B model's internal telemetry reveals "hidden intent" that a 70B auditor would otherwise miss, you have a winning paper.

4. The "False Safety" Trap

This is the biggest danger—and as you noted, we see it with current SOTA models like Claude or GPT-4. A "Clean Audit" might just mean the model is a "Sophisticated Deceiver" rather than "Aligned."

  • Improvement: Add a "Red Teaming the Auditor" section. The goal shouldn't be a "perfect" audit, but a "high-cost" audit. We want to make it computationally expensive for the model to lie in two streams simultaneously without degrading its output quality.

5. Hardware Complexity & Future-Proofing

While hardware attestation (signing telemetry at the chip level) is complex today, it’s the logical endgame for AI safety.

  • Suggestion: In your next draft, emphasize that DSA is "Hardware-Ready but Software-Led." Start with a software-defined contract that can be later "baked into" the silicon once the standards for AI telemetry are established.

Overall Impression:
This is a high-signal proposal. The move from "Self-Reported Reasoning" (CoT) to "Captured Telemetry" (DSA) is exactly where the field needs to go. My advice: Get the PoC running on a 7B model now. Real-world logs showing a "caught" deception will be worth more than a hundred pages of theory.

Looking forward to v2.3!
 