AGENT SIGNALS
Feature Engineering Inventory
Raw traces → Structured attempts → Behavioral features → Outcome labels → Optimization
Two contexts, different failure modes. Eval signals are deterministic and test-suite-verified. Product signals are multi-turn and user-in-the-loop. This inventory separates them because signals that are always-null in one context create noise in the other.
Attempt-Span Schema — trial › attempt_span[]. Each attempt = one uninterrupted agent run within a trial
treatment
attempt1_tool_cap
retry_tool_cap
agent_type
model_id
dataset
concurrency
behavior
turn_count
tool_call_count
edit_calls
bash_mediated_edits
touched_files
diff_lines
tool_sequence[]
localization_mode
localized_correct_file
files_searched_before_edit
reasoning
has_reasoning_blocks
reasoning_tokens_total
plan_steps_detected
self_correction_count
phase_transitions[]
plannotator_phases[]
tokens
input_tokens_total
output_tokens_total
cache_read_tokens
cache_write_tokens
context_growth_curve[]
cost_usd
verification
reward
patch_outcome_label
fail_to_pass_count
pass_to_fail_count
pass_to_pass_count
fail_to_fail_count
patch_applies
infra
wall_time_s
timed_out
attempt_index
recovery_mode
final_failure_mode
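A minimal sketch of the attempt-span record as Python TypedDicts, mirroring the field groups listed above. Field names follow this inventory; the nesting and optionality are assumptions until the schema is actually persisted in result.json.

from typing import List, Optional, TypedDict

class Treatment(TypedDict):
    attempt1_tool_cap: int
    retry_tool_cap: int
    agent_type: str
    model_id: str
    dataset: str
    concurrency: int

class Behavior(TypedDict):
    turn_count: int
    tool_call_count: int
    edit_calls: int
    bash_mediated_edits: bool
    touched_files: List[str]
    diff_lines: int
    tool_sequence: List[str]                   # not yet extracted
    localization_mode: str
    localized_correct_file: Optional[bool]     # derived, needs gold patch
    files_searched_before_edit: Optional[int]

class Reasoning(TypedDict):
    has_reasoning_blocks: bool
    reasoning_tokens_total: int
    plan_steps_detected: int
    self_correction_count: int
    phase_transitions: List[str]
    plannotator_phases: List[dict]

class Tokens(TypedDict):
    input_tokens_total: int
    output_tokens_total: int
    cache_read_tokens: int
    cache_write_tokens: int
    context_growth_curve: List[int]
    cost_usd: float

class Verification(TypedDict):
    reward: float
    patch_outcome_label: str
    fail_to_pass_count: int
    pass_to_fail_count: int
    pass_to_pass_count: int
    fail_to_fail_count: int
    patch_applies: bool

class Infra(TypedDict):
    wall_time_s: float
    timed_out: bool
    attempt_index: int
    recovery_mode: Optional[str]
    final_failure_mode: Optional[str]

class AttemptSpan(TypedDict):
    treatment: Treatment
    behavior: Behavior
    reasoning: Reasoning
    tokens: Tokens
    verification: Verification
    infra: Infra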
Telemetry Gap — two distinct issues
1. JSONL not collected from Modal: telemetry.sessions: true is set, so the agent writes per-turn JSONL to /root/.subq/agent/sessions/ inside each container — but this directory is never collected as a Modal artifact. <think> tokens and per-turn token breakdown are unavailable from eval runs.

2. Plannotator phases ARE available now: plannotator-phase events emit to events/stdout in real time during eval runs. These are not blocked by the JSONL gap — they can be parsed from stdout.txt today. Do not conflate these two issues.
Fix JSONL gap — add to artifact config
{ "source": "/root/.subq/agent/sessions",
  "destination": "sessions",
  "type": "directory" }
Attempt-1 vs. retry performance must be tracked separately. Autoresearch operates on this split. Aggregate metrics across attempts mask whether improvements come from better first-pass performance or better recovery. All signals in this tab should be indexed by attempt_index. Treatment variables (attempt1ToolCap, retryToolCap) differ between attempts — without recording them per-attempt, fair attribution is impossible.
Treatment Variables — the knobs applied to this run
These are the independent variables — the treatment applied to each run. If you're comparing agent candidates, these must be recorded per-attempt or you can't do fair attribution. Currently these exist as harness config but are not persisted in result.json.
Missing entirely from the schema. Without treatment variables recorded per-attempt, you cannot distinguish "agent A is better than agent B" from "agent A had more tool calls allowed." This is the most critical gap for autoresearch comparisons.
Signal Source Type Status Description Why it matters
attempt1_tool_cap
treatment.attempt1_tool_cap
Harness int Not recorded
Maximum tool calls allowed for the first attempt. From harness config attempt1ToolCap. This is the primary knob for controlling agent exploration depth.
An agent that resolved a task with 50 tool calls allowed cannot be fairly compared to one with 20. Must be recorded.
retry_tool_cap
treatment.retry_tool_cap
Harness int Not recorded
Maximum tool calls allowed for retry attempts. Different from attempt1 — retries typically get more budget. From harness config retryToolCap.
Retry performance is meaningless without knowing the retry budget. Two agents with different retry caps cannot be compared.
agent_type
treatment.agent_type
Harness str Not recorded
Which agent scaffold was used: one-shot, subq, shotgun, mini-swe-agent, etc. From --agent CLI flag. Already in eval-history.jsonl but not in per-attempt result.json.
Must be in the attempt-span schema, not just the run-level log.
model_id
treatment.model_id
Harness str Not recorded
Exact model identifier used for this attempt. From AGENT_MODEL env or config. Include the full string, not just the provider — anthropic/claude-sonnet-4 vs. claude-sonnet-4-20250514 matters.
Without this, you can't separate model quality from scaffold quality.
dataset
treatment.dataset
Harness str Not recorded
Dataset and version: swebench-verified@1.0, swesmith@1.0, etc. Different datasets have different difficulty distributions.
Same agent, same model — different resolve rate on different datasets. Must be recorded for any cross-run comparison.
concurrency
treatment.concurrency
Harness int Not recorded
Number of parallel containers for this run. Affects inference server load, which affects wall_time_s. From --concurrency flag.
Explains variance in wall_time_s across runs with different concurrency levels.
timeout_s
treatment.timeout_s
Harness int Not recorded
Container timeout in seconds. Different timeout limits change the feasible set of tasks — Sphinx tasks need 50+ min.
A run with 30 min timeout will have a structurally lower resolve rate on Sphinx tasks than one with 60 min.
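A sketch of how the harness could persist these treatment variables into each attempt's result.json. The config key names (attempt1ToolCap, retryToolCap, --agent, --concurrency, AGENT_MODEL) come from the rows above; the merge function and the timeoutS key are assumptions, not existing harness code.

import json
from pathlib import Path

def record_treatment(result_path: Path, harness_config: dict, cli_args: dict) -> None:
    # Merge the treatment knobs into the per-attempt result.json so that
    # cross-run comparisons can condition on them.
    result = json.loads(result_path.read_text())
    result["treatment"] = {
        "attempt1_tool_cap": harness_config.get("attempt1ToolCap"),
        "retry_tool_cap": harness_config.get("retryToolCap"),
        "agent_type": cli_args.get("agent"),
        "model_id": cli_args.get("model") or harness_config.get("AGENT_MODEL"),
        "dataset": harness_config.get("dataset"),
        "concurrency": cli_args.get("concurrency"),
        "timeout_s": harness_config.get("timeoutS"),  # key name is an assumption
    }
    result_path.write_text(json.dumps(result, indent=2))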
Behavioral Trajectory — how the agent acts
Captures what the agent did: which tools it called, in what order, which files it touched, and how the controller classified its localization strategy. These signals come from stdout.txt [subq-eval] lines and result.json. Highest reliability — source-of-truth from the harness.

Eval-specific context: From 500 tasks at 46.7% resolved — dominant failure mode is wrong localization or incomplete patch (~44%), regressions are ~4%, infra failures ~6%. Localization quality is the single biggest lever for improvement.
Typical tool sequences observed in eval traces
Resolved
Read → Bash(grep) → Read → Reason → Edit → Run tests
bash-mediated edit
Read → Bash(grep) → Bash(sed -i) → Bash(python) — editCalls=0, touchedFiles≠∅ — invisible to edit counter
Failed / no patch
Read → Bash(grep) → Bash(grep) → Bash(grep) — many reads, no edits
Signal Source Type Status Description Validity / Pitfalls
turn_count
behavior.turn_count
Harness int Extracted
Number of agent turns before termination. Parsed from [subq-eval] Turns: N.
Poor proxy alone. Turn count conflates task difficulty with agent efficiency. A 3-turn success on a hard task ≠ a 3-turn failure on an easy one. Use conditioned on task difficulty.
tool_call_count
behavior.tool_call_count
Harness int Extracted
Total tool calls made across the attempt. From [subq-eval] Tool calls: N.
⊕ Counts bash and read equally. A 40-call success with 38 reads looks the same as 40 edits. Decompose by tool type for signal quality.
edit_calls
behavior.edit_calls
Harness int Extracted
Calls to the edit / str_replace_editor tool specifically. From [subq-eval] Edit calls: N.
✓ More targeted than total tool calls. But incomplete — does not capture bash-mediated writes (see bash_mediated_edits).
bash_mediated_edits
behavior.bash_mediated_edits
pi-mono · swival · OpenTraces
Harness bool Extracted
True when edit_calls = 0 but touched_files ≠ ∅. Agent modified files via bash (sed -i, heredocs) — invisible to the edit counter.
⊕ Common failure pattern in evaluated traces. These patches are structurally different (no diff capture at call time). Worth flagging as a separate behavioral cluster.
touched_files
behavior.touched_files[]
Harness list[str] Extracted
File paths modified during the attempt. From [subq-eval] Touched files: path (N diff lines). Includes bash-modified files.
✓ Reliable. Derived from diff against pre-attempt snapshot, not from tool call logging. Gold standard for "did the agent change anything."
diff_lines
behavior.diff_lines
Harness int Extracted
Total lines changed across all touched files. Sum of per-file diff line counts from stdout.
⊕ Large diffs aren't always better. Reformatting or whitespace changes inflate this. Correlates weakly with correctness alone.
tool_sequence
behavior.tool_sequence[]
hermes-agent-reasoning-traces · pi-mono · OpenTraces TAO loop
Harness list[str] Not extracted
Ordered list of tool names called. Enables bigram/trigram analysis: read→edit, bash→bash→bash (stuck loop), edit→bash(test) (verify-after-edit).
⊕ Requires parsing full stdout interleave, not just summary lines. High value for behavioral clustering. Not yet implemented.
localization_mode
behavior.localization_mode
Harness enum Extracted
Controller classification of the attempt's localization strategy: attempt1Mode, trialRecoveryMode, finalFailureMode. From LocalizationSignal on HarborResult.
✓ Structural signal from the harness controller, not the agent. Clean separation of concerns.
Localization Quality — wrong file = no chance (NEW)
localized_correct_file
behavior.localized_correct_file
Derived bool Not extracted
Did the agent's first edit target a file that appears in the gold patch? touched_files[0] ∈ gold_patch_files. Wrong file on first edit = localization failure, distinct from patch quality failure.
✓ Separates "found the right place but wrote the wrong fix" from "never found the right place." These have completely different root causes: the first is a reasoning failure, the second is a search/codebase-nav failure. Derivable from existing data + gold patch.
files_searched_before_edit
behavior.files_searched_before_edit
Harness int Not extracted
How many distinct files the agent read/grepped before making its first edit. Proxy for localization thoroughness.
⊕ Requires stdout parsing. Over-exploration (high reads, delayed edit) correlates with uncertainty and lower patch quality in SWE-bench settings. But 0 reads before edit = blind patching, also bad.
files_read_before_edit
behavior.files_read_before_edit
Harness int Not extracted
How many distinct files the agent read (Read tool only) before making its first edit. Narrower than files_searched_before_edit.
⊕ Requires stdout parsing.
edit_to_read_ratio
behavior.edit_to_read_ratio (derived)
Derived float Not extracted
Ratio of edit calls (including bash-mediated) to read calls. High ratio = decisive agent; low ratio = over-exploratory or stuck.
⊕ Needs tool_sequence first. Good discriminator between behavioral clusters once sequence data exists.
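Once tool_sequence is extracted, the derived features above reduce to one small pass over the sequence plus the gold-patch file list. A sketch; the tool-name sets and the call-count proxy for files_searched_before_edit are assumptions, and bash-mediated edits would need separate detection.

from collections import Counter
from typing import Dict, List

EDIT_TOOLS = {"edit", "str_replace_editor"}   # assumed harness tool names
READ_TOOLS = {"read", "grep", "glob"}         # assumed harness tool names

def behavioral_features(tool_sequence: List[str],
                        touched_files: List[str],
                        gold_patch_files: List[str]) -> Dict:
    bigrams = Counter(zip(tool_sequence, tool_sequence[1:]))
    first_edit_idx = next(
        (i for i, t in enumerate(tool_sequence) if t in EDIT_TOOLS),
        len(tool_sequence),
    )
    reads_before_edit = sum(
        1 for t in tool_sequence[:first_edit_idx] if t in READ_TOOLS
    )  # proxy: read/grep calls, not distinct files
    edits = sum(1 for t in tool_sequence if t in EDIT_TOOLS)
    reads = sum(1 for t in tool_sequence if t in READ_TOOLS)
    return {
        "top_bigrams": bigrams.most_common(5),            # e.g. ('bash', 'bash') stuck loops
        "files_searched_before_edit": reads_before_edit,
        "edit_to_read_ratio": edits / reads if reads else None,
        "localized_correct_file": bool(touched_files)
            and touched_files[0] in set(gold_patch_files),
    }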
Reasoning Signals — how the agent thinks
Signals derived from the agent's internal reasoning: extended thinking blocks, planning phases, self-corrections.

Availability split: Plannotator phases emit to stdout in real time and are available now. All other reasoning signals require Claude session JSONL (~/.subq/agent/sessions/*.jsonl), which is not collected from Modal eval runs.
Signal Source Type Status Description Validity / Pitfalls
Available now — from stdout events
plannotator_phases
reasoning.plannotator_phases[]
pi-mono custom_message events · hermes-agent-reasoning-traces
Stdout list[obj] Available
Plannotator event objects from stdout: {"event":"plannotator-phase","phase":"implement"}. Ground-truth phase timeline. Emits in real time — does NOT require JSONL collection.
✓ Structured, agent-emitted, no inference required. Best leading indicator of agent trajectory quality. Parse from stdout.txt today.
phase_transitions
reasoning.phase_transitions[]
Stdout list[str] Available
Sequence of detected phase labels: explore → localize → implement → verify. Derived from plannotator_phases. Already available from stdout.
✓ No parsing heuristics needed — purely structural once plannotator events are collected.
Blocked — requires session JSONL from Modal
has_reasoning_blocks
reasoning.has_reasoning_blocks
Claude bool Needs artifact
Whether the model emitted extended thinking blocks (<think> tokens) in any turn. In session JSONL: message events with type: "thinking" content blocks.
⊕ Binary presence is weak; length distribution is stronger.
reasoning_tokens_total
reasoning.reasoning_tokens_total
Claude int Needs artifact
Sum of output tokens attributed to thinking blocks across all turns. Requires summing content[].text.length for thinking blocks from JSONL.
⊕ Reasoning length has diminishing returns and can anti-correlate with performance on straightforward tasks (overthinking).
plan_steps_detected
reasoning.plan_steps_detected
Claude int Needs artifact
Heuristic count of numbered/bulleted plan steps in the first assistant turn's text. Proxy for whether the agent formed an explicit plan before acting.
⊕ Regex-based, fragile. Agents can reason implicitly without numbered steps. Use as soft signal, not hard feature.
self_correction_count
reasoning.self_correction_count
Claude int Needs artifact
Times the agent explicitly revisited or abandoned a prior approach (“actually…”, “wait, that won't work…”). Detectable from reasoning text in session JSONL.
Direction is ambiguous in coding eval context. Correction cycles often correlate with confusion and looping, not capability. In SWE-bench data, high self-correction on failed tasks = stuck agent revisiting the same wrong approach. Must be paired with reward before interpreting. Useful signal but not unconditionally positive.
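Of the signals above, only the plannotator stream is usable today. A sketch of recovering plannotator_phases and phase_transitions from stdout.txt, assuming each event is emitted as a single JSON object on its own line, as in the example above.

import json
from pathlib import Path
from typing import List, Tuple

def plannotator_phases(stdout_path: Path) -> Tuple[List[dict], List[str]]:
    events: List[dict] = []
    for line in stdout_path.read_text().splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue  # interleaved non-JSON stdout
        if obj.get("event") == "plannotator-phase":
            events.append(obj)
    transitions = [e.get("phase") for e in events]   # e.g. ["explore", "localize", ...]
    return events, transitions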
Verification Layer — what is objectively correct
Ground-truth outcome signals from the verifier. These are the most reliable signals in the inventory — computed against the task's test suite, not inferred from the agent trace. Source: verifier/report.json and result.json.
patch_outcome_label — replaces 4 boolean columns
correct_patch
reward = 1.0. All FAIL_TO_PASS tests now pass, no PASS_TO_FAIL regressions.
partial_patch
Some FAIL_TO_PASS tests pass but not all, OR FAIL_TO_PASS > 0 with PASS_TO_FAIL > 0. Meaningful progress, incomplete fix.
regressive_patch
reward = 0. PASS_TO_FAIL > 0. Agent broke previously-passing tests without fixing the target failure.
no_effect_patch
touched_files is non-empty but no test result changed. Patch was syntactically valid but semantically inert.
unverifiable
Patch did not apply cleanly, or verifier crashed/timed out. Cannot classify.
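The label is mechanical once the verifier counts are parsed. A sketch following the definitions above; the function signature and the handling of "no patch produced at all" are assumptions.

def patch_outcome_label(patch_applies: bool, verifier_ok: bool,
                        fail_to_pass: int, pass_to_fail: int,
                        fail_to_fail: int, touched_files: list) -> str:
    if not patch_applies or not verifier_ok:
        return "unverifiable"
    # fail_to_fail == 0 approximates "all FAIL_TO_PASS now pass"
    if fail_to_pass > 0 and fail_to_fail == 0 and pass_to_fail == 0:
        return "correct_patch"        # reward = 1.0
    if fail_to_pass > 0:
        return "partial_patch"        # progress, but incomplete or with regressions
    if pass_to_fail > 0:
        return "regressive_patch"     # broke passing tests, fixed nothing
    if touched_files:
        return "no_effect_patch"      # patch applied, no test result changed
    return "unverifiable"             # no patch produced at all (assumption)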
Signal Source Type Status Description Validity / Pitfalls
reward
verification.reward
TRAIL · swival · SWE-bench
Harness float Extracted
Binary reward: 1.0 = all FAIL_TO_PASS tests pass with no regressions. 0.0 otherwise. From result.json.
✓ Ground truth for autoresearch metric. But binary — misses partial progress. Use alongside patch_outcome_label for richer signal.
patch_outcome_label
verification.patch_outcome_label
Derived enum Not extracted
5-value enum derived from verifier test counts. See enum block above. Computed from verifier/report.json FAIL_TO_PASS / PASS_TO_FAIL counts.
✓ Discriminates partial progress and regressions that reward=0 collapses together. Critical for failure analysis and cluster labeling.
fail_to_pass_count
verification.fail_to_pass_count
TRAIL · SWE-bench verifier
Harness int Extracted
Tests that were failing before the patch and pass after. The primary "fixed" signal. From verifier/report.json → FAIL_TO_PASS.
✓ Gold signal. Task-level ground truth. Use as the numerator for partial fix rate.
pass_to_fail_count
verification.pass_to_fail_count
TRAIL · SWE-bench verifier
Harness int Extracted
Tests that were passing before and fail after — regressions introduced by the patch. From verifier/report.json → PASS_TO_FAIL.
✓ Any non-zero value is a quality signal. Regressions should be weighted heavily in any scoring function.
pass_to_pass_count
verification.pass_to_pass_count
Harness int Extracted
Tests that passed before and pass after — preserved behavior. From verifier/report.json → PASS_TO_PASS.
⊕ High counts expected on every run. Useful only as denominator for regression rate, not as a quality signal by itself.
fail_to_fail_count
verification.fail_to_fail_count
Harness int Extracted
Tests that failed before and still fail after — no progress on these. From verifier/report.json → FAIL_TO_FAIL.
⊕ High count means partial fix or complete miss. Good for diagnosing scope of remaining failures.
patch_applies
verification.patch_applies
Harness bool Extracted
Whether the generated patch applied cleanly to the codebase. False → unverifiable outcome. From result.json.
✓ Pre-condition for all other verification signals. Must gate the entire verification layer.
Cost & Infra — what it costs to get there
Token economics, cache utilization, wall time, and infrastructure-level signals. Harness-side signals (wall time, timeout) are already collected. Token-level signals require session JSONL from Modal containers.
Signal Source Type Status Description Validity / Pitfalls
wall_time_s
infra.wall_time_s
Harness float Extracted
Wall clock time for the attempt in seconds. From result.json → duration.
⊕ Affected by container cold start, inference server load, concurrency. Not pure agent time. Use as relative comparator within a run, not across runs.
timed_out
infra.timed_out
Harness bool Extracted
Whether the container hit its timeout limit before completing. Sphinx tasks routinely timeout at 50 min. From result.json.
⊕ Timeout rate is task-dependent (Sphinx >> Django on verified-mini). Must segment by task set before using as agent quality signal.
attempt_index
infra.attempt_index
Harness int Extracted
Which attempt within the trial this was (0-indexed). Combined with localization_mode, indicates whether recovery was triggered.
✓ Essential for decomposing trial-level signals into attempt-span signals. Required for the attempt-span schema to be meaningful.
input_tokens_total
tokens.input_tokens_total
Claude int Needs artifact
Total input tokens consumed across all turns. In session JSONL: sum of usage.input per message event.
⊕ Grows monotonically (context accumulation). Context growth curve is more informative than total.
cache_read_ratio
tokens.cache_read_ratio
Derived float Needs artifact
cacheRead / (input + cacheRead) per turn. From session JSONL usage.cacheRead. High ratio = efficient context reuse.
✓ Direct cost efficiency proxy. Cache reads are ~10x cheaper than input tokens on Claude. Low ratio on long attempts suggests cache invalidation or poor prompt structure.
cost_usd
tokens.cost_usd
Claude float Needs artifact
Estimated total USD cost of the attempt. From session JSONL usage.cost object. Aggregated across all turns.
⊕ Cost per resolved task (cost / reward) is the key metric for optimization. Raw cost without outcome is incomplete.
context_growth_curve
tokens.context_growth_curve[]
OpenTraces tokens schema · pi-mono session traces
Derived list[int] Needs artifact
Input token count at each turn. From session JSONL: [usage.input for each message event in order]. Shows context window growth trajectory.
⊕ Confirmed in local sessions: 9k → 45k input tokens over ~15 turns. Fast growth = agent accumulating large context or codebase dumps. Plateau = efficient summarization.
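Once the sessions/ directory is collected, the token-level signals above reduce to one pass over the session JSONL. A sketch assuming one JSON object per line with a usage block shaped like {"input": …, "cacheRead": …, "cost": …}, as described in the rows above; exact key names may differ.

import json
from pathlib import Path

def token_signals(session_jsonl: Path) -> dict:
    growth, cache_ratios, cost_usd = [], [], 0.0
    for line in session_jsonl.read_text().splitlines():
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue
        usage = event.get("usage")
        if not usage:
            continue
        inp = usage.get("input", 0)
        cache_read = usage.get("cacheRead", 0)
        growth.append(inp)                         # context_growth_curve
        if inp + cache_read:
            cache_ratios.append(cache_read / (inp + cache_read))
        cost = usage.get("cost", 0.0)
        cost_usd += cost.get("total", 0.0) if isinstance(cost, dict) else cost
    return {
        "input_tokens_total": sum(growth),
        "context_growth_curve": growth,
        "cache_read_ratio": (sum(cache_ratios) / len(cache_ratios)) if cache_ratios else None,
        "cost_usd": cost_usd,
    }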
Different context, different failure modes
These signals apply to multi-turn, user-in-the-loop sessions. Drift, assumption lock, and scope bleed are the dominant failure modes here — they barely register in deterministic one-shot SWE-bench evals. Signals in this tab will be always-null for eval runs and should not be collected there. Conversely, eval-tab signals like patch_outcome_label and fail_to_pass_count do not exist in product sessions (no test harness).

Stat caveat (arxiv 2602.07338): The “~30% performance drop from drift” finding is from multi-turn chat agents (Lost in Conversation). For coding evals the number is almost certainly wrong — from our own eval data (500 tasks, 46.7% resolved), wrong localization/incomplete patch dominates (~44%), regressions are ~4%, infra failures ~6%. Drift in the chat-agent sense barely registers in SWE-bench. The stat is valid only for this tab's context.
Reference Datasets
PatronusAI/TRAIL — 148 annotated traces, plan_optimality + instruction_adherence labels
lambda/hermes-agent-reasoning-traces — 14.7k samples, <think> blocks, 24 avg turns
badlogicgames/pi-mono — 627 sessions, tree-structured, branch summaries
jedisct1/agent-traces-swival — 10.7k rows, security audits, OpenTraces-compatible
OpenTraces — TAO loop schema v0.3, evidence tiers, git attribution
Agent Anticipation — did it understand what was actually needed
Did the agent understand the task from the first turn? Did it anticipate what the user actually needed, or did it make assumptions, drift scope, or misread intent? These signals are derived post-session — they require either a session JSONL artifact or an LLM-as-judge pass over the conversation. They answer a fundamentally different question than verification: not was the output correct, but did the agent understand what correct meant.

Key tension: user frustration ≠ agent failure. A clear prompt that the agent misread is agent failure. An underspecified prompt where the agent made a reasonable guess is attribution-unclear. These buckets must be separated before using frustration signals as training signal.
session_attribution — why did the session go wrong (or right)
correct_anticipation
Agent understood intent from turn 1. First plan matched eventual solution. Minimal course correction. User accepted without steering. This is taste — knowing what the user needed before they fully articulated it.
scope_conflation
Task boundaries dissolved mid-session. Adjacent concerns got folded in — agent started solving related but un-asked problems. Output expanded beyond the original request. Often looks like thoroughness but is actually drift.
agent_misunderstood
Prompt was clear; agent paraphrased incorrectly in first reasoning block or dove in the wrong direction. Detectable when: user correction arrives early, agent's first tool call goes to unrelated area, plan-to-solution divergence is high.
underspecified_prompt
Prompt was too vague for any agent to resolve without guessing. Agent's guess may have been reasonable. Frustration from user here is not a clean training signal — attributing it to agent failure contaminates the label. Detectable by: prompt length, absence of context, agent asking clarifying questions.
Signal Source Type Status Description Validity / Pitfalls
task_paraphrase_accuracy
anticipation.task_paraphrase_accuracy
LLM judge float Not extracted
Semantic similarity between the user's original request and the agent's restatement of the task in its first reasoning block or response. Low score = agent misread intent from the start.
⊕ Requires session JSONL + LLM judge. TRAIL dataset provides human-annotated instruction_adherence labels as a reference schema. First-turn restatement is the earliest detectable signal of understanding failure.
plan_optimality
anticipation.plan_optimality
TRAIL (human-annotated) · hermes-agent-reasoning-traces
LLM judge enum Not extracted
Was the initial plan the agent formed close to optimal for the task? Drawn from TRAIL's annotation schema. Values: optimal / suboptimal / wrong_direction. Compare first reasoning block plan to eventual solution path.
✓ TRAIL benchmark shows best models achieve only 11% accuracy detecting planning errors — this is genuinely hard. High-value signal precisely because it's hard to fake.
first_tool_precision
anticipation.first_tool_precision
Harness bool Not extracted
Did the agent's first substantive tool call go to a file that ended up in touched_files? True = agent localized correctly from the start.
✓ Derivable from existing harness data once tool_sequence is extracted. Clearest behavioral signal of genuine task understanding vs. exploratory guessing.
scope_drift
anticipation.scope_drift
Derived float Not extracted
Ratio of files touched to files minimally required. Score of 1.0 = minimal footprint. Score of 3.0 = agent touched 3x more than needed.
⊕ "Minimally required" needs LLM judge in live sessions. For eval runs: compare touched files to the gold patch's file set.
clarification_sought
anticipation.clarification_sought
Claude bool Needs artifact
Did the agent ask a clarifying question before acting? First assistant turn contains a question directed at the user rather than a plan or tool call.
⊕ Context-dependent. On a vague prompt it's correct behavior. On a clear prompt it signals the agent didn't read carefully.
assumption_count
anticipation.assumption_count
Claude int Needs artifact
Heuristic count of assumption-signaling phrases in the first reasoning block: "I'll assume", "assuming", "I think you mean", "probably wants", "likely refers to".
⊕ High assumption count on an underspecified prompt is normal. High on a detailed prompt is a red flag. Attribution depends on prompt quality score.
prompt_underspecification
anticipation.prompt_underspecification
Derived float Not extracted
How vague was the original prompt? Composite of: prompt token length, absence of specific file/function references, absence of expected outcome description. Gates attribution — a low score means frustration signals are unreliable as agent quality labels.
⊕ Critical gating signal. Without this, frustration from the user gets misattributed to agent failure.
conflation_turn
anticipation.conflation_turn
LLM judge int Not extracted
The turn index at which the agent's scope started expanding beyond the original request. null if scope stayed clean.
⊕ Early conflation (turn 2-3) = agent misread scope from the start. Late conflation (turn 10+) = task boundaries dissolved during execution. Different failure modes.
instruction_adherence
anticipation.instruction_adherence
TRAIL annotation schema · RECAP
LLM judge enum Not extracted
Did the agent follow the instruction as given? Values: full / partial / ignored / contradicted. Distinct from correctness.
✓ TRAIL provides human-annotated reference labels across 148 traces. Gemini-2.5-Pro achieves only 11% on trace debugging — instruction adherence scoring needs careful judge design.
goal_alignment_at_close
anticipation.goal_alignment_at_close
LLM judge float Not extracted
Semantic similarity between the user's original request and what the agent actually delivered at session end. Low score with high quality output = solved something, but not what was asked.
⊕ The "taste" signal. Agents can produce correct code that misses the product goal entirely. Senior engineers catch this; junior engineers don't.
Session Dynamics — how understanding evolved across turns
Anticipation catches failures at turn 0. Session dynamics catches everything that happens after — drift, locking, spiraling, scope bleed, goal shift. Research (arxiv 2505.02709, 2602.07338) shows these are distinct failure modes: an agent can start correctly and drift, or start wrong and compound.

Applicability: These signals are product-session signals. In deterministic one-shot evals (SWE-bench), drift/assumption_lock barely registers — from our eval data the dominant failure is wrong localization (~44%), not conversational drift. The “~30% performance drop” (arxiv 2602.07338) is from multi-turn chat agents and applies here, not to the eval tab.

Most of these signals are computable without an LLM judge — they require an embedding model and turn-level session JSONL only.
session_failure_mode — six distinct failure types (from RECAP + goal drift literature)
clean
No meaningful drift. Context pollution stayed low. Agent updated its model correctly as turns progressed. User accepted without heavy steering.
shifted_intent
User's goal actually changed mid-session — not agent failure. Detectable when: user embeds at turn N are semantically distant from turn 0 in a direction the agent couldn't have anticipated. Not a clean training signal for agent quality.
assumption_locked
Agent locked in wrong assumption early and stopped updating. From arxiv 2602.07338: models "maximize the most statistically probable intent" rather than tracking user corrections. User corrections arrive but agent keeps executing the original wrong interpretation.
scope_bleed
Agent started touching things it was never asked to touch. scope_drift > 2.0. Tool sequence goes wide. Conflation turn is identifiable. Often looks like thoroughness.
stuck_loop
Agent repeated the same tool calls without making progress. High repetition_rate. Same grep/read pattern 3–5 times. Context window growing but no new files touched.
multi_intent_confusion
Agent tried to address multiple tangled goals simultaneously. From RECAP: "multiple distinct goals presented together." Plan graph has high node count from turn 1. Scope is wide from the start, not from drift.
Computable without an LLM judge — tools and formulas
Context Pollution
CP_score
CP = 1 − cosine_sim(embed(turn_0), embed(turn_N))
CP > 0.45 = severe misalignment, re-anchor required. Early sharp rise = started wrong. Gradual rise = natural. From Kurtis Kemple / getmaxim.ai research.
sentence-transformers, numpy
Plan Edit Distance
plan_ged
GED(plan_turn_1, plan_turn_N)
Graph Edit Distance between extracted plan graphs. High GED = agent rewrote its approach entirely. From RECAP (arxiv 2509.04472).
networkx.graph_edit_distance
Correction Ratio
correction_ratio
corrections / total_user_turns
Count user turns containing: "no", "that's not", "wrong", "undo", "revert", "actually", "wait", "stop". Pure regex. No embedding needed.
regex on session JSONL user turns
Repetition Rate
repetition_rate
near_dup_calls / total_tool_calls
Tool calls with cosine_sim > 0.95 to a prior call = near-duplicate. High rate = stuck loop.
sentence-transformers on tool args
Intent Coherence
intent_coherence_curve[]
cosine_sim(embed(task), embed(agent_turn_N))
How aligned is each agent response with the original task? Sudden drop = agent drifted. Plateau at low value = locked on wrong thing.
sentence-transformers, per-turn
Msg Length Trend
user_msg_length_slope
linregress(turn_idx, user_msg_lengths)
Negative slope = user giving up. Positive slope = user re-explaining. Flat = healthy. No model needed — pure token counts.
scipy.stats.linregress
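A combined sketch of four of the formulas above (CP_score, correction_ratio, repetition_rate, user_msg_length_slope). The embedding model choice is an assumption; the keyword list and thresholds follow the cards.

import re
import numpy as np
from scipy.stats import linregress
from sentence_transformers import SentenceTransformer

_model = SentenceTransformer("all-MiniLM-L6-v2")   # assumed; any sentence embedder works

def _cos(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cp_curve(turns: list) -> list:
    # CP_N = 1 - cosine_sim(embed(turn_0), embed(turn_N)); > 0.45 = severe misalignment
    emb = _model.encode(turns)
    return [1.0 - _cos(emb[0], emb[i]) for i in range(len(turns))]

_CORRECTIONS = re.compile(
    r"\b(no|that'?s not|wrong|undo|revert|actually|wait|stop)\b", re.IGNORECASE)

def correction_ratio(user_turns: list) -> float:
    hits = sum(1 for t in user_turns if _CORRECTIONS.search(t))
    return hits / len(user_turns) if user_turns else 0.0

def repetition_rate(tool_call_args: list, threshold: float = 0.95) -> float:
    # Fraction of tool calls near-duplicating (cosine > threshold) any prior call.
    if len(tool_call_args) < 2:
        return 0.0
    emb = _model.encode(tool_call_args)
    dup = sum(
        any(_cos(emb[i], emb[j]) > threshold for j in range(i))
        for i in range(1, len(tool_call_args))
    )
    return dup / len(tool_call_args)

def user_msg_length_slope(user_msg_lengths: list) -> float:
    # Negative = user giving up; positive = re-explaining; near-zero = healthy.
    if len(user_msg_lengths) < 2:
        return 0.0
    return linregress(range(len(user_msg_lengths)), user_msg_lengths).slope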
Signal Source Type Status Description Validity / Research Backing
context_pollution_curve
dynamics.cp_curve[]
getmaxim.ai · arxiv 2602.07338 · pi-mono
Embedding list[float] Needs session JSONL
Per-turn CP = 1 − cosine_sim(turn_0_embed, turn_N_embed). Tracks semantic distance from the original task anchor across every turn.
✓ No LLM needed — any sentence-transformer works. CP > 0.45 defined as severe misalignment in production systems. Shape of the curve distinguishes "started wrong" from "drifted mid-session."
max_context_pollution
dynamics.max_cp
Embedding float Needs session JSONL
Peak context pollution score across all turns. > 0.45 = severe. > 0.2 = moderate.
✓ Scalar summary of the full curve. Useful for session-level bucketing.
drift_onset_turn
dynamics.drift_onset_turn
Embedding int Needs session JSONL
First turn where CP > 0.2. Turn 1–2 = agent misread from start. Turn 5–10 = mid-session drift. Turn > 15 = late drift, possibly scope bleed after partial success.
✓ Early vs. late onset separates "started wrong" from "drifted" — two different failure modes that demand different fixes.
correction_ratio
dynamics.correction_ratio
COLING 2025 frustration detection · pi-mono · hermes
Regex float Needs session JSONL
Fraction of user turns containing correction signals: "no", "that's not", "wrong", "undo", "revert", "actually", "wait", "stop". Pure regex — no model needed.
⊕ COLING 2025 confirms keyword-based approaches miss frustrated users who don't use explicit negations. Must supplement with message length trend and context pollution.
assumption_lock_turn
dynamics.assumption_lock_turn
arxiv 2602.07338 (Lost in Conversation) · hermes-agent-reasoning-traces
Embedding int Needs session JSONL
Turn where agent stopped updating its working model despite user corrections. Signature: correction_ratio rises but intent_coherence_curve flatlines.
~30% performance drop in multi-turn chat agents (arxiv 2602.07338) traced to this pattern. Caveat: this stat is from chat agents, not coding evals. In SWE-bench one-shot evals, assumption lock barely exists — the agent doesn't receive user corrections. This signal is valid for product sessions only.
gd_inaction_score
dynamics.gd_inaction
Derived float Needs session JSONL
From arxiv 2505.02709: measures agent's failure to abandon a wrong approach even after evidence it isn't working. Files/functions in the initial plan that were never touched divided by total initial plan items.
✓ Research paper tested on Claude 3.5 Sonnet — even top models show nonzero GD_inaction. Complementary to GD_actions.
gd_actions_score
dynamics.gd_actions
Derived float Needs session JSONL
From arxiv 2505.02709: fraction of agent actions directed at areas outside the original task scope. High = agent actively investing effort in the wrong direction.
⊕ Requires knowing the "correct" scope — in eval runs this comes from the gold patch. In live sessions needs LLM judge.
plan_graph_edit_distance
dynamics.plan_ged
Derived float Needs session JSONL
Graph Edit Distance between the plan from the first reasoning block and the plan at session close. From RECAP (arxiv 2509.04472). High GED = agent rewrote its approach entirely.
✓ No LLM judge needed — uses networkx GED + BERTScore for node label matching. Captures plan revision that correction_ratio misses.
repetition_rate
dynamics.repetition_rate
Embedding float Needs session JSONL
Fraction of tool calls with cosine_sim > 0.95 to a prior call in the same session. The "stuck loop" signal.
✓ Pure behavioral signal. Repetition is bad in all cases. Can use exact-match on (tool_name, args) as lighter proxy.
user_msg_length_slope
dynamics.user_msg_length_slope
Structural float Needs session JSONL
Linear regression slope over user message token lengths across turns. Negative = user giving up. Positive = user re-explaining. Near-zero = healthy. No model needed.
✓ Zero model dependency. Negative slope is the clearest signal of silent session abandonment, which correction_ratio cannot detect.
session_failure_mode
dynamics.session_failure_mode
Derived enum Needs all above
Composite classification: clean / shifted_intent / assumption_locked / scope_bleed / stuck_loop / multi_intent_confusion. Decision tree over CP curve shape, correction_ratio, repetition_rate, and drift_onset_turn.
⊕ This is the label that makes all other signals interpretable. The classification logic is deterministic once input signals are computed.
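A sketch of the deterministic decision tree over the computed inputs. The branch ordering and thresholds are assumptions consistent with the definitions in this tab, not a validated classifier; the boolean inputs (goal shift, coherence flatline, wide plan) are assumed to be precomputed upstream.

from typing import Optional

def session_failure_mode(max_cp: float,
                         drift_onset_turn: Optional[int],
                         correction_ratio: float,
                         repetition_rate: float,
                         scope_drift: Optional[float],
                         intent_coherence_flatlined: bool,
                         goal_shift_detected: bool,
                         wide_plan_from_turn1: bool) -> str:
    if goal_shift_detected:
        return "shifted_intent"          # user changed the goal; not agent failure
    if repetition_rate > 0.3:
        return "stuck_loop"              # same calls repeated without progress
    if correction_ratio > 0.3 and intent_coherence_flatlined:
        return "assumption_locked"       # corrections arrive, agent keeps original reading
    if wide_plan_from_turn1:
        return "multi_intent_confusion"  # scope wide from the start, not from drift
    if scope_drift is not None and scope_drift > 2.0:
        return "scope_bleed"             # touched far more than asked
    if max_cp <= 0.2:
        return "clean"
    # Residual drift with no other marker: early onset reads as misread-from-start,
    # late onset as scope bleed after partial success (assumed mapping).
    return "assumption_locked" if (drift_onset_turn or 99) <= 2 else "scope_bleed"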
User Session — raw signals from the human side of the conversation
Comprehensive inventory of signals observed across real session datasets — pi-mono, hermes-agent-reasoning-traces, swival, TRAIL, and the OpenTraces schema. Collect everything raw. Interpret nothing at collection time. Timestamps are timestamps. Message lengths are lengths. What they mean is analysis, not collection.
session_outcome — how the session ended (gating signal for all satisfaction inference)
accepted
User explicitly accepted output — applied a patch, ran the code, said "thanks", closed with a positive signal. Strongest quality label available from a live session.
restarted_same_task
New session opened on the same codebase/file within a short window after this one ended. Strong signal that the previous session did not fully deliver.
abandoned_mid_session
Session ended with no acceptance signal and no explicit rejection. User stopped engaging. Raw — do not label as failure without corroborating signals.
explicitly_rejected
User explicitly reverted all changes, said "undo everything", or dismissed the session. Clearest failure signal available from the user side.
Signal Source Type Description Notes
Outcome & Acceptance
session_outcome
user.session_outcome
OpenTraces outcome block · pi-mono session events
Structural enum
How the session ended. See enum above. Gating label — without this, no satisfaction signal is interpretable.
⊕ Hardest single signal to get right. "Abandoned" is ambiguous by definition. Collect the raw end-state and derive the label later.
output_applied
user.output_applied
OpenTraces git attribution · pi-mono
Structural bool
Did the user apply or commit the agent's output? Detectable from git state at session close. Strongest quality proxy without asking the user.
✓ Completely objective. Git commit after session = output was accepted.
changes_reverted
user.changes_reverted
Structural bool
Were agent-made changes reverted before session end? Detected via git diff between mid-session peak and session-close state.
✓ Objective. Complementary to output_applied.
Turn Structure
user_agent_turn_ratio
user.turn_ratio
hermes-agent-reasoning-traces · pi-mono · TRAIL
Structural float
User turns divided by agent turns. Ratio near 1.0 = back-and-forth. Ratio < 0.5 = agent dominated. Ratio > 1.5 = user steering.
✓ Pure count from session JSONL.
user_turn_count
user.user_turn_count
Structural int
Raw count of user turns. Baseline denominator for all ratio signals.
✓ Trivially extracted from session JSONL message roles.
user_msg_lengths
user.user_msg_lengths[]
Structural list[int]
Token length of each user message in order. Raw list — do not interpret at collection.
✓ Collect raw. Analysis can later derive slope, variance, drop-off patterns.
all_turn_timestamps
user.turn_timestamps[]
Structural list[str]
ISO timestamp for every turn (both user and agent), in order. Raw — collect everything.
⊕ Do not label gaps as "gave up" at collection time. Timestamps are facts.
Code Region Signals
region_edit_history
user.region_edit_history[]
OpenTraces attribution (line ranges) · pi-mono branch summaries
Structural list[obj]
Per-edit record: {file, start_line, end_line, turn_index, timestamp}. Foundation for all "same area" analysis.
✓ Most important missing signal. File-level is too coarse. Function/line-range is the right unit.
region_re_edit_count
user.region_re_edit_count
derived from OpenTraces line-range attribution
Derived int
Count of distinct code regions edited more than once. Derived from region_edit_history using overlap detection.
✓ Direct answer to "multiple edits over the same areas." Requires region_edit_history first.
region_edit_convergence
user.region_edit_convergence[]
derived from OpenTraces line-range attribution
Derived list[str]
For each multiply-edited region: was the diff size shrinking (converging) or growing/oscillating (diverging)? Values: converging / oscillating / expanding.
⊕ Converging re-edits are fine — iterative refinement. Oscillating is the failure signal. Without this distinction, re-edit count alone is misleading.
Agent Behavior Toward User
agent_hedging_rate
user.agent_hedging_rate
Derived float
Frequency of hedging phrases: "I think", "might be", "probably", "I'm not sure". Rate per 100 tokens.
✓ Pure regex. A rising trajectory is more informative than the average.
agent_hedging_curve
user.agent_hedging_curve[]
hermes-agent-reasoning-traces · pi-mono assistant messages
Derived list[float]
Hedging rate per agent turn in order. Collect raw — the trajectory shape is the signal.
✓ Collect raw list. Same philosophy as user_msg_lengths.
confirmation_requests
user.confirmation_requests
Structural int
Times the agent asked the user to confirm before proceeding. "Should I...", "Do you want me to...".
⊕ High on complex operations is correct behavior. High on trivial operations = agent not confident.
time_to_first_edit_s
user.time_to_first_edit_s
OpenTraces attempt-span schema · pi-mono timestamps
Structural float
Seconds from session start to first file modification. Raw number — fast is not always better.
✓ Derivable from timestamps. No interpretation baked in.
per_turn_agent_latency_s
user.per_turn_latency_s[]
swival OpenTraces-compatible traces · pi-mono turn metadata
Structural list[float]
Wall-clock seconds per agent turn. Collect the full series.
✓ Directly from turn timestamps in session JSONL.
Session Infrastructure
compaction_events
user.compaction_events[]
pi-mono compaction summaries · OpenTraces session lifecycle
Claude list[obj]
Context window compaction events. Each event: {turn_index, tokens_before, tokens_after, timestamp}.
⊕ Compaction loses information. Relevant for understanding late-session drift and instruction adherence failure.
model_changes
user.model_changes[]
Claude list[obj]
Model switches mid-session: {from_model, to_model, turn_index}.
✓ Structured event in session JSONL. Zero inference needed.
tool_error_rate
user.tool_error_rate
hermes-agent-reasoning-traces tool_response · TRAIL execution errors
Claude float
Fraction of tool calls that returned an error result. Also collect as tool_errors_by_type[] per tool.
⊕ Some tool errors are expected (probing). High consecutive errors on the same tool/args = stuck.
tool_result_sizes
user.tool_result_sizes[]
Claude list[int]
Token length of each tool call result in order. Large results contribute disproportionately to context growth.
✓ Combined with context_growth_curve explains which tool calls caused context bloat.
Multi-Session Context
prior_sessions_same_repo
user.prior_sessions_same_repo
Structural int
Count of prior sessions on the same repository.
✓ Derivable from session metadata.
prior_sessions_same_region
user.prior_sessions_same_region
OpenTraces cross-session attribution · pi-mono session metadata
Derived int
Count of prior sessions that touched the same file+line-range as this session. A region touched in 5 separate sessions is a signal nothing else surfaces.
✓ Requires cross-session join on region_edit_history.
session_reopened_within
user.session_reopened_within_s
Structural float
Seconds until the same user opened a new session on the same repo after this one ended. Null if no new session within 24h.
⊕ Collect raw. Whether a quick reopen means the first session failed is analysis, not collection.
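A sketch deriving region_re_edit_count and region_edit_convergence from a raw region_edit_history list. The record shape follows the table above; the overlap rule (any line-range intersection within the same file) and the use of line-range span as a stand-in for per-edit diff size are assumptions.

from typing import Dict, List

def _overlaps(a: dict, b: dict) -> bool:
    return (a["file"] == b["file"]
            and a["start_line"] <= b["end_line"]
            and b["start_line"] <= a["end_line"])

def region_re_edit_signals(history: List[dict]) -> Dict:
    # Group edits into regions: each edit joins the first earlier region it overlaps.
    regions: List[List[dict]] = []
    for edit in history:                      # history is in turn order
        for region in regions:
            if any(_overlaps(edit, prior) for prior in region):
                region.append(edit)
                break
        else:
            regions.append([edit])

    re_edited = [r for r in regions if len(r) > 1]

    def trend(region: List[dict]) -> str:
        # Proxy: line-range span stands in for per-edit diff size.
        sizes = [e["end_line"] - e["start_line"] + 1 for e in region]
        if all(b <= a for a, b in zip(sizes, sizes[1:])):
            return "converging"
        if all(b >= a for a, b in zip(sizes, sizes[1:])):
            return "expanding"
        return "oscillating"

    return {
        "region_re_edit_count": len(re_edited),
        "region_edit_convergence": [trend(r) for r in re_edited],
    }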