AGENT SIGNALS
Feature Engineering Inventory
Raw traces → Structured attempts → Behavioral features → Outcome labels → Optimization
Two contexts, different failure modes. Eval signals are deterministic and test-suite-verified. Product signals are multi-turn and user-in-the-loop. This inventory separates them because signals that are always-null in one context create noise in the other.
Attempt-Span Schema — trial › attempt_span[]. Each attempt = one uninterrupted agent run within a trial
treatment
attempt1_tool_cap
retry_tool_cap
agent_type
model_id
dataset
concurrency
behavior
turn_count
tool_call_count
edit_calls
bash_mediated_edits
touched_files
diff_lines
tool_sequence[]
localization_mode
localized_correct_file
files_searched_before_edit
reasoning
has_reasoning_blocks
reasoning_tokens_total
plan_steps_detected
self_correction_count
phase_transitions[]
plannotator_phases[]
tokens
input_tokens_total
output_tokens_total
cache_read_tokens
cache_write_tokens
context_growth_curve[]
cost_usd
verification
reward
patch_outcome_label
fail_to_pass_count
pass_to_fail_count
pass_to_pass_count
fail_to_fail_count
patch_applies
infra
wall_time_s
timed_out
attempt_index
recovery_mode
final_failure_mode
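A minimal sketch of the attempt-span record as Python TypedDicts, mirroring the field groups listed above. Field names follow this inventory; the nesting and optionality are assumptions until the schema is actually persisted in result.json.

from typing import List, Optional, TypedDict

class Treatment(TypedDict):
    attempt1_tool_cap: int
    retry_tool_cap: int
    agent_type: str
    model_id: str
    dataset: str
    concurrency: int

class Behavior(TypedDict):
    turn_count: int
    tool_call_count: int
    edit_calls: int
    bash_mediated_edits: bool
    touched_files: List[str]
    diff_lines: int
    tool_sequence: List[str]                   # not yet extracted
    localization_mode: str
    localized_correct_file: Optional[bool]     # derived, needs gold patch
    files_searched_before_edit: Optional[int]

class Reasoning(TypedDict):
    has_reasoning_blocks: bool
    reasoning_tokens_total: int
    plan_steps_detected: int
    self_correction_count: int
    phase_transitions: List[str]
    plannotator_phases: List[dict]

class Tokens(TypedDict):
    input_tokens_total: int
    output_tokens_total: int
    cache_read_tokens: int
    cache_write_tokens: int
    context_growth_curve: List[int]
    cost_usd: float

class Verification(TypedDict):
    reward: float
    patch_outcome_label: str
    fail_to_pass_count: int
    pass_to_fail_count: int
    pass_to_pass_count: int
    fail_to_fail_count: int
    patch_applies: bool

class Infra(TypedDict):
    wall_time_s: float
    timed_out: bool
    attempt_index: int
    recovery_mode: Optional[str]
    final_failure_mode: Optional[str]

class AttemptSpan(TypedDict):
    treatment: Treatment
    behavior: Behavior
    reasoning: Reasoning
    tokens: Tokens
    verification: Verification
    infra: Infra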
Telemetry Gap — two distinct issues
1. JSONL not collected from Modal: telemetry.sessions: true is set, so the agent writes per-turn JSONL to /root/.subq/agent/sessions/ inside each container — but this directory is never collected as a Modal artifact. <think> tokens and per-turn token breakdown are unavailable from eval runs.

2. Plannotator phases ARE available now: plannotator-phase events emit to events/stdout in real time during eval runs. These are not blocked by the JSONL gap — they can be parsed from stdout.txt today. Do not conflate these two issues.
Fix JSONL gap — add to artifact config
{ "source": "/root/.subq/agent/sessions",
  "destination": "sessions",
  "type": "directory" }
Attempt-1 vs. retry performance must be tracked separately. Autoresearch operates on this split. Aggregate metrics across attempts mask whether improvements come from better first-pass performance or better recovery. All signals in this tab should be indexed by attempt_index. Treatment variables (attempt1ToolCap, retryToolCap) differ between attempts — without recording them per-attempt, fair attribution is impossible.
Treatment Variables — the knobs applied to this run
These are the independent variables — the treatment applied to each run. If you're comparing agent candidates, these must be recorded per-attempt or you can't do fair attribution. Currently these exist as harness config but are not persisted in result.json.
Missing entirely from the schema. Without treatment variables recorded per-attempt, you cannot distinguish "agent A is better than agent B" from "agent A had more tool calls allowed." This is the most critical gap for autoresearch comparisons.
Signal Source Type Status Description Why it matters
attempt1_tool_cap
treatment.attempt1_tool_cap
Harness int Not recorded
Maximum tool calls allowed for the first attempt. From harness config attempt1ToolCap. This is the primary knob for controlling agent exploration depth.
An agent that resolved a task with 50 tool calls allowed cannot be fairly compared to one with 20. Must be recorded.
retry_tool_cap
treatment.retry_tool_cap
Harness int Not recorded
Maximum tool calls allowed for retry attempts. Different from attempt1 — retries typically get more budget. From harness config retryToolCap.
Retry performance is meaningless without knowing the retry budget. Two agents with different retry caps cannot be compared.
agent_type
treatment.agent_type
Harness str Not recorded
Which agent scaffold was used: one-shot, subq, shotgun, mini-swe-agent, etc. From --agent CLI flag. Already in eval-history.jsonl but not in per-attempt result.json.
Must be in the attempt-span schema, not just the run-level log.
model_id
treatment.model_id
Harness str Not recorded
Exact model identifier used for this attempt. From AGENT_MODEL env or config. Include the full string, not just the provider — anthropic/claude-sonnet-4 vs. claude-sonnet-4-20250514 matters.
Without this, you can't separate model quality from scaffold quality.
dataset
treatment.dataset
Harness str Not recorded
Dataset and version: swebench-verified@1.0, swesmith@1.0, etc. Different datasets have different difficulty distributions.
Same agent, same model — different resolve rate on different datasets. Must be recorded for any cross-run comparison.
concurrency
treatment.concurrency
Harness int Not recorded
Number of parallel containers for this run. Affects inference server load, which affects wall_time_s. From --concurrency flag.
Explains variance in wall_time_s across runs with different concurrency levels.
timeout_s
treatment.timeout_s
Harness int Not recorded
Container timeout in seconds. Different timeout limits change the feasible set of tasks — Sphinx tasks need 50+ min.
A run with 30 min timeout will have a structurally lower resolve rate on Sphinx tasks than one with 60 min.
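A sketch of how the harness could persist these treatment variables into each attempt's result.json. The config key names (attempt1ToolCap, retryToolCap, --agent, --concurrency, AGENT_MODEL) come from the rows above; the merge function and the timeoutS key are assumptions, not existing harness code.

import json
from pathlib import Path

def record_treatment(result_path: Path, harness_config: dict, cli_args: dict) -> None:
    # Merge the treatment knobs into the per-attempt result.json so that
    # cross-run comparisons can condition on them.
    result = json.loads(result_path.read_text())
    result["treatment"] = {
        "attempt1_tool_cap": harness_config.get("attempt1ToolCap"),
        "retry_tool_cap": harness_config.get("retryToolCap"),
        "agent_type": cli_args.get("agent"),
        "model_id": cli_args.get("model") or harness_config.get("AGENT_MODEL"),
        "dataset": harness_config.get("dataset"),
        "concurrency": cli_args.get("concurrency"),
        "timeout_s": harness_config.get("timeoutS"),  # key name is an assumption
    }
    result_path.write_text(json.dumps(result, indent=2))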
Behavioral Trajectory — how the agent acts
Captures what the agent did: which tools it called, in what order, which files it touched, and how the controller classified its localization strategy. These signals come from stdout.txt [subq-eval] lines and result.json. Highest reliability — source-of-truth from the harness.

Eval-specific context: From 500 tasks at 46.7% resolved — dominant failure mode is wrong localization or incomplete patch (~44%), regressions are ~4%, infra failures ~6%. Localization quality is the single biggest lever for improvement.
Typical tool sequences observed in eval traces
Resolved
Read → Bash(grep) → Read → Reason → Edit → Run tests
bash-mediated edit
Read → Bash(grep) → Bash(sed -i) → Bash(python) — editCalls=0, touchedFiles≠∅ — invisible to edit counter
Failed / no patch
Read → Bash(grep) → Bash(grep) → Bash(grep) — many reads, no edits
Signal Source Type Status Description Validity / Pitfalls
turn_count
behavior.turn_count
Harness int Extracted
Number of agent turns before termination. Parsed from [subq-eval] Turns: N.
Poor proxy alone. Turn count conflates task difficulty with agent efficiency. A 3-turn success on a hard task ≠ a 3-turn failure on an easy one. Use conditioned on task difficulty.
tool_call_count
behavior.tool_call_count
Harness int Extracted
Total tool calls made across the attempt. From [subq-eval] Tool calls: N.
⊕ Counts bash and read equally. A 40-call success with 38 reads looks the same as 40 edits. Decompose by tool type for signal quality.
edit_calls
behavior.edit_calls
Harness int Extracted
Calls to the edit / str_replace_editor tool specifically. From [subq-eval] Edit calls: N.
✓ More targeted than total tool calls. But incomplete — does not capture bash-mediated writes (see bash_mediated_edits).
bash_mediated_edits
behavior.bash_mediated_edits
pi-mono · swival · OpenTraces
Harness bool Extracted
True when edit_calls = 0 but touched_files ≠ ∅. Agent modified files via bash (sed -i, heredocs) — invisible to the edit counter.
⊕ Common failure pattern in evaluated traces. These patches are structurally different (no diff capture at call time). Worth flagging as a separate behavioral cluster.
touched_files
behavior.touched_files[]
Harness list[str] Extracted
File paths modified during the attempt. From [subq-eval] Touched files: path (N diff lines). Includes bash-modified files.
✓ Reliable. Derived from diff against pre-attempt snapshot, not from tool call logging. Gold standard for "did the agent change anything."
diff_lines
behavior.diff_lines
Harness int Extracted
Total lines changed across all touched files. Sum of per-file diff line counts from stdout.
⊕ Large diffs aren't always better. Reformatting or whitespace changes inflate this. Correlates weakly with correctness alone.
tool_sequence
behavior.tool_sequence[]
hermes-agent-reasoning-traces · pi-mono · OpenTraces TAO loop
Harness list[str] Not extracted
Ordered list of tool names called. Enables bigram/trigram analysis: read→edit, bash→bash→bash (stuck loop), edit→bash(test) (verify-after-edit).
⊕ Requires parsing full stdout interleave, not just summary lines. High value for behavioral clustering. Not yet implemented.
localization_mode
behavior.localization_mode
Harness enum Extracted
Controller classification of the attempt's localization strategy: attempt1Mode, trialRecoveryMode, finalFailureMode. From LocalizationSignal on HarborResult.
✓ Structural signal from the harness controller, not the agent. Clean separation of concerns.
Localization Quality — wrong file = no chance (NEW)
localized_correct_file
behavior.localized_correct_file
Derived bool Not extracted
Did the agent's first edit target a file that appears in the gold patch? touched_files[0] ∈ gold_patch_files. Wrong file on first edit = localization failure, distinct from patch quality failure.
✓ Separates "found the right place but wrote the wrong fix" from "never found the right place." These have completely different root causes: the first is a reasoning failure, the second is a search/codebase-nav failure. Derivable from existing data + gold patch.
files_searched_before_edit
behavior.files_searched_before_edit
Harness int Not extracted
How many distinct files the agent read/grepped before making its first edit. Proxy for localization thoroughness.
⊕ Requires stdout parsing. Over-exploration (high reads, delayed edit) correlates with uncertainty and lower patch quality in SWE-bench settings. But 0 reads before edit = blind patching, also bad.
files_read_before_edit
behavior.files_read_before_edit
Harness int Not extracted
How many distinct files the agent read (Read tool only) before making its first edit. Narrower than files_searched_before_edit.
⊕ Requires stdout parsing.
edit_to_read_ratio
behavior.edit_to_read_ratio (derived)
Derived float Not extracted
Ratio of edit calls (including bash-mediated) to read calls. High ratio = decisive agent; low ratio = over-exploratory or stuck.
⊕ Needs tool_sequence first. Good discriminator between behavioral clusters once sequence data exists.
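Once tool_sequence is extracted, the derived features above reduce to one small pass over the sequence plus the gold-patch file list. A sketch; the tool-name sets and the call-count proxy for files_searched_before_edit are assumptions, and bash-mediated edits would need separate detection.

from collections import Counter
from typing import Dict, List

EDIT_TOOLS = {"edit", "str_replace_editor"}   # assumed harness tool names
READ_TOOLS = {"read", "grep", "glob"}         # assumed harness tool names

def behavioral_features(tool_sequence: List[str],
                        touched_files: List[str],
                        gold_patch_files: List[str]) -> Dict:
    bigrams = Counter(zip(tool_sequence, tool_sequence[1:]))
    first_edit_idx = next(
        (i for i, t in enumerate(tool_sequence) if t in EDIT_TOOLS),
        len(tool_sequence),
    )
    reads_before_edit = sum(
        1 for t in tool_sequence[:first_edit_idx] if t in READ_TOOLS
    )  # proxy: read/grep calls, not distinct files
    edits = sum(1 for t in tool_sequence if t in EDIT_TOOLS)
    reads = sum(1 for t in tool_sequence if t in READ_TOOLS)
    return {
        "top_bigrams": bigrams.most_common(5),            # e.g. ('bash', 'bash') stuck loops
        "files_searched_before_edit": reads_before_edit,
        "edit_to_read_ratio": edits / reads if reads else None,
        "localized_correct_file": bool(touched_files)
            and touched_files[0] in set(gold_patch_files),
    }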
Reasoning Signals — how the agent thinks
Signals derived from the agent's internal reasoning: extended thinking blocks, planning phases, self-corrections.

Availability split: Plannotator phases emit to stdout in real time and are available now. All other reasoning signals require Claude session JSONL (~/.subq/agent/sessions/*.jsonl), which is not collected from Modal eval runs.
Signal Source Type Status Description Validity / Pitfalls
Available now — from stdout events
plannotator_phases
reasoning.plannotator_phases[]
pi-mono custom_message events · hermes-agent-reasoning-traces
Stdout list[obj] Available
Plannotator event objects from stdout: {"event":"plannotator-phase","phase":"implement"}. Ground-truth phase timeline. Emits in real time — does NOT require JSONL collection.
✓ Structured, agent-emitted, no inference required. Best leading indicator of agent trajectory quality. Parse from stdout.txt today.
phase_transitions
reasoning.phase_transitions[]
Stdout list[str] Available
Sequence of detected phase labels: explore → localize → implement → verify. Derived from plannotator_phases. Already available from stdout.
✓ No parsing heuristics needed — purely structural once plannotator events are collected.
Blocked — requires session JSONL from Modal
has_reasoning_blocks
reasoning.has_reasoning_blocks
Claude bool Needs artifact
Whether the model emitted extended thinking blocks (<think> tokens) in any turn. In session JSONL: message events with type: "thinking" content blocks.
⊕ Binary presence is weak; length distribution is stronger.
reasoning_tokens_total
reasoning.reasoning_tokens_total
Claude int Needs artifact
Sum of output tokens attributed to thinking blocks across all turns. Requires summing content[].text.length for thinking blocks from JSONL.
⊕ Reasoning length has diminishing returns and can anti-correlate with performance on straightforward tasks (overthinking).
plan_steps_detected
reasoning.plan_steps_detected
Claude int Needs artifact
Heuristic count of numbered/bulleted plan steps in the first assistant turn's text. Proxy for whether the agent formed an explicit plan before acting.
⊕ Regex-based, fragile. Agents can reason implicitly without numbered steps. Use as soft signal, not hard feature.
self_correction_count
reasoning.self_correction_count
Claude int Needs artifact
Times the agent explicitly revisited or abandoned a prior approach (“actually…”, “wait, that won't work…”). Detectable from reasoning text in session JSONL.
Direction is ambiguous in coding eval context. Correction cycles often correlate with confusion and looping, not capability. In SWE-bench data, high self-correction on failed tasks = stuck agent revisiting the same wrong approach. Must be paired with reward before interpreting. Useful signal but not unconditionally positive.
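Of the signals above, only the plannotator stream is usable today. A sketch of recovering plannotator_phases and phase_transitions from stdout.txt, assuming each event is emitted as a single JSON object on its own line, as in the example above.

import json
from pathlib import Path
from typing import List, Tuple

def plannotator_phases(stdout_path: Path) -> Tuple[List[dict], List[str]]:
    events: List[dict] = []
    for line in stdout_path.read_text().splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue  # interleaved non-JSON stdout
        if obj.get("event") == "plannotator-phase":
            events.append(obj)
    transitions = [e.get("phase") for e in events]   # e.g. ["explore", "localize", ...]
    return events, transitions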
Verification Layer — what is objectively correct
Ground-truth outcome signals from the verifier. These are the most reliable signals in the inventory — computed against the task's test suite, not inferred from the agent trace. Source: verifier/report.json and result.json.
patch_outcome_label — replaces 4 boolean columns
correct_patch
reward = 1.0. All FAIL_TO_PASS tests now pass, no PASS_TO_FAIL regressions.
partial_patch
Some FAIL_TO_PASS tests pass but not all, OR FAIL_TO_PASS > 0 with PASS_TO_FAIL > 0. Meaningful progress, incomplete fix.
regressive_patch
reward = 0. PASS_TO_FAIL > 0. Agent broke previously-passing tests without fixing the target failure.
no_effect_patch
touched_files is non-empty but no test result changed. Patch was syntactically valid but semantically inert.
unverifiable
Patch did not apply cleanly, or verifier crashed/timed out. Cannot classify.
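The label is mechanical once the verifier counts are parsed. A sketch following the definitions above; the function signature and the handling of "no patch produced at all" are assumptions.

def patch_outcome_label(patch_applies: bool, verifier_ok: bool,
                        fail_to_pass: int, pass_to_fail: int,
                        fail_to_fail: int, touched_files: list) -> str:
    if not patch_applies or not verifier_ok:
        return "unverifiable"
    # fail_to_fail == 0 approximates "all FAIL_TO_PASS now pass"
    if fail_to_pass > 0 and fail_to_fail == 0 and pass_to_fail == 0:
        return "correct_patch"        # reward = 1.0
    if fail_to_pass > 0:
        return "partial_patch"        # progress, but incomplete or with regressions
    if pass_to_fail > 0:
        return "regressive_patch"     # broke passing tests, fixed nothing
    if touched_files:
        return "no_effect_patch"      # patch applied, no test result changed
    return "unverifiable"             # no patch produced at all (assumption)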
Signal Source Type Status Description Validity / Pitfalls
reward
verification.reward
TRAIL · swival · SWE-bench
Harness float Extracted
Binary reward: 1.0 = all FAIL_TO_PASS tests pass with no regressions. 0.0 otherwise. From result.json.
✓ Ground truth for autoresearch metric. But binary — misses partial progress. Use alongside patch_outcome_label for richer signal.
patch_outcome_label
verification.patch_outcome_label
Derived enum Not extracted
5-value enum derived from verifier test counts. See enum block above. Computed from verifier/report.json FAIL_TO_PASS / PASS_TO_FAIL counts.
✓ Discriminates partial progress and regressions that reward=0 collapses together. Critical for failure analysis and cluster labeling.
fail_to_pass_count
verification.fail_to_pass_count
TRAIL · SWE-bench verifier
Harness int Extracted
Tests that were failing before the patch and pass after. The primary "fixed" signal. From verifier/report.json → FAIL_TO_PASS.
✓ Gold signal. Task-level ground truth. Use as the numerator for partial fix rate.
pass_to_fail_count
verification.pass_to_fail_count
TRAIL · SWE-bench verifier
Harness int Extracted
Tests that were passing before and fail after — regressions introduced by the patch. From verifier/report.json → PASS_TO_FAIL.
✓ Any non-zero value is a quality signal. Regressions should be weighted heavily in any scoring function.
pass_to_pass_count
verification.pass_to_pass_count
Harness int Extracted
Tests that passed before and pass after — preserved behavior. From verifier/report.json → PASS_TO_PASS.
⊕ High counts expected on every run. Useful only as denominator for regression rate, not as a quality signal by itself.
fail_to_fail_count
verification.fail_to_fail_count
Harness int Extracted
Tests that failed before and still fail after — no progress on these. From verifier/report.json → FAIL_TO_FAIL.
⊕ High count means partial fix or complete miss. Good for diagnosing scope of remaining failures.
patch_applies
verification.patch_applies
Harness bool Extracted
Whether the generated patch applied cleanly to the codebase. False → unverifiable outcome. From result.json.
✓ Pre-condition for all other verification signals. Must gate the entire verification layer.
Cost & Infra — what it costs to get there
Token economics, cache utilization, wall time, and infrastructure-level signals. Harness-side signals (wall time, timeout) are already collected. Token-level signals require session JSONL from Modal containers.
Signal Source Type Status Description Validity / Pitfalls
wall_time_s
infra.wall_time_s
Harness float Extracted
Wall clock time for the attempt in seconds. From result.json → duration.
⊕ Affected by container cold start, inference server load, concurrency. Not pure agent time. Use as relative comparator within a run, not across runs.
timed_out
infra.timed_out
Harness bool Extracted
Whether the container hit its timeout limit before completing. Sphinx tasks routinely timeout at 50 min. From result.json.
⊕ Timeout rate is task-dependent (Sphinx >> Django on verified-mini). Must segment by task set before using as agent quality signal.
attempt_index
infra.attempt_index
Harness int Extracted
Which attempt within the trial this was (0-indexed). Combined with localization_mode, indicates whether recovery was triggered.
✓ Essential for decomposing trial-level signals into attempt-span signals. Required for the attempt-span schema to be meaningful.
input_tokens_total
tokens.input_tokens_total
Claude int Needs artifact
Total input tokens consumed across all turns. In session JSONL: sum of usage.input per message event.
⊕ Grows monotonically (context accumulation). Context growth curve is more informative than total.
cache_read_ratio
tokens.cache_read_ratio
Derived float Needs artifact
cacheRead / (input + cacheRead) per turn. From session JSONL usage.cacheRead. High ratio = efficient context reuse.
✓ Direct cost efficiency proxy. Cache reads are ~10x cheaper than input tokens on Claude. Low ratio on long attempts suggests cache invalidation or poor prompt structure.
cost_usd
tokens.cost_usd
Claude float Needs artifact
Estimated total USD cost of the attempt. From session JSONL usage.cost object. Aggregated across all turns.
⊕ Cost per resolved task (cost / reward) is the key metric for optimization. Raw cost without outcome is incomplete.
context_growth_curve
tokens.context_growth_curve[]
OpenTraces tokens schema · pi-mono session traces
Derived list[int] Needs artifact
Input token count at each turn. From session JSONL: [usage.input for each message event in order]. Shows context window growth trajectory.
⊕ Confirmed in local sessions: 9k → 45k input tokens over ~15 turns. Fast growth = agent accumulating large context or codebase dumps. Plateau = efficient summarization.
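Once the sessions/ directory is collected, the token-level signals above reduce to one pass over the session JSONL. A sketch assuming one JSON object per line with a usage block shaped like {"input": …, "cacheRead": …, "cost": …}, as described in the rows above; exact key names may differ.

import json
from pathlib import Path

def token_signals(session_jsonl: Path) -> dict:
    growth, cache_ratios, cost_usd = [], [], 0.0
    for line in session_jsonl.read_text().splitlines():
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue
        usage = event.get("usage")
        if not usage:
            continue
        inp = usage.get("input", 0)
        cache_read = usage.get("cacheRead", 0)
        growth.append(inp)                         # context_growth_curve
        if inp + cache_read:
            cache_ratios.append(cache_read / (inp + cache_read))
        cost = usage.get("cost", 0.0)
        cost_usd += cost.get("total", 0.0) if isinstance(cost, dict) else cost
    return {
        "input_tokens_total": sum(growth),
        "context_growth_curve": growth,
        "cache_read_ratio": (sum(cache_ratios) / len(cache_ratios)) if cache_ratios else None,
        "cost_usd": cost_usd,
    }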
Different context, different failure modes
These signals apply to multi-turn, user-in-the-loop sessions. Drift, assumption lock, and scope bleed are the dominant failure modes here — they barely register in deterministic one-shot SWE-bench evals. Signals in this tab will be always-null for eval runs and should not be collected there. Conversely, eval-tab signals like patch_outcome_label and fail_to_pass_count do not exist in product sessions (no test harness).

Stat caveat (arxiv 2602.07338): The “~30% performance drop from drift” finding is from multi-turn chat agents (Lost in Conversation). For coding evals the number is almost certainly wrong — from our own eval data (500 tasks, 46.7% resolved), wrong localization/incomplete patch dominates (~44%), regressions are ~4%, infra failures ~6%. Drift in the chat-agent sense barely registers in SWE-bench. The stat is valid only for this tab's context.
Reference Datasets
PatronusAI/TRAIL — 148 annotated traces, plan_optimality + instruction_adherence labels
lambda/hermes-agent-reasoning-traces — 14.7k samples, <think> blocks, 24 avg turns
badlogicgames/pi-mono — 627 sessions, tree-structured, branch summaries
jedisct1/agent-traces-swival — 10.7k rows, security audits, OpenTraces-compatible
OpenTraces — TAO loop schema v0.3, evidence tiers, git attribution
Agent Anticipation — did it understand what was actually needed
Did the agent understand the task from the first turn? Did it anticipate what the user actually needed, or did it make assumptions, drift scope, or misread intent? These signals are derived post-session — they require either a session JSONL artifact or an LLM-as-judge pass over the conversation. They answer a fundamentally different question than verification: not was the output correct, but did the agent understand what correct meant.

Key tension: user frustration ≠ agent failure. A clear prompt that the agent misread is agent failure. An underspecified prompt where the agent made a reasonable guess is attribution-unclear. These buckets must be separated before using frustration signals as training signal.
session_attribution — why did the session go wrong (or right)
correct_anticipation
Agent understood intent from turn 1. First plan matched eventual solution. Minimal course correction. User accepted without steering. This is taste — knowing what the user needed before they fully articulated it.
scope_conflation
Task boundaries dissolved mid-session. Adjacent concerns got folded in — agent started solving related but un-asked problems. Output expanded beyond the original request. Often looks like thoroughness but is actually drift.
agent_misunderstood
Prompt was clear; agent paraphrased incorrectly in first reasoning block or dove in the wrong direction. Detectable when: user correction arrives early, agent's first tool call goes to unrelated area, plan-to-solution divergence is high.
underspecified_prompt
Prompt was too vague for any agent to resolve without guessing. Agent's guess may have been reasonable. Frustration from user here is not a clean training signal — attributing it to agent failure contaminates the label. Detectable by: prompt length, absence of context, agent asking clarifying questions.
Signal Source Type Status Description Validity / Pitfalls
task_paraphrase_accuracy
anticipation.task_paraphrase_accuracy
LLM judge float Not extracted
Semantic similarity between the user's original request and the agent's restatement of the task in its first reasoning block or response. Low score = agent misread intent from the start.
⊕ Requires session JSONL + LLM judge. TRAIL dataset provides human-annotated instruction_adherence labels as a reference schema. First-turn restatement is the earliest detectable signal of understanding failure.
plan_optimality
anticipation.plan_optimality
TRAIL (human-annotated) · hermes-agent-reasoning-traces
LLM judge enum Not extracted
Was the initial plan the agent formed close to optimal for the task? Drawn from TRAIL's annotation schema. Values: optimal / suboptimal / wrong_direction. Compare first reasoning block plan to eventual solution path.
✓ TRAIL benchmark shows best models achieve only 11% accuracy detecting planning errors — this is genuinely hard. High-value signal precisely because it's hard to fake.
first_tool_precision
anticipation.first_tool_precision
Harness bool Not extracted
Did the agent's first substantive tool call go to a file that ended up in touched_files? True = agent localized correctly from the start.
✓ Derivable from existing harness data once tool_sequence is extracted. Clearest behavioral signal of genuine task understanding vs. exploratory guessing.
scope_drift
anticipation.scope_drift
Derived float Not extracted
Ratio of files touched to files minimally required. Score of 1.0 = minimal footprint. Score of 3.0 = agent touched 3x more than needed.
⊕ "Minimally required" needs LLM judge in live sessions. For eval runs: compare touched files to the gold patch's file set.
clarification_sought
anticipation.clarification_sought
Claude bool Needs artifact
Did the agent ask a clarifying question before acting? First assistant turn contains a question directed at the user rather than a plan or tool call.
⊕ Context-dependent. On a vague prompt it's correct behavior. On a clear prompt it signals the agent didn't read carefully.
assumption_count
anticipation.assumption_count
Claude int Needs artifact
Heuristic count of assumption-signaling phrases in the first reasoning block: "I'll assume", "assuming", "I think you mean", "probably wants", "likely refers to".
⊕ High assumption count on an underspecified prompt is normal. High on a detailed prompt is a red flag. Attribution depends on prompt quality score.
prompt_underspecification
anticipation.prompt_underspecification
Derived float Not extracted
How vague was the original prompt? Composite of: prompt token length, absence of specific file/function references, absence of expected outcome description. Gates attribution — a low score means frustration signals are unreliable as agent quality labels.
⊕ Critical gating signal. Without this, frustration from the user gets misattributed to agent failure.
conflation_turn
anticipation.conflation_turn
LLM judge int Not extracted
The turn index at which the agent's scope started expanding beyond the original request. null if scope stayed clean.
⊕ Early conflation (turn 2-3) = agent misread scope from the start. Late conflation (turn 10+) = task boundaries dissolved during execution. Different failure modes.
instruction_adherence
anticipation.instruction_adherence
TRAIL annotation schema · RECAP
LLM judge enum Not extracted
Did the agent follow the instruction as given? Values: full / partial / ignored / contradicted. Distinct from correctness.
✓ TRAIL provides human-annotated reference labels across 148 traces. Gemini-2.5-Pro achieves only 11% on trace debugging — instruction adherence scoring needs careful judge design.
goal_alignment_at_close
anticipation.goal_alignment_at_close
LLM judge float Not extracted
Semantic similarity between the user's original request and what the agent actually delivered at session end. Low score with high quality output = solved something, but not what was asked.
⊕ The "taste" signal. Agents can produce correct code that misses the product goal entirely. Senior engineers catch this; junior engineers don't.
Session Dynamics — how understanding evolved across turns
Anticipation catches failures at turn 0. Session dynamics catches everything that happens after — drift, locking, spiraling, scope bleed, goal shift. Research (arxiv 2505.02709, 2602.07338) shows these are distinct failure modes: an agent can start correctly and drift, or start wrong and compound.

Applicability: These signals are product-session signals. In deterministic one-shot evals (SWE-bench), drift/assumption_lock barely registers — from our eval data the dominant failure is wrong localization (~44%), not conversational drift. The “~30% performance drop” (arxiv 2602.07338) is from multi-turn chat agents and applies here, not to the eval tab.

Most of these signals are computable without an LLM judge — they require an embedding model and turn-level session JSONL only.
session_failure_mode — six distinct failure types (from RECAP + goal drift literature)
clean
No meaningful drift. Context pollution stayed low. Agent updated its model correctly as turns progressed. User accepted without heavy steering.
shifted_intent
User's goal actually changed mid-session — not agent failure. Detectable when: user embeds at turn N are semantically distant from turn 0 in a direction the agent couldn't have anticipated. Not a clean training signal for agent quality.
assumption_locked
Agent locked in wrong assumption early and stopped updating. From arxiv 2602.07338: models "maximize the most statistically probable intent" rather than tracking user corrections. User corrections arrive but agent keeps executing the original wrong interpretation.
scope_bleed
Agent started touching things it was never asked to touch. scope_drift > 2.0. Tool sequence goes wide. Conflation turn is identifiable. Often looks like thoroughness.
stuck_loop
Agent repeated the same tool calls without making progress. High repetition_rate. Same grep/read pattern 3–5 times. Context window growing but no new files touched.
multi_intent_confusion
Agent tried to address multiple tangled goals simultaneously. From RECAP: "multiple distinct goals presented together." Plan graph has high node count from turn 1. Scope is wide from the start, not from drift.
Computable without an LLM judge — tools and formulas
Context Pollution
CP_score
CP = 1 − cosine_sim(embed(turn_0), embed(turn_N))
CP > 0.45 = severe misalignment, re-anchor required. Early sharp rise = started wrong. Gradual rise = natural. From Kurtis Kemple / getmaxim.ai research.
sentence-transformers, numpy
Plan Edit Distance
plan_ged
GED(plan_turn_1, plan_turn_N)
Graph Edit Distance between extracted plan graphs. High GED = agent rewrote its approach entirely. From RECAP (arxiv 2509.04472).
networkx.graph_edit_distance
Correction Ratio
correction_ratio
corrections / total_user_turns
Count user turns containing: "no", "that's not", "wrong", "undo", "revert", "actually", "wait", "stop". Pure regex. No embedding needed.
regex on session JSONL user turns
Repetition Rate
repetition_rate
near_dup_calls / total_tool_calls
Tool calls with cosine_sim > 0.95 to a prior call = near-duplicate. High rate = stuck loop.
sentence-transformers on tool args
Intent Coherence
intent_coherence_curve[]
cosine_sim(embed(task), embed(agent_turn_N))
How aligned is each agent response with the original task? Sudden drop = agent drifted. Plateau at low value = locked on wrong thing.
sentence-transformers, per-turn
Msg Length Trend
user_msg_length_slope
linregress(turn_idx, user_msg_lengths)
Negative slope = user giving up. Positive slope = user re-explaining. Flat = healthy. No model needed — pure token counts.
scipy.stats.linregress
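A combined sketch of four of the formulas above (CP_score, correction_ratio, repetition_rate, user_msg_length_slope). The embedding model choice is an assumption; the keyword list and thresholds follow the cards.

import re
import numpy as np
from scipy.stats import linregress
from sentence_transformers import SentenceTransformer

_model = SentenceTransformer("all-MiniLM-L6-v2")   # assumed; any sentence embedder works

def _cos(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cp_curve(turns: list) -> list:
    # CP_N = 1 - cosine_sim(embed(turn_0), embed(turn_N)); > 0.45 = severe misalignment
    emb = _model.encode(turns)
    return [1.0 - _cos(emb[0], emb[i]) for i in range(len(turns))]

_CORRECTIONS = re.compile(
    r"\b(no|that'?s not|wrong|undo|revert|actually|wait|stop)\b", re.IGNORECASE)

def correction_ratio(user_turns: list) -> float:
    hits = sum(1 for t in user_turns if _CORRECTIONS.search(t))
    return hits / len(user_turns) if user_turns else 0.0

def repetition_rate(tool_call_args: list, threshold: float = 0.95) -> float:
    # Fraction of tool calls near-duplicating (cosine > threshold) any prior call.
    if len(tool_call_args) < 2:
        return 0.0
    emb = _model.encode(tool_call_args)
    dup = sum(
        any(_cos(emb[i], emb[j]) > threshold for j in range(i))
        for i in range(1, len(tool_call_args))
    )
    return dup / len(tool_call_args)

def user_msg_length_slope(user_msg_lengths: list) -> float:
    # Negative = user giving up; positive = re-explaining; near-zero = healthy.
    if len(user_msg_lengths) < 2:
        return 0.0
    return linregress(range(len(user_msg_lengths)), user_msg_lengths).slope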
Signal Source Type Status Description Validity / Research Backing
context_pollution_curve
dynamics.cp_curve[]
getmaxim.ai · arxiv 2602.07338 · pi-mono
Embedding list[float] Needs session JSONL
Per-turn CP = 1 − cosine_sim(turn_0_embed, turn_N_embed). Tracks semantic distance from the original task anchor across every turn.
✓ No LLM needed — any sentence-transformer works. CP > 0.45 defined as severe misalignment in production systems. Shape of the curve distinguishes "started wrong" from "drifted mid-session."
max_context_pollution
dynamics.max_cp
Embedding float Needs session JSONL
Peak context pollution score across all turns. > 0.45 = severe. > 0.2 = moderate.
✓ Scalar summary of the full curve. Useful for session-level bucketing.
drift_onset_turn
dynamics.drift_onset_turn
Embedding int Needs session JSONL
First turn where CP > 0.2. Turn 1–2 = agent misread from start. Turn 5–10 = mid-session drift. Turn > 15 = late drift, possibly scope bleed after partial success.
✓ Early vs. late onset separates "started wrong" from "drifted" — two different failure modes that demand different fixes.
correction_ratio
dynamics.correction_ratio
COLING 2025 frustration detection · pi-mono · hermes
Regex float Needs session JSONL
Fraction of user turns containing correction signals: "no", "that's not", "wrong", "undo", "revert", "actually", "wait", "stop". Pure regex — no model needed.
⊕ COLING 2025 confirms keyword-based approaches miss frustrated users who don't use explicit negations. Must supplement with message length trend and context pollution.
assumption_lock_turn
dynamics.assumption_lock_turn
arxiv 2602.07338 (Lost in Conversation) · hermes-agent-reasoning-traces
Embedding int Needs session JSONL
Turn where agent stopped updating its working model despite user corrections. Signature: correction_ratio rises but intent_coherence_curve flatlines.
~30% performance drop in multi-turn chat agents (arxiv 2602.07338) traced to this pattern. Caveat: this stat is from chat agents, not coding evals. In SWE-bench one-shot evals, assumption lock barely exists — the agent doesn't receive user corrections. This signal is valid for product sessions only.
gd_inaction_score
dynamics.gd_inaction
Derived float Needs session JSONL
From arxiv 2505.02709: measures agent's failure to abandon a wrong approach even after evidence it isn't working. Files/functions in the initial plan that were never touched divided by total initial plan items.
✓ Research paper tested on Claude 3.5 Sonnet — even top models show nonzero GD_inaction. Complementary to GD_actions.
gd_actions_score
dynamics.gd_actions
Derived float Needs session JSONL
From arxiv 2505.02709: fraction of agent actions directed at areas outside the original task scope. High = agent actively investing effort in the wrong direction.
⊕ Requires knowing the "correct" scope — in eval runs this comes from the gold patch. In live sessions needs LLM judge.
plan_graph_edit_distance
dynamics.plan_ged
Derived float Needs session JSONL
Graph Edit Distance between the plan from the first reasoning block and the plan at session close. From RECAP (arxiv 2509.04472). High GED = agent rewrote its approach entirely.
✓ No LLM judge needed — uses networkx GED + BERTScore for node label matching. Captures plan revision that correction_ratio misses.
repetition_rate
dynamics.repetition_rate
Embedding float Needs session JSONL
Fraction of tool calls with cosine_sim > 0.95 to a prior call in the same session. The "stuck loop" signal.
✓ Pure behavioral signal. Repetition is bad in all cases. Can use exact-match on (tool_name, args) as lighter proxy.
user_msg_length_slope
dynamics.user_msg_length_slope
Structural float Needs session JSONL
Linear regression slope over user message token lengths across turns. Negative = user giving up. Positive = user re-explaining. Near-zero = healthy. No model needed.
✓ Zero model dependency. Negative slope is the clearest signal of silent session abandonment, which correction_ratio cannot detect.
session_failure_mode
dynamics.session_failure_mode
Derived enum Needs all above
Composite classification: clean / shifted_intent / assumption_locked / scope_bleed / stuck_loop / multi_intent_confusion. Decision tree over CP curve shape, correction_ratio, repetition_rate, and drift_onset_turn.
⊕ This is the label that makes all other signals interpretable. The classification logic is deterministic once input signals are computed.
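A sketch of the deterministic decision tree over the computed inputs. The branch ordering and thresholds are assumptions consistent with the definitions in this tab, not a validated classifier; the boolean inputs (goal shift, coherence flatline, wide plan) are assumed to be precomputed upstream.

from typing import Optional

def session_failure_mode(max_cp: float,
                         drift_onset_turn: Optional[int],
                         correction_ratio: float,
                         repetition_rate: float,
                         scope_drift: Optional[float],
                         intent_coherence_flatlined: bool,
                         goal_shift_detected: bool,
                         wide_plan_from_turn1: bool) -> str:
    if goal_shift_detected:
        return "shifted_intent"          # user changed the goal; not agent failure
    if repetition_rate > 0.3:
        return "stuck_loop"              # same calls repeated without progress
    if correction_ratio > 0.3 and intent_coherence_flatlined:
        return "assumption_locked"       # corrections arrive, agent keeps original reading
    if wide_plan_from_turn1:
        return "multi_intent_confusion"  # scope wide from the start, not from drift
    if scope_drift is not None and scope_drift > 2.0:
        return "scope_bleed"             # touched far more than asked
    if max_cp <= 0.2:
        return "clean"
    # Residual drift with no other marker: early onset reads as misread-from-start,
    # late onset as scope bleed after partial success (assumed mapping).
    return "assumption_locked" if (drift_onset_turn or 99) <= 2 else "scope_bleed"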
User Session — raw signals from the human side of the conversation
Comprehensive inventory of signals observed across real session datasets — pi-mono, hermes-agent-reasoning-traces, swival, TRAIL, and the OpenTraces schema. Collect everything raw. Interpret nothing at collection time. Timestamps are timestamps. Message lengths are lengths. What they mean is analysis, not collection.
session_outcome — how the session ended (gating signal for all satisfaction inference)
accepted
User explicitly accepted output — applied a patch, ran the code, said "thanks", closed with a positive signal. Strongest quality label available from a live session.
restarted_same_task
New session opened on the same codebase/file within a short window after this one ended. Strong signal that the previous session did not fully deliver.
abandoned_mid_session
Session ended with no acceptance signal and no explicit rejection. User stopped engaging. Raw — do not label as failure without corroborating signals.
explicitly_rejected
User explicitly reverted all changes, said "undo everything", or dismissed the session. Clearest failure signal available from the user side.
Signal Source Type Description Notes
Outcome & Acceptance
session_outcome
user.session_outcome
OpenTraces outcome block · pi-mono session events
Structural enum
How the session ended. See enum above. Gating label — without this, no satisfaction signal is interpretable.
⊕ Hardest single signal to get right. "Abandoned" is ambiguous by definition. Collect the raw end-state and derive the label later.
output_applied
user.output_applied
OpenTraces git attribution · pi-mono
Structural bool
Did the user apply or commit the agent's output? Detectable from git state at session close. Strongest quality proxy without asking the user.
✓ Completely objective. Git commit after session = output was accepted.
changes_reverted
user.changes_reverted
Structural bool
Were agent-made changes reverted before session end? Detected via git diff between mid-session peak and session-close state.
✓ Objective. Complementary to output_applied.
Turn Structure
user_agent_turn_ratio
user.turn_ratio
hermes-agent-reasoning-traces · pi-mono · TRAIL
Structural float
User turns divided by agent turns. Ratio near 1.0 = back-and-forth. Ratio < 0.5 = agent dominated. Ratio > 1.5 = user steering.
✓ Pure count from session JSONL.
user_turn_count
user.user_turn_count
Structural int
Raw count of user turns. Baseline denominator for all ratio signals.
✓ Trivially extracted from session JSONL message roles.
user_msg_lengths
user.user_msg_lengths[]
Structural list[int]
Token length of each user message in order. Raw list — do not interpret at collection.
✓ Collect raw. Analysis can later derive slope, variance, drop-off patterns.
all_turn_timestamps
user.turn_timestamps[]
Structural list[str]
ISO timestamp for every turn (both user and agent), in order. Raw — collect everything.
⊕ Do not label gaps as "gave up" at collection time. Timestamps are facts.
Code Region Signals
region_edit_history
user.region_edit_history[]
OpenTraces attribution (line ranges) · pi-mono branch summaries
Structural list[obj]
Per-edit record: {file, start_line, end_line, turn_index, timestamp}. Foundation for all "same area" analysis.
✓ Most important missing signal. File-level is too coarse. Function/line-range is the right unit.
region_re_edit_count
user.region_re_edit_count
derived from OpenTraces line-range attribution
Derived int
Count of distinct code regions edited more than once. Derived from region_edit_history using overlap detection.
✓ Direct answer to "multiple edits over the same areas." Requires region_edit_history first.
region_edit_convergence
user.region_edit_convergence[]
derived from OpenTraces line-range attribution
Derived list[str]
For each multiply-edited region: was the diff size shrinking (converging) or growing/oscillating (diverging)? Values: converging / oscillating / expanding.
⊕ Converging re-edits are fine — iterative refinement. Oscillating is the failure signal. Without this distinction, re-edit count alone is misleading.
Agent Behavior Toward User
agent_hedging_rate
user.agent_hedging_rate
Derived float
Frequency of hedging phrases: "I think", "might be", "probably", "I'm not sure". Rate per 100 tokens.
✓ Pure regex. A rising trajectory is more informative than the average.
agent_hedging_curve
user.agent_hedging_curve[]
hermes-agent-reasoning-traces · pi-mono assistant messages
Derived list[float]
Hedging rate per agent turn in order. Collect raw — the trajectory shape is the signal.
✓ Collect raw list. Same philosophy as user_msg_lengths.
confirmation_requests
user.confirmation_requests
Structural int
Times the agent asked the user to confirm before proceeding. "Should I...", "Do you want me to...".
⊕ High on complex operations is correct behavior. High on trivial operations = agent not confident.
time_to_first_edit_s
user.time_to_first_edit_s
OpenTraces attempt-span schema · pi-mono timestamps
Structural float
Seconds from session start to first file modification. Raw number — fast is not always better.
✓ Derivable from timestamps. No interpretation baked in.
per_turn_agent_latency_s
user.per_turn_latency_s[]
swival OpenTraces-compatible traces · pi-mono turn metadata
Structural list[float]
Wall-clock seconds per agent turn. Collect the full series.
✓ Directly from turn timestamps in session JSONL.
Session Infrastructure
compaction_events
user.compaction_events[]
pi-mono compaction summaries · OpenTraces session lifecycle
Claude list[obj]
Context window compaction events. Each event: {turn_index, tokens_before, tokens_after, timestamp}.
⊕ Compaction loses information. Relevant for understanding late-session drift and instruction adherence failure.
model_changes
user.model_changes[]
Claude list[obj]
Model switches mid-session: {from_model, to_model, turn_index}.
✓ Structured event in session JSONL. Zero inference needed.
tool_error_rate
user.tool_error_rate
hermes-agent-reasoning-traces tool_response · TRAIL execution errors
Claude float
Fraction of tool calls that returned an error result. Also collect as tool_errors_by_type[] per tool.
⊕ Some tool errors are expected (probing). High consecutive errors on the same tool/args = stuck.
tool_result_sizes
user.tool_result_sizes[]
Claude list[int]
Token length of each tool call result in order. Large results contribute disproportionately to context growth.
✓ Combined with context_growth_curve explains which tool calls caused context bloat.
Multi-Session Context
prior_sessions_same_repo
user.prior_sessions_same_repo
Structural int
Count of prior sessions on the same repository.
✓ Derivable from session metadata.
prior_sessions_same_region
user.prior_sessions_same_region
OpenTraces cross-session attribution · pi-mono session metadata
Derived int
Count of prior sessions that touched the same file+line-range as this session. A region touched in 5 separate sessions is a signal nothing else surfaces.
✓ Requires cross-session join on region_edit_history.
session_reopened_within
user.session_reopened_within_s
Structural float
Seconds until the same user opened a new session on the same repo after this one ended. Null if no new session within 24h.
⊕ Collect raw. Whether a quick reopen means the first session failed is analysis, not collection.
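A sketch deriving region_re_edit_count and region_edit_convergence from a raw region_edit_history list. The record shape follows the table above; the overlap rule (any line-range intersection within the same file) and the use of line-range span as a stand-in for per-edit diff size are assumptions.

from typing import Dict, List

def _overlaps(a: dict, b: dict) -> bool:
    return (a["file"] == b["file"]
            and a["start_line"] <= b["end_line"]
            and b["start_line"] <= a["end_line"])

def region_re_edit_signals(history: List[dict]) -> Dict:
    # Group edits into regions: each edit joins the first earlier region it overlaps.
    regions: List[List[dict]] = []
    for edit in history:                      # history is in turn order
        for region in regions:
            if any(_overlaps(edit, prior) for prior in region):
                region.append(edit)
                break
        else:
            regions.append([edit])

    re_edited = [r for r in regions if len(r) > 1]

    def trend(region: List[dict]) -> str:
        # Proxy: line-range span stands in for per-edit diff size.
        sizes = [e["end_line"] - e["start_line"] + 1 for e in region]
        if all(b <= a for a, b in zip(sizes, sizes[1:])):
            return "converging"
        if all(b >= a for a, b in zip(sizes, sizes[1:])):
            return "expanding"
        return "oscillating"

    return {
        "region_re_edit_count": len(re_edited),
        "region_edit_convergence": [trend(r) for r in re_edited],
    }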