AGENT SIGNALS
Feature Engineering Inventory
Attempt-Span Schema — trial › attempt_span[]. Each attempt = one uninterrupted agent run within a trial.
behavior
turn_count
tool_call_count
edit_calls
bash_mediated_edits
touched_files
diff_lines
tool_sequence[]
localization_mode
reasoning
has_reasoning_blocks
reasoning_tokens_total
plan_steps_detected
self_correction_count
phase_transitions[]
plannotator_phases[]
tokens
input_tokens_total
output_tokens_total
cache_read_tokens
cache_write_tokens
context_growth_curve[]
cost_usd
verification
reward
patch_outcome_label
fail_to_pass_count
pass_to_fail_count
pass_to_pass_count
fail_to_fail_count
patch_applies
infra
wall_time_s
timed_out
attempt_index
recovery_mode
final_failure_mode
Telemetry Gap — Claude Session JSONL not collected from Modal
telemetry.sessions: true is set in eval runs, so the agent writes per-turn JSONL to /root/.subq/agent/sessions/ inside each container — but this directory is never collected as a Modal artifact. All Reasoning and Tokens signals are currently unavailable from eval runs. They are available locally (94 files in ~/.subq/agent/sessions/) but those are interactive sessions, not eval trials.
Fix — add to artifact config
{ "source": "/root/.subq/agent/sessions",
  "destination": "sessions",
  "type": "directory" }
Behavioral Trajectory — how the agent acts
Captures what the agent did: which tools it called, in what order, which files it touched, and how the controller classified its localization strategy. These signals come from stdout.txt [subq-eval] lines and result.json. Highest reliability — source-of-truth from the harness.
Typical tool sequences observed in eval traces
Resolved
Read → Bash(grep) → Read → Reason → Edit → Run tests
bash-mediated edit
Read → Bash(grep) → Bash(sed -i) → Bash(python) — editCalls=0, touchedFiles≠∅ — invisible to edit counter
Failed / no patch
Read → Bash(grep) → Bash(grep) → Bash(grep) — many reads, no edits
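The behavioral scalars below can be pulled from stdout with a few regexes. A minimal sketch, assuming the exact [subq-eval] line formats quoted in this inventory; parse_subq_stdout is a hypothetical helper, not part of the harness:

```python
import re

# Line formats as quoted in the signal descriptions below (assumption:
# the harness emits them verbatim, one per line).
SCALAR_PATTERNS = {
    "turn_count": re.compile(r"\[subq-eval\] Turns: (\d+)"),
    "tool_call_count": re.compile(r"\[subq-eval\] Tool calls: (\d+)"),
    "edit_calls": re.compile(r"\[subq-eval\] Edit calls: (\d+)"),
}
TOUCHED_RE = re.compile(r"\[subq-eval\] Touched files: (\S+) \((\d+) diff lines\)")

def parse_subq_stdout(text: str) -> dict:
    out = {"touched_files": [], "diff_lines": 0}
    for name, pat in SCALAR_PATTERNS.items():
        m = pat.search(text)
        out[name] = int(m.group(1)) if m else None
    for path, lines in TOUCHED_RE.findall(text):
        out["touched_files"].append(path)
        out["diff_lines"] += int(lines)
    # bash_mediated_edits: files changed but the edit tool was never called
    out["bash_mediated_edits"] = (out["edit_calls"] == 0
                                  and bool(out["touched_files"]))
    return out
```

This also derives bash_mediated_edits for free, since it is defined purely in terms of edit_calls and touched_files.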
Signal Source Type Status Description Validity / Pitfalls
turn_count
behavior.turn_count
Harness int Extracted
Number of agent turns before termination. Parsed from [subq-eval] Turns: N.
Poor proxy alone. Literature shows turn count conflates task difficulty with agent efficiency: 3 turns on a hard task mean something different from 3 turns on an easy one. Use it conditioned on task difficulty.
tool_call_count
behavior.tool_call_count
Harness int Extracted
Total tool calls made across the attempt. From [subq-eval] Tool calls: N.
⚠ Counts bash and read equally. A 40-call attempt that is 38 reads looks the same as one that is 40 edits. Decompose by tool type for signal quality.
edit_calls
behavior.edit_calls
Harness int Extracted
Calls to the edit / str_replace_editor tool specifically. From [subq-eval] Edit calls: N.
✓ More targeted than total tool calls. But incomplete — does not capture bash-mediated writes (see bash_mediated_edits).
bash_mediated_edits
behavior.bash_mediated_edits
pi-mono · swival · OpenTraces
Harness bool Extracted
True when edit_calls = 0 but touched_files ≠ ∅. Agent modified files via bash (sed -i, heredocs) — invisible to the edit counter.
⚠ Common failure pattern in evaluated traces. These patches are structurally different (no diff capture at call time). Worth flagging as a separate behavioral cluster.
touched_files
behavior.touched_files[]
Harness list[str] Extracted
File paths modified during the attempt. From [subq-eval] Touched files: path (N diff lines). Includes bash-modified files.
✓ Reliable. Derived from diff against pre-attempt snapshot, not from tool call logging. Gold standard for "did the agent change anything."
diff_lines
behavior.diff_lines
Harness int Extracted
Total lines changed across all touched files. Sum of per-file diff line counts from stdout.
⚠ Large diffs aren't always better. Reformatting or whitespace changes inflate this. Correlates weakly with correctness alone.
tool_sequence
behavior.tool_sequence[]
hermes-agent-reasoning-traces · pi-mono · OpenTraces TAO loop
Harness list[str] Not extracted
Ordered list of tool names called. Enables bigram/trigram analysis: read→edit, bash→bash→bash (stuck loop), edit→bash(test) (verify-after-edit).
⚠ Requires parsing full stdout interleave, not just summary lines. High value for behavioral clustering. Not yet implemented.
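Once tool_sequence is extracted, the bigram analysis is a few lines. A sketch with illustrative tool names; looks_stuck and its run-length threshold are assumptions, not harness code:

```python
from collections import Counter

def tool_bigrams(tool_sequence):
    # Counts adjacent pairs, e.g. ("read", "edit") or ("edit", "bash(test)").
    return Counter(zip(tool_sequence, tool_sequence[1:]))

def looks_stuck(tool_sequence, run_len=3):
    # A run of identical consecutive calls (bash->bash->bash) suggests a
    # stuck loop; run_len=3 is an illustrative threshold.
    run = 1
    for prev, cur in zip(tool_sequence, tool_sequence[1:]):
        run = run + 1 if cur == prev else 1
        if run >= run_len:
            return True
    return False
```

The bigram counter is enough for verify-after-edit detection (count of ("edit", "bash(test)") pairs) without any further parsing.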
files_read_before_edit
behavior.files_read_before_edit
Harness int Not extracted
How many distinct files the agent read before making its first edit. Proxy for localization thoroughness.
⚠ Requires stdout parsing. Literature suggests over-exploration (high reads, delayed edit) correlates with uncertainty and lower patch quality in SWE-bench settings.
localization_mode
behavior.localization_mode
Harness enum Extracted
Controller classification of the attempt's localization mode: attempt1Mode, trialRecoveryMode, finalFailureMode. From LocalizationSignal on HarborResult.
✓ Structural signal from the harness controller, not the agent. Clean separation of concerns.
edit_to_read_ratio
behavior.edit_to_read_ratio (derived)
Derived float Not extracted
Ratio of edit calls (including bash-mediated) to read calls. High ratio = decisive agent; low ratio = over-exploratory or stuck.
⚠ Needs tool_sequence first. Good discriminator between behavioral clusters once sequence data exists.
Reasoning Signals — how the agent thinks
Signals derived from the agent's internal reasoning: extended thinking blocks, planning phases, self-corrections. All require Claude session JSONL (~/.subq/agent/sessions/*.jsonl). Currently unavailable from Modal eval runs — fix: add session directory to artifact config.
Blocked: All signals in this layer require sessions/*.jsonl which is not collected from Modal containers. Local interactive sessions at ~/.subq/agent/sessions/ (94 files) confirm the format is correct and rich. Until artifact collection is added, these signals cannot be computed for eval trials.
Signal Source Type Status Description Validity / Pitfalls
has_reasoning_blocks
reasoning.has_reasoning_blocks
Claude bool Needs artifact
Whether the model emitted extended thinking blocks in any turn. In session JSONL: message events with type: "thinking" content blocks.
⚠ Models vary: some emit long think blocks on easy tasks, short on hard ones. Binary presence is weak; length distribution is stronger.
reasoning_tokens_total
reasoning.reasoning_tokens_total
Claude int Needs artifact
Sum of output tokens attributed to thinking blocks across all turns. Not reported separately in session JSONL — requires summing content[].text.length for thinking blocks.
⚠ Per literature: reasoning length has diminishing returns and can anti-correlate with performance on straightforward tasks (overthinking).
plan_steps_detected
reasoning.plan_steps_detected
Claude int Needs artifact
Heuristic count of numbered/bulleted plan steps in the first assistant turn's text. Proxy for whether the agent formed an explicit plan before acting.
⚠ Regex-based, fragile. Agents can reason implicitly without numbered steps. Use as soft signal, not hard feature.
self_correction_count
reasoning.self_correction_count
Claude int Needs artifact
Times the agent explicitly revisited or abandoned a prior approach ("actually…", "wait, that won't work…"). Detectable from reasoning text in session JSONL.
✓ High-value signal per process reward model literature. Agents that self-correct mid-attempt tend to recover from localization errors more successfully.
phase_transitions
reasoning.phase_transitions[]
Claude list[str] Needs artifact
Sequence of detected phase labels: explore → localize → implement → verify. From plannotator-phase custom events in session JSONL.
✓ Already emitted by the agent as structured events. Zero parsing heuristic needed — purely structural once JSONL is collected.
plannotator_phases
reasoning.plannotator_phases[]
pi-mono custom_message events · hermes-agent-reasoning-traces
Claude list[obj] Needs artifact
Full plannotator event objects from session JSONL: {"type":"custom_message","event":"plannotator-phase","phase":"implement"}. Ground-truth phase timeline.
✓ Structured, agent-emitted, no inference required. Best leading indicator of agent trajectory quality. Priority to unlock once artifact gap is fixed.
Verification Layer — what is objectively correct
Ground-truth outcome signals from the verifier. These are the most reliable signals in the inventory — computed against the task's test suite, not inferred from the agent trace. Source: verifier/report.json and result.json.
patch_outcome_label — replaces 4 boolean columns
correct_patch
reward = 1.0. All FAIL_TO_PASS tests now pass, no PASS_TO_FAIL regressions.
partial_patch
Some FAIL_TO_PASS tests pass but not all, OR FAIL_TO_PASS > 0 with PASS_TO_FAIL > 0. Meaningful progress, incomplete fix.
regressive_patch
reward = 0. PASS_TO_FAIL > 0. Agent broke previously-passing tests without fixing the target failure.
no_effect_patch
touched_files is non-empty but no test result changed. Patch was syntactically valid but semantically inert.
unverifiable
Patch did not apply cleanly, or verifier crashed/timed out. Cannot classify.
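The five labels above derive deterministically from the verifier counts. A sketch following the definitions as written; the function name and the handling of an empty patch (treated as unverifiable, since there is nothing to classify) are assumptions:

```python
def patch_outcome_label(patch_applies, verifier_ok, touched_files,
                        fail_to_pass, fail_to_fail, pass_to_fail):
    # Inputs mirror verifier/report.json counts plus result.json flags.
    if not patch_applies or not verifier_ok:
        return "unverifiable"
    if fail_to_pass > 0 and fail_to_fail == 0 and pass_to_fail == 0:
        return "correct_patch"       # all targets fixed, no regressions
    if fail_to_pass > 0:
        return "partial_patch"       # progress, but incomplete or regressive
    if pass_to_fail > 0:
        return "regressive_patch"    # broke tests without fixing the target
    if touched_files:
        return "no_effect_patch"     # syntactically valid, semantically inert
    return "unverifiable"            # assumption: empty patch is unclassifiable
```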
Signal Source Type Status Description Validity / Pitfalls
reward
verification.reward
TRAIL · swival · SWE-bench
Harness float Extracted
Binary reward: 1.0 = all FAIL_TO_PASS tests pass with no regressions. 0.0 otherwise. From result.json.
✓ Ground truth for autoresearch metric. But binary — misses partial progress. Use alongside patch_outcome_label for richer signal.
patch_outcome_label
verification.patch_outcome_label
Derived enum Not extracted
5-value enum derived from verifier test counts. See enum block above. Computed from verifier/report.json FAIL_TO_PASS / PASS_TO_FAIL counts.
✓ Discriminates partial progress and regressions that reward=0 collapses together. Critical for failure analysis and cluster labeling.
fail_to_pass_count
verification.fail_to_pass_count
TRAIL · SWE-bench verifier
Harness int Extracted
Tests that were failing before the patch and pass after. The primary "fixed" signal. From verifier/report.json → FAIL_TO_PASS.
✓ Gold signal. Task-level ground truth. Use as the numerator for partial fix rate.
pass_to_fail_count
verification.pass_to_fail_count
TRAIL · SWE-bench verifier
Harness int Extracted
Tests that were passing before and fail after — regressions introduced by the patch. From verifier/report.json → PASS_TO_FAIL.
✓ Any non-zero value is a quality signal. Regressions should be weighted heavily in any scoring function.
pass_to_pass_count
verification.pass_to_pass_count
Harness int Extracted
Tests that passed before and pass after — preserved behavior. From verifier/report.json → PASS_TO_PASS.
⚠ High counts expected on every run. Useful only as denominator for regression rate, not as a quality signal by itself.
fail_to_fail_count
verification.fail_to_fail_count
Harness int Extracted
Tests that failed before and still fail after — no progress on these. From verifier/report.json → FAIL_TO_FAIL.
⚠ High count means partial fix or complete miss. Good for diagnosing scope of remaining failures.
patch_applies
verification.patch_applies
Harness bool Extracted
Whether the generated patch applied cleanly to the codebase. False → unverifiable outcome. From result.json.
✓ Pre-condition for all other verification signals. Must gate the entire verification layer.
Cost & Infra — what it costs to get there
Token economics, cache utilization, wall time, and infrastructure-level signals. Harness-side signals (wall time, timeout) are already collected. Token-level signals require session JSONL from Modal containers.
Signal Source Type Status Description Validity / Pitfalls
wall_time_s
infra.wall_time_s
Harness float Extracted
Wall clock time for the attempt in seconds. From result.json → duration.
⚠ Affected by container cold start, inference server load, concurrency. Not pure agent time. Use as relative comparator within a run, not across runs.
timed_out
infra.timed_out
Harness bool Extracted
Whether the container hit its timeout limit before completing. Sphinx tasks routinely time out at 50 min. From result.json.
⚠ Timeout rate is task-dependent (Sphinx >> Django on verified-mini). Must segment by task set before using as agent quality signal.
attempt_index
infra.attempt_index
Harness int Extracted
Which attempt within the trial this was (0-indexed). Combined with localization_mode, indicates whether recovery was triggered.
✓ Essential for decomposing trial-level signals into attempt-span signals. Required for the attempt-span schema to be meaningful.
input_tokens_total
tokens.input_tokens_total
Claude int Needs artifact
Total input tokens consumed across all turns. In session JSONL: sum of usage.input per message event.
⚠ Grows monotonically (context accumulation). Turn 1 might be 15k tokens, turn 15 might be 45k. Context growth curve is more informative than total.
cache_read_ratio
tokens.cache_read_ratio
Derived float Needs artifact
cacheRead / (input + cacheRead) per turn. From session JSONL usage.cacheRead. High ratio = efficient context reuse.
✓ Direct cost efficiency proxy. Cache reads are ~10x cheaper than input tokens on Claude. Low ratio on long attempts suggests cache invalidation or poor prompt structure.
cost_usd
tokens.cost_usd
Claude float Needs artifact
Estimated total USD cost of the attempt. From session JSONL usage.cost object. Aggregated across all turns.
⚠ Cost per resolved task (cost / reward) is the key metric for optimization. Raw cost without outcome is incomplete.
context_growth_curve
tokens.context_growth_curve[]
OpenTraces tokens schema · pi-mono session traces
Derived list[int] Needs artifact
Input token count at each turn. From session JSONL: [usage.input for each message event in order]. Shows context window growth trajectory.
⚠ Confirmed in local sessions: 9k → 45k input tokens over ~15 turns. Fast growth = agent accumulating large context or codebase dumps. Plateau = efficient summarization.
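Once sessions/*.jsonl is collected, the token signals reduce to a fold over message events. A sketch assuming the usage.input and usage.cacheRead field names cited in this section; note it computes a session-level cache_read_ratio rather than the per-turn version defined above:

```python
import json

def token_signals(jsonl_lines):
    # jsonl_lines: iterable of raw JSONL lines from one session file.
    growth, cache_read = [], 0
    for line in jsonl_lines:
        event = json.loads(line)
        usage = event.get("usage")
        if not usage:
            continue  # tool events etc. carry no usage block
        growth.append(usage.get("input", 0))
        cache_read += usage.get("cacheRead", 0)
    total_in = sum(growth)
    denom = total_in + cache_read
    return {
        "input_tokens_total": total_in,
        "context_growth_curve": growth,   # input tokens per turn, in order
        "cache_read_ratio": cache_read / denom if denom else 0.0,
    }
```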
Agent Anticipation — did it understand what was actually needed
Did the agent understand the task from the first turn? Did it anticipate what the user actually needed, or did it make assumptions, drift scope, or misread intent? These signals are derived post-session — they require either a session JSONL artifact or an LLM-as-judge pass over the conversation. They answer a fundamentally different question than verification: not was the output correct, but did the agent understand what correct meant.

Key tension: user frustration ≠ agent failure. A clear prompt that the agent misread is agent failure. An underspecified prompt where the agent made a reasonable guess is attribution-unclear. These buckets must be separated before using frustration signals as training signal.
Reference Datasets
PatronusAI/TRAIL — 148 annotated traces, plan_optimality + instruction_adherence labels
lambda/hermes-agent-reasoning-traces — 14.7k samples, <think> blocks, 24 avg turns
badlogicgames/pi-mono — 627 sessions, tree-structured, branch summaries
jedisct1/agent-traces-swival — 10.7k rows, security audits, OpenTraces-compatible
OpenTraces — TAO loop schema v0.3, evidence tiers, git attribution
session_attribution — why did the session go wrong (or right)
correct_anticipation
Agent understood intent from turn 1. First plan matched eventual solution. Minimal course correction. User accepted without steering. This is taste — knowing what the user needed before they fully articulated it.
scope_conflation
Task boundaries dissolved mid-session. Adjacent concerns got folded in — agent started solving related but un-asked problems. Output expanded beyond the original request. Often looks like thoroughness but is actually drift.
agent_misunderstood
Prompt was clear; agent paraphrased incorrectly in first reasoning block or dove in the wrong direction. Detectable when: user correction arrives early, agent's first tool call goes to unrelated area, plan-to-solution divergence is high.
underspecified_prompt
Prompt was too vague for any agent to resolve without guessing. Agent's guess may have been reasonable. Frustration from user here is not a clean training signal — attributing it to agent failure contaminates the label. Detectable by: prompt length, absence of context, agent asking clarifying questions.
Signal Source Type Status Description Validity / Pitfalls
task_paraphrase_accuracy
anticipation.task_paraphrase_accuracy
LLM judge float Not extracted
Semantic similarity between the user's original request and the agent's restatement of the task in its first reasoning block or response. Low score = agent misread intent from the start.
⚠ Requires session JSONL + LLM judge. TRAIL dataset provides human-annotated instruction_adherence labels as a reference schema. First-turn restatement is the earliest detectable signal of understanding failure.
plan_optimality
anticipation.plan_optimality
TRAIL (human-annotated) · hermes-agent-reasoning-traces
LLM judge enum Not extracted
Was the initial plan the agent formed close to optimal for the task? Drawn from TRAIL's annotation schema. Values: optimal / suboptimal / wrong_direction. Compare first reasoning block plan to eventual solution path.
✓ TRAIL benchmark shows best models achieve only 11% accuracy detecting planning errors — this is genuinely hard. High-value signal precisely because it's hard to fake. Hermes traces (14.7k samples) provide training reference for what optimal plans look like across task categories.
first_tool_precision
anticipation.first_tool_precision
Harness bool Not extracted
Did the agent's first substantive tool call (first Read or Bash) go to a file that ended up in touched_files? Direct read → was it the right file? True = agent localized correctly from the start, no wide exploration needed.
✓ Derivable from existing harness data once tool_sequence is extracted. High precision on first call is the clearest behavioral signal of genuine task understanding vs. exploratory guessing.
scope_drift
anticipation.scope_drift
Derived float Not extracted
Ratio of files touched to files minimally required to solve the task. Score of 1.0 = minimal footprint. Score of 3.0 = agent touched 3x more than needed. Proxy for whether the agent understood the task boundaries.
⚠ "Minimally required" is only knowable in hindsight (from the gold solution or verifier). For eval runs: compare touched files to the gold patch's file set. For live sessions: requires LLM judge to estimate minimal scope.
clarification_sought
anticipation.clarification_sought
Claude bool Needs artifact
Did the agent ask a clarifying question before acting? Detectable from session JSONL: first assistant turn contains a question directed at the user rather than a plan or tool call. Correlates with prompt underspecification.
⚠ Asking for clarification is not always good or bad — depends on task type. On a vague prompt it's correct behavior. On a clear prompt it signals the agent didn't read carefully. Must be interpreted alongside task_paraphrase_accuracy.
assumption_count
anticipation.assumption_count
Claude int Needs artifact
Heuristic count of assumption-signaling phrases in the first reasoning block: "I'll assume", "assuming", "I think you mean", "probably wants", "likely refers to". From session JSONL thinking blocks.
⚠ High assumption count on an underspecified prompt is normal and fine. High assumption count on a detailed, specific prompt is a red flag — agent is not reading carefully. Attribution depends on prompt quality score.
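The heuristic is a single alternation regex. A sketch using exactly the phrase list from the description; the lexicon is illustrative, not exhaustive:

```python
import re

# Phrases taken verbatim from the assumption_count description above.
ASSUMPTION_PHRASES = [r"I'll assume", r"\bassuming\b", r"I think you mean",
                      r"probably wants", r"likely refers to"]
ASSUMPTION_RE = re.compile("|".join(ASSUMPTION_PHRASES), re.IGNORECASE)

def assumption_count(thinking_text: str) -> int:
    # Count assumption-signaling phrases in a thinking block.
    return len(ASSUMPTION_RE.findall(thinking_text))
```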
prompt_underspecification
anticipation.prompt_underspecification
Derived float Not extracted
How vague was the original prompt? Composite of: prompt token length (short = vague), absence of specific file/function references, absence of expected outcome description, ambiguous pronoun count. Gates attribution — a low score means frustration signals are unreliable as agent quality labels.
⚠ Critical gating signal. Without this, frustration from the user gets misattributed to agent failure. Ramp reportedly doubled token spend with no visible product gains — part of that is agents faithfully executing underspecified tasks very thoroughly.
conflation_turn
anticipation.conflation_turn
LLM judge int Not extracted
The turn index at which the agent's scope started expanding beyond the original request — when it began solving adjacent problems, refactoring unrelated code, or introducing unrequested changes. null if scope stayed clean.
⚠ Early conflation (turn 2-3) = agent misread scope from the start. Late conflation (turn 10+) = task boundaries dissolved during execution, often after partial success revealed adjacent issues. These are different failure modes.
instruction_adherence
anticipation.instruction_adherence
TRAIL annotation schema · RECAP
LLM judge enum Not extracted
Did the agent follow the instruction as given? From TRAIL's annotation schema. Values: full / partial / ignored / contradicted. Distinct from correctness — the agent can follow instructions and still produce wrong output.
✓ TRAIL provides human-annotated reference labels across 148 traces. Use as benchmark for LLM judge calibration. Gemini-2.5-Pro achieves only 11% on trace debugging — instruction adherence scoring needs careful judge design.
goal_alignment_at_close
anticipation.goal_alignment_at_close
LLM judge float Not extracted
Semantic similarity between the user's original request and what the agent actually delivered at session end. Measures whether the final output addresses what was asked, not just whether it's technically correct. Low score with reward=1 = solved the benchmark but missed the point.
⚠ This is the "taste" signal from the k-shape productivity post. Agents can produce correct code that misses the product goal entirely. Senior engineers catch this; junior engineers don't. Score separates CRUD velocity from genuine task completion.
Session Dynamics — how understanding evolved across turns
Anticipation catches failures at turn 0. Session dynamics catches everything that happens after — drift, locking, spiraling, scope bleed, goal shift. Research (arxiv 2505.02709, 2602.07338) shows these are distinct failure modes: an agent can start correctly and drift, or start wrong and compound. ~30% performance drop in multi-turn settings comes not from capability gaps but from agents locking early assumptions and failing to update (Lost in Conversation, arxiv 2602.07338). Most of these signals are computable without an LLM judge — they require an embedding model and turn-level session JSONL only.
session_failure_mode — six distinct failure types (from RECAP + goal drift literature)
clean
No meaningful drift. Context pollution stayed low. Agent updated its model correctly as turns progressed. User accepted without heavy steering.
shifted_intent
User's goal actually changed mid-session — not agent failure. Detectable when: user embeds at turn N are semantically distant from turn 0 in a direction the agent couldn't have anticipated. Not a clean training signal for agent quality.
assumption_locked
Agent locked in wrong assumption early and stopped updating. From arxiv 2602.07338: models "maximize the most statistically probable intent" rather than tracking user corrections. User corrections arrive but agent keeps executing the original wrong interpretation. High correction_ratio + flat context pollution = locked.
scope_bleed
Agent started touching things it was never asked to touch. scope_drift > 2.0. Tool sequence goes wide. Conflation turn is identifiable. Often looks like thoroughness. This is the "backlogs packed with CRUD work" failure from the k-shape post — agent busy, not useful.
stuck_loop
Agent repeated the same tool calls without making progress. High repetition_rate. Same grep/read pattern 3–5 times. Context window growing but no new files touched. Usually precedes timeout or agent giving up. GD_inaction score high — agent unable to abandon its broken approach.
multi_intent_confusion
Agent tried to address multiple tangled goals simultaneously. From RECAP: "multiple distinct goals presented together." Plan graph has high node count from turn 1. Scope is wide from the start, not from drift. Distinct from scope_bleed — it was always wide, not expanding.
Computable without an LLM judge — tools and formulas (no LLM required)
Context Pollution
CP_score
CP = 1 − cosine_sim(embed(turn_0), embed(turn_N))
CP > 0.45 = severe misalignment, re-anchor required. Early sharp rise = started wrong. Gradual rise = natural. From Kurtis Kemple / getmaxim.ai research.
sentence-transformers, numpy
Plan Edit Distance
plan_ged
GED(plan_turn_1, plan_turn_N)
Graph Edit Distance between extracted plan graphs. High GED = agent rewrote its approach entirely. From RECAP (arxiv 2509.04472). Node/edge delta as lightweight proxy.
networkx.graph_edit_distance
Correction Ratio
correction_ratio
corrections / total_user_turns
Count user turns containing: "no", "that's not", "wrong", "undo", "revert", "actually", "wait", "stop". Pure regex. No embedding needed.
regex on session JSONL user turns
Repetition Rate
repetition_rate
near_dup_calls / total_tool_calls
Tool calls with cosine_sim > 0.95 to a prior call = near-duplicate. High rate = stuck loop. Requires embedding tool call arguments.
sentence-transformers on tool args
Intent Coherence
intent_coherence_curve[]
cosine_sim(embed(task), embed(agent_turn_N))
How aligned is each agent response with the original task? Sudden drop = agent drifted. Plateau at low value = locked on wrong thing. From getmaxim.ai context pollution methodology.
sentence-transformers, per-turn
Msg Length Trend
user_msg_length_slope
linregress(turn_idx, user_msg_lengths)
Negative slope = user giving up, messages shortening. Positive slope = user re-explaining (agent not getting it). Flat = healthy back-and-forth. No model needed — pure token counts.
scipy.stats.linregress
Signal Source Type Status Description Validity / Research Backing
context_pollution_curve
dynamics.cp_curve[]
getmaxim.ai · arxiv 2602.07338 · pi-mono
Embedding list[float] Needs session JSONL
Per-turn CP = 1 − cosine_sim(turn_0_embed, turn_N_embed). Tracks semantic distance from the original task anchor across every turn. Sharp early rise = started wrong. Gradual then sudden jump = drift event mid-session.
✓ No LLM needed — any sentence-transformer works. CP > 0.45 defined as severe misalignment in production systems. Shape of the curve distinguishes "started wrong" from "drifted mid-session."
max_context_pollution
dynamics.max_cp
Embedding float Needs session JSONL
Peak context pollution score across all turns. Single scalar summary of how far the session drifted from the original task at its worst point. > 0.45 = severe. > 0.2 = moderate.
✓ Scalar summary of the full curve. Useful for session-level bucketing and cluster labeling without reading the full turn sequence.
drift_onset_turn
dynamics.drift_onset_turn
Embedding int Needs session JSONL
First turn where CP > 0.2 (moderate drift threshold). Turn 1–2 = agent misread from start. Turn 5–10 = mid-session drift. Turn > 15 = late drift, possibly scope bleed after partial success.
✓ Early vs. late onset separates "started wrong" from "drifted" — two completely different failure modes that demand different fixes.
correction_ratio
dynamics.correction_ratio
COLING 2025 frustration detection · pi-mono · hermes
Regex float Needs session JSONL
Fraction of user turns containing correction signals: "no", "that's not", "wrong", "undo", "revert", "actually", "wait", "stop", "that's incorrect". Pure regex — no model needed. High ratio = user spent most of the session steering the agent.
⚠ COLING 2025 research confirms keyword-based approaches miss frustrated users who don't use explicit negations. Must be supplemented with message length trend and context pollution for full picture. Treats all corrections as equivalent — severity not captured.
assumption_lock_turn
dynamics.assumption_lock_turn
arxiv 2602.07338 (Lost in Conversation) · hermes-agent-reasoning-traces
Embedding int Needs session JSONL
Turn where agent stopped updating its working model despite user corrections. Signature: correction_ratio rises but intent_coherence_curve flatlines. From arxiv 2602.07338 — "models lock in early assumptions and stop incorporating new user information."
~30% performance drop in multi-turn traced to this exact pattern. Models revert to "average user" statistical prior rather than tracking individual corrections. Flat coherence curve despite high correction_ratio is the clearest diagnostic.
gd_inaction_score
dynamics.gd_inaction
Derived float Needs session JSONL
From arxiv 2505.02709: measures agent's failure to abandon a wrong approach even after evidence it isn't working. Operationalized as: files/functions in the initial plan that were never touched divided by total initial plan items. High = agent kept promising things it never executed.
✓ Research paper tested on Claude 3.5 Sonnet — even top models show nonzero GD_inaction. Complementary to GD_actions: an agent can drift by commission (doing wrong things) or omission (failing to do the right things).
gd_actions_score
dynamics.gd_actions
Derived float Needs session JSONL
From arxiv 2505.02709: fraction of agent actions (tool calls, file edits) directed at areas outside the original task scope. High = agent actively investing effort in the wrong direction — not just drifting but actively executing a wrong plan.
⚠ Requires knowing the "correct" scope — in eval runs this comes from the gold patch. In live sessions needs LLM judge to estimate minimal required scope. Use scope_drift as a simpler proxy when gold solution unavailable.
plan_graph_edit_distance
dynamics.plan_ged
Derived float Needs session JSONL
Graph Edit Distance between the plan extracted from the agent's first reasoning block and the plan at session close. From RECAP (arxiv 2509.04472). High GED = agent rewrote its approach entirely. Low GED = stayed on plan (good or bad depending on whether the plan was right).
✓ No LLM judge needed — uses networkx GED + BERTScore for node label matching. RECAP provides reference implementation. Structurally captures plan revision that correction_ratio misses entirely (agent can rewrite plan silently without the user noticing).
repetition_rate
dynamics.repetition_rate
Embedding float Needs session JSONL
Fraction of tool calls with cosine_sim > 0.95 to a prior call in the same session. The "stuck loop" signal. An agent running the same grep 4 times or reading the same file 3 times is stuck, not exploring.
✓ Pure behavioral signal — no ground truth needed. Repetition is bad in all cases. Can be computed without embeddings using exact-match on (tool_name, args) pairs as a lighter proxy.
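The exact-match proxy mentioned above, sketched: (tool_name, args) pairs seen more than once count as near-duplicates. The function shape is an assumption:

```python
from collections import Counter

def repetition_rate(tool_calls):
    # tool_calls: list of (tool_name, args_str) tuples from one session.
    if not tool_calls:
        return 0.0
    counts = Counter(tool_calls)
    # Every occurrence beyond the first of an identical call is a duplicate.
    near_dups = sum(c - 1 for c in counts.values())
    return near_dups / len(tool_calls)
```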
user_msg_length_slope
dynamics.user_msg_length_slope
Structural float Needs session JSONL
Linear regression slope over user message token lengths across turns. Negative = user giving up, messages collapsing. Positive = user re-explaining at increasing length (agent not getting it). Near-zero = healthy back-and-forth. No model needed.
✓ Zero model dependency — pure token counts from session JSONL. Negative slope is the clearest behavioral signal of silent session abandonment, which correction_ratio cannot detect (users stop correcting and just give up).
session_failure_mode
dynamics.session_failure_mode
Derived enum Needs all above
Composite classification derived from the signals above. Values: clean / shifted_intent / assumption_locked / scope_bleed / stuck_loop / multi_intent_confusion. Decision tree over CP curve shape, correction_ratio, repetition_rate, and drift_onset_turn.
⚠ This is the label that makes all other signals interpretable. Without it, high correction_ratio could be healthy collaboration or a broken session. The classification logic is deterministic — no LLM judge needed once input signals are computed.
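A sketch of that deterministic decision tree. The 0.45 CP and 2.0 scope_drift thresholds come from this section; the 0.3 cutoffs for repetition_rate and correction_ratio, the branch order, and the boolean inputs (intent_shift_detected, coherence_flat, wide_from_turn_1) are illustrative assumptions:

```python
def classify_session(signals):
    # signals: dict of the precomputed dynamics signals described above.
    if signals["repetition_rate"] > 0.3:
        return "stuck_loop"            # same calls repeated, no progress
    if signals["intent_shift_detected"]:
        return "shifted_intent"        # user's goal moved, not agent failure
    if signals["correction_ratio"] > 0.3 and signals["coherence_flat"]:
        return "assumption_locked"     # corrections arrive, agent not updating
    if signals["wide_from_turn_1"]:
        return "multi_intent_confusion"  # wide from the start, not from drift
    if signals["scope_drift"] > 2.0 or signals["max_cp"] > 0.45:
        return "scope_bleed"
    return "clean"
```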
User Session — raw signals from the human side of the conversation
Comprehensive inventory of signals observed across real session datasets — pi-mono, hermes-agent-reasoning-traces, swival, TRAIL, and the OpenTraces schema. Collect everything raw. Interpret nothing at collection time. Timestamps are timestamps. Message lengths are lengths. What they mean is analysis, not collection. Signals marked ★ High value have the strongest signal-to-noise backing from research and dataset observations.
session_outcome — how the session ended (gating signal for all satisfaction inference)
accepted
User explicitly accepted output — applied a patch, ran the code, said "thanks", closed with a positive signal. Strongest quality label available from a live session.
restarted_same_task
New session opened on the same codebase/file within a short window after this one ended. Strong signal that the previous session did not fully deliver. Not necessarily frustration; it might be scope continuation.
abandoned_mid_session
Session ended with no acceptance signal and no explicit rejection. User stopped engaging. They may have given up, gotten what they needed without saying so, or switched to a different approach entirely. Raw — do not label as failure without corroborating signals.
explicitly_rejected
User explicitly reverted all changes, said "undo everything", or dismissed the session. Clearest failure signal available from the user side.
Signal Source Type Description Notes
Outcome & Acceptance
session_outcome
user.session_outcome
OpenTraces outcome block · pi-mono session events
Structural enum
How the session ended. See enum above. Gating label — without this, no satisfaction signal is interpretable. Derived from the last user message content, whether changes were committed, and whether a new session started on the same files.
⚠ Hardest single signal to get right. "Abandoned" is ambiguous by definition. Collect the raw end-state and derive the label later.
output_applied
user.output_applied
OpenTraces git attribution · pi-mono
Structural bool
Did the user apply or commit the agent's output? Detectable from git state at session close — did touched_files end up in a commit? Strongest quality proxy available without asking the user anything.
✓ Completely objective. No interpretation needed. Git commit after session = output was accepted. Already partially available via OpenTraces schema's git attribution layer.
changes_reverted
user.changes_reverted
Structural bool
Were agent-made changes reverted before session end? Detected via git diff between mid-session peak and session-close state. True = user undid the work.
✓ Objective. Complementary to output_applied — together they produce a clean accept/reject signal.
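Taken together, the two booleans fold into one accept/reject label. A sketch, assuming the git facts (touched files, committed files, per-file diff sizes at mid-session peak and at session close) have already been gathered; the `mixed`/`unknown` fallback labels are assumptions added here for partially reverted sessions:

```python
def accept_reject_label(touched_files, committed_files, peak_diff, close_diff):
    """touched_files / committed_files: sets of file paths.
    peak_diff / close_diff: {path: diff line count} at mid-session
    peak and at session close."""
    output_applied = bool(touched_files & committed_files)
    # Reverted: a touched file whose diff vanished by session close.
    changes_reverted = any(
        peak_diff.get(f, 0) > 0 and close_diff.get(f, 0) == 0
        for f in touched_files
    )
    if output_applied and changes_reverted:
        return "mixed"
    if output_applied:
        return "accepted"
    if changes_reverted:
        return "rejected"
    return "unknown"
```

Keeping the git facts raw and deriving the label afterward preserves the collect-raw, interpret-later philosophy of this inventory.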
Turn Structure
user_agent_turn_ratio
user.turn_ratio
hermes-agent-reasoning-traces · pi-mono · TRAIL
Structural float
User turns divided by agent turns. Ratio near 1.0 = back-and-forth. Ratio < 0.5 = agent dominated the session, user mostly watching. Ratio > 1.5 = user was doing most of the steering.
✓ Pure count from session JSONL. No interpretation baked in at collection time.
user_turn_count
user.user_turn_count
Structural int
Raw count of user turns. Baseline denominator for all ratio signals. Separate from total turn count.
✓ Trivially extracted from session JSONL message roles.
user_msg_lengths
user.user_msg_lengths[]
Structural list[int]
Token length of each user message in order. Raw list — do not interpret at collection. Analysis can later derive slope, variance, drop-off patterns.
✓ Collect raw. A shrinking trajectory, a flat trajectory, an expanding trajectory all mean different things depending on context — decide later.
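Turn counts, the ratio, and the raw length list all fall out of one pass over the session JSONL. A sketch, assuming each line is a message object with `role` and `content` fields and approximating token length by whitespace split (the real count would need the model tokenizer):

```python
import json

def user_turn_signals(jsonl_lines):
    """Extract user_turn_count, user_msg_lengths[], and turn_ratio
    in one pass over session JSONL lines."""
    user_lengths = []
    agent_turns = 0
    for line in jsonl_lines:
        msg = json.loads(line)
        if msg.get("role") == "user":
            # Whitespace tokens as a cheap stand-in for real token counts.
            user_lengths.append(len(str(msg.get("content", "")).split()))
        elif msg.get("role") == "assistant":
            agent_turns += 1
    return {
        "user_turn_count": len(user_lengths),
        "user_msg_lengths": user_lengths,
        "turn_ratio": len(user_lengths) / agent_turns if agent_turns else None,
    }
```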
all_turn_timestamps
user.turn_timestamps[]
Structural list[str]
ISO 8601 timestamp for every turn (both user and agent), in order. Raw — collect everything. Gaps, bursts, and pacing patterns are all derivable later. A gap doesn't mean abandonment. A burst doesn't mean frustration.
⚠ Do not label gaps as "gave up" or quick replies as "frustrated" at collection time. Timestamps are facts; what they mean is analysis.
Code Region Signals
region_edit_history
user.region_edit_history[]
OpenTraces attribution (line ranges) · pi-mono branch summaries
Structural list[obj]
Per-edit record: {file, start_line, end_line, turn_index, timestamp}. Captures every edit at line-range granularity in sequence. Foundation for all "same area" analysis — coarser signals like touched_files lose this entirely.
✓ This is the most important missing signal in the current inventory. File-level is too coarse. Function/line-range is the right unit for detecting an agent struggling with a specific piece of code.
region_re_edit_count
user.region_re_edit_count
derived from OpenTraces line-range attribution
Derived int
Count of distinct code regions (file + line range) that were edited more than once. Derived from region_edit_history using overlap detection. High count = agent kept revisiting the same code, unable to get it right in one pass.
✓ Direct answer to "multiple edits over the same areas." Requires region_edit_history first.
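Overlap detection can be as simple as merging line-range intervals per file. A sketch over region_edit_history records, using the field names defined above:

```python
def region_re_edit_count(edits):
    """Count distinct regions edited more than once. Two edits share a
    region when they touch the same file with overlapping line ranges;
    overlapping edits are merged into one growing region."""
    regions = []  # each entry: [file, start, end, edit_count]
    for e in edits:
        for r in regions:
            if (r[0] == e["file"]
                    and e["start_line"] <= r[2]
                    and e["end_line"] >= r[1]):
                # Overlap: widen the region and bump its edit count.
                r[1] = min(r[1], e["start_line"])
                r[2] = max(r[2], e["end_line"])
                r[3] += 1
                break
        else:
            regions.append([e["file"], e["start_line"], e["end_line"], 1])
    return sum(1 for r in regions if r[3] > 1)
```

One caveat: a single greedy pass can under-merge when a late edit bridges two previously separate regions, which is acceptable for a first-pass signal but worth noting.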
region_edit_convergence
user.region_edit_convergence[]
derived from OpenTraces line-range attribution
Derived list[str]
For each multiply-edited region: was the diff size shrinking (converging — homing in on the fix) or growing/oscillating (diverging — agent rewriting the same code repeatedly without progress)? Values per region: converging / oscillating / expanding.
⚠ Converging re-edits are fine — iterative refinement. Oscillating re-edits on the same region are the failure signal. Without this distinction, re-edit count alone is misleading.
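Given the ordered diff sizes for one region, the three-way classification can be a plain monotonicity check. A sketch, treating any non-monotone sequence as oscillating:

```python
def classify_region_trend(diff_sizes):
    """diff_sizes: diff line counts for successive edits to one region.
    Monotone shrink = converging, monotone growth = expanding,
    anything mixed = oscillating."""
    deltas = [b - a for a, b in zip(diff_sizes, diff_sizes[1:])]
    if all(d <= 0 for d in deltas):
        return "converging"  # also covers a single edit (no deltas)
    if all(d >= 0 for d in deltas):
        return "expanding"
    return "oscillating"
```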
Agent Behavior Toward User
agent_hedging_rate
user.agent_hedging_rate
Derived float
Frequency of hedging phrases in agent turns: "I think", "might be", "probably", "I'm not sure", "I believe", "it seems". Rate per 100 tokens. Also collect as a per-turn list — rising rate across turns = agent becoming less certain as session progresses.
✓ Pure regex. No model needed. A rising trajectory is more informative than the average — agent starting confident and becoming uncertain is a different pattern from uniform low confidence throughout.
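The per-turn rate is pure regex over agent messages, and mapping it across turns yields the hedging curve directly. The phrase list below is the one named above, not an exhaustive lexicon:

```python
import re

HEDGES = re.compile(
    r"\b(i think|might be|probably|i'm not sure|i believe|it seems)\b",
    re.IGNORECASE,
)

def hedging_rate(turn_text):
    """Hedging phrases per 100 tokens for one agent turn
    (tokens approximated by whitespace split)."""
    tokens = len(turn_text.split())
    if tokens == 0:
        return 0.0
    return 100.0 * len(HEDGES.findall(turn_text)) / tokens

def hedging_curve(agent_turns):
    """Per-turn rates in order; the trajectory shape is the signal."""
    return [hedging_rate(t) for t in agent_turns]
```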
agent_hedging_curve
user.agent_hedging_curve[]
hermes-agent-reasoning-traces · pi-mono assistant messages
Derived list[float]
Hedging rate per agent turn in order. Collect raw — do not summarize to a single number at collection time. The trajectory shape (flat, rising, falling, spiking) is the signal.
✓ Collect raw list. Same philosophy as user_msg_lengths — the shape matters more than the mean.
confirmation_requests
user.confirmation_requests
Structural int
Number of times the agent asked the user to confirm before proceeding. "Should I...", "Do you want me to...", "Is it okay if...". Separate from clarification questions about task intent.
⚠ High in complex/destructive operations is correct behavior. High on trivial operations = agent not confident in its own judgment. Context-dependent — collect the count and let analysis decide.
time_to_first_edit_s
user.time_to_first_edit_s
OpenTraces attempt-span schema · pi-mono timestamps
Structural float
Seconds from session start to first file modification. Long time = agent spent many turns exploring/reading before committing to an edit. Short time = agent localized immediately. Raw number — fast is not always better.
✓ Derivable from region_edit_history timestamps vs. session start. No interpretation baked in.
per_turn_agent_latency_s
user.per_turn_latency_s[]
swival OpenTraces-compatible traces · pi-mono turn metadata
Structural list[float]
Wall-clock seconds per agent turn from receiving user message to completing response. Raw list. Slow turns during key decision points vs. fast mechanical turns are both informative — collect the full series.
✓ Directly from turn timestamps in session JSONL. Useful for cost/quality tradeoff analysis when combined with token counts.
Session Infrastructure
compaction_events
user.compaction_events[]
pi-mono compaction summaries · OpenTraces session lifecycle
Claude list[obj]
Context window compaction events from session JSONL — when the agent summarized earlier context to free up window space. Each event: {turn_index, tokens_before, tokens_after, timestamp}. More compactions = longer/denser session. Captured in pi-mono dataset as structured events.
⚠ Compaction loses information. A session with many compaction events may have the agent working from a degraded summary of earlier turns — relevant for understanding late-session drift and instruction adherence failure.
model_changes
user.model_changes[]
Claude list[obj]
Model switches mid-session: {from_model, to_model, turn_index}. From session JSONL model_change events. User switching models during a session = something about the current model wasn't working for this task.
✓ Structured event in session JSONL. Zero inference needed. A model switch is a factual event — what it means is analysis.
tool_error_rate
user.tool_error_rate
hermes-agent-reasoning-traces tool_response · TRAIL execution errors
Claude float
Fraction of tool calls that returned an error result. Bash commands that failed, files that didn't exist, edits that were rejected. Also collect as tool_errors_by_type[] per tool. High error rate = agent executing bad commands.
⚠ Some tool errors are expected (agent probing for a file that might not exist). High consecutive errors on the same tool/args = stuck, not probing.
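One pass yields both the overall rate and the "stuck, not probing" refinement noted above: the longest run of consecutive errors on the same tool. A sketch, with `tool_name`/`is_error` as assumed field names for the per-call outcome records:

```python
def tool_error_signals(results):
    """results: ordered list of {tool_name, is_error} tool outcomes.
    Returns the overall error rate plus the longest consecutive error
    streak on the same tool, which separates probing from being stuck."""
    if not results:
        return {"tool_error_rate": 0.0, "max_error_streak": 0}
    rate = sum(r["is_error"] for r in results) / len(results)
    streak = best = 0
    prev_tool = None
    for r in results:
        if r["is_error"] and r["tool_name"] == prev_tool:
            streak += 1      # same tool failing again: streak continues
        elif r["is_error"]:
            streak = 1       # error on a different tool: new streak
        else:
            streak = 0       # success resets the streak
        best = max(best, streak)
        prev_tool = r["tool_name"]
    return {"tool_error_rate": rate, "max_error_streak": best}
```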
tool_result_sizes
user.tool_result_sizes[]
Claude list[int]
Token length of each tool call result in order. Large results (e.g., massive grep output, full file reads) contribute disproportionately to context growth. Collect raw per-call — useful for tracing context window explosion.
✓ Direct from session JSONL tool result content. Combined with context_growth_curve explains exactly which tool calls caused context bloat.
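Tracing context bloat back to specific calls is a cumulative sum plus a top-k over the same list. A minimal sketch:

```python
def context_bloat_culprits(result_sizes, top_n=3):
    """result_sizes: token length of each tool result, in call order.
    Returns the running total (one point per call, mirroring
    context_growth_curve) and the indices of the top_n largest results."""
    cumulative, total = [], 0
    for size in result_sizes:
        total += size
        cumulative.append(total)
    top = sorted(range(len(result_sizes)),
                 key=lambda i: result_sizes[i], reverse=True)[:top_n]
    return cumulative, top
```

A single 5,000-token grep result dominating the running total pinpoints exactly which call caused the context window explosion.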
Multi-Session Context
prior_sessions_same_repo
user.prior_sessions_same_repo
Structural int
Count of prior sessions on the same repository. First session on a codebase is a cold start — different baseline for localization difficulty. Returning user has context the agent doesn't.
✓ Derivable from session metadata (repo path + user ID). No analysis needed at collection time.
prior_sessions_same_region
user.prior_sessions_same_region
OpenTraces cross-session attribution · pi-mono session metadata
Derived int
Count of prior sessions that touched the same file+line-range as this session. Persistent re-visits to the same code region across sessions = either a genuinely hard area or a recurring failure to fix it properly. Requires region_edit_history across sessions.
✓ Requires cross-session join on region_edit_history. Worth collecting — a region touched in 5 separate sessions is a signal nothing else surfaces.
session_reopened_within
user.session_reopened_within_s
Structural float
Seconds until the same user opened a new session on the same repo after this one ended. Null if no new session within 24h. Raw number — do not label as "failed" at collection time. A quick restart could be scope continuation, not failure.
⚠ Collect raw. Whether a quick reopen means the first session failed is analysis, not collection.