`telemetry.sessions: true` is set in eval runs, so the agent writes per-turn JSONL to `/root/.subq/agent/sessions/` inside each container — but this directory is never collected as a Modal artifact. All Reasoning and Tokens signals are currently unavailable from eval runs. They are available locally (94 files in `~/.subq/agent/sessions/`), but those are interactive sessions, not eval trials.
The fix is a single artifact-config entry that collects the session directory:

```json
{
  "source": "/root/.subq/agent/sessions",
  "destination": "sessions",
  "type": "directory"
}
```
Source: `stdout.txt` `[subq-eval]` lines and `result.json`. Highest reliability — source of truth from the harness.
| Signal | Source | Type | Status | Description | Validity / Pitfalls |
|---|---|---|---|---|---|
| `turn_count`<br>`behavior.turn_count` | Harness | int | Extracted | Number of agent turns before termination. Parsed from `[subq-eval] Turns: N`. | ⚠ Poor proxy alone. Literature shows turn count conflates task difficulty with agent efficiency: a 3-turn success on a hard task and a 3-turn failure on an easy one are not comparable. Condition on task difficulty. |
| `tool_call_count`<br>`behavior.tool_call_count` | Harness | int | Extracted | Total tool calls made across the attempt. From `[subq-eval] Tool calls: N`. | ⚠ Counts bash and read equally. A 40-call success with 38 reads looks the same as 40 edits. Decompose by tool type for signal quality. |
| `edit_calls`<br>`behavior.edit_calls` | Harness | int | Extracted | Calls to the edit / `str_replace_editor` tool specifically. From `[subq-eval] Edit calls: N`. | ✓ More targeted than total tool calls, but incomplete — does not capture bash-mediated writes (see `bash_mediated_edits`). |
| `bash_mediated_edits` ★<br>`behavior.bash_mediated_edits`<br>pi-mono · swival · OpenTraces | Harness | bool | Extracted | True when `edit_calls = 0` but `touched_files ≠ ∅`. Agent modified files via bash (`sed -i`, heredocs) — invisible to the edit counter. | ⚠ Common failure pattern in evaluated traces. These patches are structurally different (no diff capture at call time). Worth flagging as a separate behavioral cluster. |
| `touched_files`<br>`behavior.touched_files[]` | Harness | list[str] | Extracted | File paths modified during the attempt. From `[subq-eval] Touched files: path (N diff lines)`. Includes bash-modified files. | ✓ Reliable. Derived from a diff against the pre-attempt snapshot, not from tool call logging. Gold standard for "did the agent change anything." |
| `diff_lines`<br>`behavior.diff_lines` | Harness | int | Extracted | Total lines changed across all touched files. Sum of per-file diff line counts from stdout. | ⚠ Large diffs aren't always better. Reformatting or whitespace changes inflate this. Correlates weakly with correctness alone. |
| `tool_sequence` ★<br>`behavior.tool_sequence[]`<br>hermes-agent-reasoning-traces · pi-mono · OpenTraces TAO loop | Harness | list[str] | Not extracted | Ordered list of tool names called. Enables bigram/trigram analysis: read→edit, bash→bash→bash (stuck loop), edit→bash(test) (verify-after-edit). | ⚠ Requires parsing the full stdout interleave, not just summary lines. High value for behavioral clustering. Not yet implemented. |
| `files_read_before_edit`<br>`behavior.files_read_before_edit` | Harness | int | Not extracted | How many distinct files the agent read before making its first edit. Proxy for localization thoroughness. | ⚠ Requires stdout parsing. Literature suggests over-exploration (high reads, delayed edit) correlates with uncertainty and lower patch quality in SWE-bench settings. |
| `localization_mode`<br>`behavior.localization_mode` | Harness | enum | Extracted | Controller classification of how the attempt was located: `attempt1Mode`, `trialRecoveryMode`, `finalFailureMode`. From `LocalizationSignal` on `HarborResult`. | ✓ Structural signal from the harness controller, not the agent. Clean separation of concerns. |
| `edit_to_read_ratio`<br>`behavior.edit_to_read_ratio` (derived) | Derived | float | Not extracted | Ratio of edit calls (including bash-mediated) to read calls. High ratio = decisive agent; low ratio = over-exploratory or stuck. | ⚠ Needs `tool_sequence` first. Good discriminator between behavioral clusters once sequence data exists. |
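As a concrete sketch, the extracted harness signals above could be parsed from `stdout.txt` roughly like this. The exact `[subq-eval]` line formats are assumptions based on the examples quoted in the table, not a confirmed spec:

```python
import re

# Assumed summary-line formats, taken from the table's quoted examples.
SUMMARY_PATTERNS = {
    "turn_count": re.compile(r"\[subq-eval\] Turns: (\d+)"),
    "tool_call_count": re.compile(r"\[subq-eval\] Tool calls: (\d+)"),
    "edit_calls": re.compile(r"\[subq-eval\] Edit calls: (\d+)"),
}
TOUCHED = re.compile(r"\[subq-eval\] Touched files: (\S+) \((\d+) diff lines\)")

def parse_behavior(stdout: str) -> dict:
    sig = {}
    for name, pat in SUMMARY_PATTERNS.items():
        m = pat.search(stdout)
        sig[name] = int(m.group(1)) if m else None
    sig["touched_files"] = [m.group(1) for m in TOUCHED.finditer(stdout)]
    sig["diff_lines"] = sum(int(m.group(2)) for m in TOUCHED.finditer(stdout))
    # bash_mediated_edits: files changed, but the edit tool was never called
    sig["bash_mediated_edits"] = (sig["edit_calls"] == 0
                                  and len(sig["touched_files"]) > 0)
    return sig
```

This covers only the summary lines; `tool_sequence` and `files_read_before_edit` would still need the full stdout interleave, as the table notes.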
Source: session JSONL (`~/.subq/agent/sessions/*.jsonl`). Currently unavailable from Modal eval runs: the agent writes `sessions/*.jsonl` inside each container, but the directory is not collected from Modal containers; the fix is to add the session directory to the artifact config. Local interactive sessions at `~/.subq/agent/sessions/` (94 files) confirm the format is correct and rich. Until artifact collection is added, these signals cannot be computed for eval trials.
| Signal | Source | Type | Status | Description | Validity / Pitfalls |
|---|---|---|---|---|---|
| `has_reasoning_blocks`<br>`reasoning.has_reasoning_blocks` | Claude | bool | Needs artifact | Whether the model emitted extended thinking blocks in any turn. In session JSONL: message events with `type: "thinking"` content blocks. | ⚠ Models vary: some emit long think blocks on easy tasks, short on hard ones. Binary presence is weak; length distribution is stronger. |
| `reasoning_tokens_total`<br>`reasoning.reasoning_tokens_total` | Claude | int | Needs artifact | Sum of output tokens attributed to thinking blocks across all turns. Not reported separately in session JSONL — requires summing `content[].text.length` for thinking blocks. | ⚠ Per literature: reasoning length has diminishing returns and can anti-correlate with performance on straightforward tasks (overthinking). |
| `plan_steps_detected`<br>`reasoning.plan_steps_detected` | Claude | int | Needs artifact | Heuristic count of numbered/bulleted plan steps in the first assistant turn's text. Proxy for whether the agent formed an explicit plan before acting. | ⚠ Regex-based, fragile. Agents can reason implicitly without numbered steps. Use as soft signal, not hard feature. |
| `self_correction_count`<br>`reasoning.self_correction_count` | Claude | int | Needs artifact | Times the agent explicitly revisited or abandoned a prior approach ("actually…", "wait, that won't work…"). Detectable from reasoning text in session JSONL. | ✓ High-value signal per process reward model literature. Agents that self-correct mid-attempt tend to recover from localization errors more successfully. |
| `phase_transitions`<br>`reasoning.phase_transitions[]` | Claude | list[str] | Needs artifact | Sequence of detected phase labels: explore → localize → implement → verify. From `plannotator-phase` custom events in session JSONL. | ✓ Already emitted by the agent as structured events. Zero parsing heuristic needed — purely structural once JSONL is collected. |
| `plannotator_phases` ★<br>`reasoning.plannotator_phases[]`<br>pi-mono custom_message events · hermes-agent-reasoning-traces | Claude | list[obj] | Needs artifact | Full plannotator event objects from session JSONL: `{"type":"custom_message","event":"plannotator-phase","phase":"implement"}`. Ground-truth phase timeline. | ✓ Structured, agent-emitted, no inference required. Best leading indicator of agent trajectory quality. Priority to unlock once the artifact gap is fixed. |
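A minimal sketch of pulling the structural reasoning signals out of session JSONL, assuming the event shapes quoted in the table above (the `custom_message` / `plannotator-phase` object and `thinking` content blocks); any field not quoted there is an assumption:

```python
import json

def reasoning_signals(jsonl_lines):
    """Extract plannotator_phases and has_reasoning_blocks from raw JSONL lines."""
    phases, has_thinking = [], False
    for line in jsonl_lines:
        ev = json.loads(line)
        # Agent-emitted phase events: structured, no heuristics needed.
        if ev.get("type") == "custom_message" and ev.get("event") == "plannotator-phase":
            phases.append(ev["phase"])
        # Extended thinking: content blocks with type == "thinking".
        content = ev.get("content")
        if isinstance(content, list):
            for block in content:
                if isinstance(block, dict) and block.get("type") == "thinking":
                    has_thinking = True
    return {"plannotator_phases": phases, "has_reasoning_blocks": has_thinking}
```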
Source: `verifier/report.json` and `result.json`.
| Signal | Source | Type | Status | Description | Validity / Pitfalls |
|---|---|---|---|---|---|
| `reward` ★<br>`verification.reward`<br>TRAIL · swival · SWE-bench | Harness | float | Extracted | Binary reward: 1.0 = all FAIL_TO_PASS tests pass with no regressions; 0.0 otherwise. From `result.json`. | ✓ Ground truth for the autoresearch metric. But binary — misses partial progress. Use alongside `patch_outcome_label` for richer signal. |
| `patch_outcome_label`<br>`verification.patch_outcome_label` | Derived | enum | Not extracted | 5-value enum derived from verifier test counts. See enum block above. Computed from `verifier/report.json` FAIL_TO_PASS / PASS_TO_FAIL counts. | ✓ Discriminates partial progress and regressions that reward=0 collapses together. Critical for failure analysis and cluster labeling. |
| `fail_to_pass_count` ★<br>`verification.fail_to_pass_count`<br>TRAIL · SWE-bench verifier | Harness | int | Extracted | Tests that were failing before the patch and pass after. The primary "fixed" signal. From `verifier/report.json` → FAIL_TO_PASS. | ✓ Gold signal. Task-level ground truth. Use as the numerator for partial fix rate. |
| `pass_to_fail_count` ★<br>`verification.pass_to_fail_count`<br>TRAIL · SWE-bench verifier | Harness | int | Extracted | Tests that were passing before and fail after — regressions introduced by the patch. From `verifier/report.json` → PASS_TO_FAIL. | ✓ Any non-zero value is a quality signal. Regressions should be weighted heavily in any scoring function. |
| `pass_to_pass_count`<br>`verification.pass_to_pass_count` | Harness | int | Extracted | Tests that passed before and pass after — preserved behavior. From `verifier/report.json` → PASS_TO_PASS. | ⚠ High counts expected on every run. Useful only as a denominator for regression rate, not as a quality signal by itself. |
| `fail_to_fail_count`<br>`verification.fail_to_fail_count` | Harness | int | Extracted | Tests that failed before and still fail after — no progress on these. From `verifier/report.json` → FAIL_TO_FAIL. | ⚠ A high count means a partial fix or a complete miss. Good for diagnosing the scope of remaining failures. |
| `patch_applies`<br>`verification.patch_applies` | Harness | bool | Extracted | Whether the generated patch applied cleanly to the codebase. False → unverifiable outcome. From `result.json`. | ✓ Precondition for all other verification signals. Must gate the entire verification layer. |
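The `patch_outcome_label` derivation is deterministic once the verifier counts exist. A sketch with illustrative label names (the real 5-value enum is defined elsewhere in this document and may differ):

```python
def patch_outcome_label(patch_applies: bool, f2p: int, p2f: int, f2f: int) -> str:
    """Derive an outcome label from verifier test-transition counts.

    Label names are hypothetical placeholders for the document's enum.
    """
    if not patch_applies:
        return "not_applied"    # gates the whole verification layer
    if p2f > 0:
        return "regression"     # previously passing tests now fail
    if f2p > 0 and f2f == 0:
        return "full_fix"       # every originally failing test now passes
    if f2p > 0:
        return "partial_fix"    # some progress, some tests still failing
    return "no_progress"        # nothing fixed, nothing broken
```

Ordering matters: regressions are checked first so that a patch which fixes some tests while breaking others is still labeled a regression, matching the table's note that regressions should be weighted heavily.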
| Signal | Source | Type | Status | Description | Validity / Pitfalls |
|---|---|---|---|---|---|
| `wall_time_s` ★<br>`infra.wall_time_s` | Harness | float | Extracted | Wall clock time for the attempt in seconds. From `result.json` → `duration`. | ⚠ Affected by container cold start, inference server load, concurrency. Not pure agent time. Use as a relative comparator within a run, not across runs. |
| `timed_out` ★<br>`infra.timed_out` | Harness | bool | Extracted | Whether the container hit its timeout limit before completing. Sphinx tasks routinely time out at 50 min. From `result.json`. | ⚠ Timeout rate is task-dependent (Sphinx >> Django on verified-mini). Must segment by task set before using as an agent quality signal. |
| `attempt_index`<br>`infra.attempt_index` | Harness | int | Extracted | Which attempt within the trial this was (0-indexed). Combined with `localization_mode`, indicates whether recovery was triggered. | ✓ Essential for decomposing trial-level signals into attempt-span signals. Required for the attempt-span schema to be meaningful. |
| `input_tokens_total` ★<br>`tokens.input_tokens_total` | Claude | int | Needs artifact | Total input tokens consumed across all turns. In session JSONL: sum of `usage.input` per message event. | ⚠ Grows monotonically (context accumulation). Turn 1 might be 15k tokens, turn 15 might be 45k. The context growth curve is more informative than the total. |
| `cache_read_ratio` ★<br>`tokens.cache_read_ratio` | Derived | float | Needs artifact | `cacheRead / (input + cacheRead)` per turn. From session JSONL `usage.cacheRead`. High ratio = efficient context reuse. | ✓ Direct cost efficiency proxy. Cache reads are ~10x cheaper than input tokens on Claude. A low ratio on long attempts suggests cache invalidation or poor prompt structure. |
| `cost_usd` ★<br>`tokens.cost_usd` | Claude | float | Needs artifact | Estimated total USD cost of the attempt. From the session JSONL `usage.cost` object, aggregated across all turns. | ⚠ Cost per resolved task (cost / reward) is the key metric for optimization. Raw cost without outcome is incomplete. |
| `context_growth_curve` ★<br>`tokens.context_growth_curve[]`<br>OpenTraces tokens schema · pi-mono session traces | Derived | list[int] | Needs artifact | Input token count at each turn. From session JSONL: `[usage.input for each message event in order]`. Shows the context window growth trajectory. | ⚠ Confirmed in local sessions: 9k → 45k input tokens over ~15 turns. Fast growth = agent accumulating large context or codebase dumps. Plateau = efficient summarization. |
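Once session JSONL is collected, the token signals above reduce to simple aggregation over per-message usage records. A sketch, assuming records of the shape `{"input": int, "cacheRead": int}` as described in the table:

```python
def token_signals(usages):
    """Aggregate token signals from per-message usage dicts (assumed shape)."""
    growth = [u.get("input", 0) for u in usages]      # context_growth_curve
    inp = sum(growth)                                 # input_tokens_total
    cache = sum(u.get("cacheRead", 0) for u in usages)
    # cache_read_ratio, session-level: cacheRead / (input + cacheRead)
    ratio = cache / (inp + cache) if (inp + cache) else 0.0
    return {
        "input_tokens_total": inp,
        "cache_read_ratio": ratio,
        "context_growth_curve": growth,
    }
```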
| Signal | Source | Type | Status | Description | Validity / Pitfalls |
|---|---|---|---|---|---|
| `task_paraphrase_accuracy`<br>`anticipation.task_paraphrase_accuracy` | LLM judge | float | Not extracted | Semantic similarity between the user's original request and the agent's restatement of the task in its first reasoning block or response. Low score = agent misread intent from the start. | ⚠ Requires session JSONL + LLM judge. The TRAIL dataset provides human-annotated instruction_adherence labels as a reference schema. First-turn restatement is the earliest detectable signal of understanding failure. |
| `plan_optimality` ★<br>`anticipation.plan_optimality`<br>TRAIL (human-annotated) · hermes-agent-reasoning-traces | LLM judge | enum | Not extracted | Was the initial plan the agent formed close to optimal for the task? Drawn from TRAIL's annotation schema. Values: optimal / suboptimal / wrong_direction. Compare the first reasoning block's plan to the eventual solution path. | ✓ The TRAIL benchmark shows best models achieve only 11% accuracy detecting planning errors — this is genuinely hard. High-value signal precisely because it's hard to fake. Hermes traces (14.7k samples) provide training reference for what optimal plans look like across task categories. |
| `first_tool_precision`<br>`anticipation.first_tool_precision` | Harness | bool | Not extracted | Did the agent's first substantive tool call (first Read or Bash) go to a file that ended up in `touched_files`? Direct read → was it the right file? True = agent localized correctly from the start, no wide exploration needed. | ✓ Derivable from existing harness data once `tool_sequence` is extracted. High precision on the first call is the clearest behavioral signal of genuine task understanding vs. exploratory guessing. |
| `scope_drift`<br>`anticipation.scope_drift` | Derived | float | Not extracted | Ratio of files touched to files minimally required to solve the task. Score of 1.0 = minimal footprint. Score of 3.0 = agent touched 3x more than needed. Proxy for whether the agent understood the task boundaries. | ⚠ "Minimally required" is only knowable in hindsight (from the gold solution or verifier). For eval runs: compare touched files to the gold patch's file set. For live sessions: requires an LLM judge to estimate minimal scope. |
| `clarification_sought`<br>`anticipation.clarification_sought` | Claude | bool | Needs artifact | Did the agent ask a clarifying question before acting? Detectable from session JSONL: first assistant turn contains a question directed at the user rather than a plan or tool call. Correlates with prompt underspecification. | ⚠ Asking for clarification is not always good or bad — depends on task type. On a vague prompt it's correct behavior. On a clear prompt it signals the agent didn't read carefully. Must be interpreted alongside `task_paraphrase_accuracy`. |
| `assumption_count`<br>`anticipation.assumption_count` | Claude | int | Needs artifact | Heuristic count of assumption-signaling phrases in the first reasoning block: "I'll assume", "assuming", "I think you mean", "probably wants", "likely refers to". From session JSONL thinking blocks. | ⚠ A high assumption count on an underspecified prompt is normal and fine. A high assumption count on a detailed, specific prompt is a red flag — agent is not reading carefully. Attribution depends on the prompt quality score. |
| `prompt_underspecification`<br>`anticipation.prompt_underspecification` | Derived | float | Not extracted | How vague was the original prompt? Composite of: prompt token length (short = vague), absence of specific file/function references, absence of expected outcome description, ambiguous pronoun count. Gates attribution — a low score means frustration signals are unreliable as agent quality labels. | ⚠ Critical gating signal. Without this, frustration from the user gets misattributed to agent failure. Ramp reportedly doubled token spend with no visible product gains — part of that is agents faithfully executing underspecified tasks very thoroughly. |
| `conflation_turn`<br>`anticipation.conflation_turn` | LLM judge | int | Not extracted | The turn index at which the agent's scope started expanding beyond the original request — when it began solving adjacent problems, refactoring unrelated code, or introducing unrequested changes. Null if scope stayed clean. | ⚠ Early conflation (turn 2–3) = agent misread scope from the start. Late conflation (turn 10+) = task boundaries dissolved during execution, often after partial success revealed adjacent issues. These are different failure modes. |
| `instruction_adherence` ★<br>`anticipation.instruction_adherence`<br>TRAIL annotation schema · RECAP | LLM judge | enum | Not extracted | Did the agent follow the instruction as given? From TRAIL's annotation schema. Values: full / partial / ignored / contradicted. Distinct from correctness — the agent can follow instructions and still produce wrong output. | ✓ TRAIL provides human-annotated reference labels across 148 traces. Use as a benchmark for LLM judge calibration. Gemini-2.5-Pro achieves only 11% on trace debugging — instruction adherence scoring needs careful judge design. |
| `goal_alignment_at_close`<br>`anticipation.goal_alignment_at_close` | LLM judge | float | Not extracted | Semantic similarity between the user's original request and what the agent actually delivered at session end. Measures whether the final output addresses what was asked, not just whether it's technically correct. Low score with reward=1 = solved the benchmark but missed the point. | ⚠ This is the "taste" signal from the k-shape productivity post. Agents can produce correct code that misses the product goal entirely. Senior engineers catch this; junior engineers don't. The score separates CRUD velocity from genuine task completion. |
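For eval runs, `scope_drift` can be sketched as a set-size ratio against the gold patch's file set, as the table suggests. This is a simplification (it ignores line-level scope), offered only as a starting point:

```python
def scope_drift(touched_files, gold_files):
    """Files touched vs. files in the gold patch. 1.0 = minimal footprint.

    Returns None when no gold reference exists (live sessions), where the
    table says an LLM judge would have to estimate minimal scope instead.
    """
    gold = set(gold_files)
    if not gold:
        return None
    return len(set(touched_files)) / len(gold)
```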
| Signal | Source | Type | Status | Description | Validity / Research Backing |
|---|---|---|---|---|---|
| `context_pollution_curve` ★<br>`dynamics.cp_curve[]`<br>getmaxim.ai · arxiv 2602.07338 · pi-mono | Embedding | list[float] | Needs session JSONL | Per-turn CP = 1 − cosine_sim(turn_0_embed, turn_N_embed). Tracks semantic distance from the original task anchor across every turn. Sharp early rise = started wrong. Gradual then sudden jump = drift event mid-session. | ✓ No LLM needed — any sentence-transformer works. CP > 0.45 is defined as severe misalignment in production systems. The shape of the curve distinguishes "started wrong" from "drifted mid-session." |
| `max_context_pollution`<br>`dynamics.max_cp` | Embedding | float | Needs session JSONL | Peak context pollution score across all turns. Single scalar summary of how far the session drifted from the original task at its worst point. > 0.45 = severe. > 0.2 = moderate. | ✓ Scalar summary of the full curve. Useful for session-level bucketing and cluster labeling without reading the full turn sequence. |
| `drift_onset_turn`<br>`dynamics.drift_onset_turn` | Embedding | int | Needs session JSONL | First turn where CP > 0.2 (moderate drift threshold). Turn 1–2 = agent misread from start. Turn 5–10 = mid-session drift. Turn > 15 = late drift, possibly scope bleed after partial success. | ✓ Early vs. late onset separates "started wrong" from "drifted" — two completely different failure modes that demand different fixes. |
| `correction_ratio` ★<br>`dynamics.correction_ratio`<br>COLING 2025 frustration detection · pi-mono · hermes | Regex | float | Needs session JSONL | Fraction of user turns containing correction signals: "no", "that's not", "wrong", "undo", "revert", "actually", "wait", "stop", "that's incorrect". Pure regex — no model needed. High ratio = user spent most of the session steering the agent. | ⚠ COLING 2025 research confirms keyword-based approaches miss frustrated users who don't use explicit negations. Must be supplemented with message length trend and context pollution for the full picture. Treats all corrections as equivalent — severity not captured. |
| `assumption_lock_turn` ★<br>`dynamics.assumption_lock_turn`<br>arxiv 2602.07338 (Lost in Conversation) · hermes-agent-reasoning-traces | Embedding | int | Needs session JSONL | Turn where the agent stopped updating its working model despite user corrections. Signature: correction_ratio rises but the intent coherence curve flatlines. From arxiv 2602.07338 — "models lock in early assumptions and stop incorporating new user information." | ⚠ ~30% performance drop in multi-turn settings traced to this exact pattern. Models revert to an "average user" statistical prior rather than tracking individual corrections. A flat coherence curve despite high correction_ratio is the clearest diagnostic. |
| `gd_inaction_score`<br>`dynamics.gd_inaction` | Derived | float | Needs session JSONL | From arxiv 2505.02709: measures the agent's failure to abandon a wrong approach even after evidence it isn't working. Operationalized as: files/functions in the initial plan that were never touched, divided by total initial plan items. High = agent kept promising things it never executed. | ✓ The research paper tested on Claude 3.5 Sonnet — even top models show nonzero GD_inaction. Complementary to GD_actions: an agent can drift by commission (doing wrong things) or omission (failing to do the right things). |
| `gd_actions_score`<br>`dynamics.gd_actions` | Derived | float | Needs session JSONL | From arxiv 2505.02709: fraction of agent actions (tool calls, file edits) directed at areas outside the original task scope. High = agent actively investing effort in the wrong direction — not just drifting but actively executing a wrong plan. | ⚠ Requires knowing the "correct" scope — in eval runs this comes from the gold patch; in live sessions it needs an LLM judge to estimate minimal required scope. Use `scope_drift` as a simpler proxy when the gold solution is unavailable. |
| `plan_graph_edit_distance`<br>`dynamics.plan_ged` | Derived | float | Needs session JSONL | Graph Edit Distance between the plan extracted from the agent's first reasoning block and the plan at session close. From RECAP (arxiv 2509.04472). High GED = agent rewrote its approach entirely. Low GED = stayed on plan (good or bad depending on whether the plan was right). | ✓ No LLM judge needed — uses networkx GED + BERTScore for node label matching. RECAP provides a reference implementation. Structurally captures plan revision that correction_ratio misses entirely (the agent can rewrite its plan silently without the user noticing). |
| `repetition_rate`<br>`dynamics.repetition_rate` | Embedding | float | Needs session JSONL | Fraction of tool calls with cosine_sim > 0.95 to a prior call in the same session. The "stuck loop" signal. An agent running the same grep 4 times or reading the same file 3 times is stuck, not exploring. | ✓ Pure behavioral signal — no ground truth needed. Repetition is bad in all cases. Can be computed without embeddings using exact-match on (tool_name, args) pairs as a lighter proxy. |
| `user_msg_length_slope`<br>`dynamics.user_msg_length_slope` | Structural | float | Needs session JSONL | Linear regression slope over user message token lengths across turns. Negative = user giving up, messages collapsing. Positive = user re-explaining at increasing length (agent not getting it). Near-zero = healthy back-and-forth. No model needed. | ✓ Zero model dependency — pure token counts from session JSONL. A negative slope is the clearest behavioral signal of silent session abandonment, which correction_ratio cannot detect (users stop correcting and just give up). |
| `session_failure_mode`<br>`dynamics.session_failure_mode` | Derived | enum | Needs all above | Composite classification derived from the signals above. Values: clean / shifted_intent / assumption_locked / scope_bleed / stuck_loop / multi_intent_confusion. Decision tree over CP curve shape, correction_ratio, repetition_rate, and drift_onset_turn. | ⚠ This is the label that makes all other signals interpretable. Without it, a high correction_ratio could be healthy collaboration or a broken session. The classification logic is deterministic — no LLM judge needed once the input signals are computed. |
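Two of the model-free dynamics signals can be pinned down concretely. A sketch of `correction_ratio` (the keyword list is taken from the table; COLING 2025 work cautions this misses implicit frustration) and `user_msg_length_slope` (plain least-squares over token lengths, no model):

```python
import re

# Keyword list from the table; word boundaries avoid matching "nothing" etc.
CORRECTIONS = re.compile(
    r"\b(no|that's not|wrong|undo|revert|actually|wait|stop|that's incorrect)\b",
    re.IGNORECASE)

def correction_ratio(user_turns):
    """Fraction of user turns containing an explicit correction keyword."""
    if not user_turns:
        return 0.0
    return sum(1 for t in user_turns if CORRECTIONS.search(t)) / len(user_turns)

def msg_length_slope(lengths):
    """Least-squares slope over message token lengths across turn index."""
    n = len(lengths)
    if n < 2:
        return 0.0
    xbar = (n - 1) / 2
    ybar = sum(lengths) / n
    num = sum((i - xbar) * (y - ybar) for i, y in enumerate(lengths))
    den = sum((i - xbar) ** 2 for i in range(n))
    return num / den
```

A steadily shrinking trajectory like 30 → 20 → 10 tokens yields a slope of −10 per turn, the "user giving up" pattern the table describes.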
| Signal | Source | Type | Description | Notes |
|---|---|---|---|---|
| **Outcome & Acceptance** | | | | |
| `session_outcome` ★<br>`user.session_outcome`<br>OpenTraces outcome block · pi-mono session events | Structural | enum | How the session ended. See enum above. Gating label — without this, no satisfaction signal is interpretable. Derived from: last user message content, whether changes were committed, whether a new session started on the same files. | ⚠ Hardest single signal to get right. "Abandoned" is ambiguous by definition. Collect the raw end-state and derive the label later. |
| `output_applied` ★<br>`user.output_applied`<br>OpenTraces git attribution · pi-mono | Structural | bool | Did the user apply or commit the agent's output? Detectable from git state at session close — did `touched_files` end up in a commit? Strongest quality proxy available without asking the user anything. | ✓ Completely objective. No interpretation needed. A git commit after the session = output was accepted. Already partially available via the OpenTraces schema's git attribution layer. |
| `changes_reverted`<br>`user.changes_reverted` | Structural | bool | Were agent-made changes reverted before session end? Detected via git diff between the mid-session peak and the session-close state. True = user undid the work. | ✓ Objective. Complementary to `output_applied` — together they produce a clean accept/reject signal. |
| **Turn Structure** | | | | |
| `user_agent_turn_ratio` ★<br>`user.turn_ratio`<br>hermes-agent-reasoning-traces · pi-mono · TRAIL | Structural | float | User turns divided by agent turns. Ratio near 1.0 = back-and-forth. Ratio < 0.5 = agent dominated the session, user mostly watching. Ratio > 1.5 = user was doing most of the steering. | ✓ Pure count from session JSONL. No interpretation baked in at collection time. |
| `user_turn_count`<br>`user.user_turn_count` | Structural | int | Raw count of user turns. Baseline denominator for all ratio signals. Separate from total turn count. | ✓ Trivially extracted from session JSONL message roles. |
| `user_msg_lengths`<br>`user.user_msg_lengths[]` | Structural | list[int] | Token length of each user message in order. Raw list — do not interpret at collection. Analysis can later derive slope, variance, drop-off patterns. | ✓ Collect raw. A shrinking trajectory, a flat trajectory, and an expanding trajectory all mean different things depending on context — decide later. |
| `all_turn_timestamps`<br>`user.turn_timestamps[]` | Structural | list[str] | ISO timestamp for every turn (both user and agent), in order. Raw — collect everything. Gaps, bursts, pacing patterns are all derivable later. A gap doesn't mean abandonment. A burst doesn't mean frustration. | ⚠ Do not label gaps as "gave up" or quick replies as "frustrated" at collection time. Timestamps are facts; what they mean is analysis. |
| **Code Region Signals** | | | | |
| `region_edit_history` ★<br>`user.region_edit_history[]`<br>OpenTraces attribution (line ranges) · pi-mono branch summaries | Structural | list[obj] | Per-edit record: `{file, start_line, end_line, turn_index, timestamp}`. Captures every edit at line-range granularity, in sequence. Foundation for all "same area" analysis — coarser signals like `touched_files` lose this entirely. | ✓ The most important missing signal in the current inventory. File-level is too coarse. Function/line-range is the right unit for detecting an agent struggling with a specific piece of code. |
| `region_re_edit_count` ★<br>`user.region_re_edit_count`<br>derived from OpenTraces line-range attribution | Derived | int | Count of distinct code regions (file + line range) that were edited more than once. Derived from `region_edit_history` using overlap detection. High count = agent kept revisiting the same code, unable to get it right in one pass. | ✓ Direct answer to "multiple edits over the same areas." Requires `region_edit_history` first. |
| `region_edit_convergence` ★<br>`user.region_edit_convergence[]`<br>derived from OpenTraces line-range attribution | Derived | list[str] | For each multiply-edited region: was the diff size shrinking (converging — homing in on the fix) or growing/oscillating (diverging — agent rewriting the same code repeatedly without progress)? Values per region: converging / oscillating / expanding. | ⚠ Converging re-edits are fine — iterative refinement. Oscillating re-edits on the same region are the failure signal. Without this distinction, re-edit count alone is misleading. |
| **Agent Behavior Toward User** | | | | |
| `agent_hedging_rate`<br>`user.agent_hedging_rate` | Derived | float | Frequency of hedging phrases in agent turns: "I think", "might be", "probably", "I'm not sure", "I believe", "it seems". Rate per 100 tokens. Also collect as a per-turn list — a rising rate across turns = agent becoming less certain as the session progresses. | ✓ Pure regex. No model needed. A rising trajectory is more informative than the average — an agent starting confident and becoming uncertain is a different pattern from uniform low confidence throughout. |
| `agent_hedging_curve` ★<br>`user.agent_hedging_curve[]`<br>hermes-agent-reasoning-traces · pi-mono assistant messages | Derived | list[float] | Hedging rate per agent turn, in order. Collect raw — do not summarize to a single number at collection time. The trajectory shape (flat, rising, falling, spiking) is the signal. | ✓ Collect the raw list. Same philosophy as `user_msg_lengths` — the shape matters more than the mean. |
| `confirmation_requests`<br>`user.confirmation_requests` | Structural | int | Number of times the agent asked the user to confirm before proceeding: "Should I...", "Do you want me to...", "Is it okay if...". Separate from clarification questions about task intent. | ⚠ High on complex/destructive operations is correct behavior. High on trivial operations = agent not confident in its own judgment. Context-dependent — collect the count and let analysis decide. |
| `time_to_first_edit_s` ★<br>`user.time_to_first_edit_s`<br>OpenTraces attempt-span schema · pi-mono timestamps | Structural | float | Seconds from session start to first file modification. Long time = agent spent many turns exploring/reading before committing to an edit. Short time = agent localized immediately. Raw number — fast is not always better. | ✓ Derivable from `region_edit_history` timestamps vs. session start. No interpretation baked in. |
| `per_turn_agent_latency_s` ★<br>`user.per_turn_latency_s[]`<br>swival OpenTraces-compatible traces · pi-mono turn metadata | Structural | list[float] | Wall-clock seconds per agent turn, from receiving the user message to completing the response. Raw list. Slow turns during key decision points vs. fast mechanical turns are both informative — collect the full series. | ✓ Directly from turn timestamps in session JSONL. Useful for cost/quality tradeoff analysis when combined with token counts. |
| **Session Infrastructure** | | | | |
| `compaction_events` ★<br>`user.compaction_events[]`<br>pi-mono compaction summaries · OpenTraces session lifecycle | Claude | list[obj] | Context window compaction events from session JSONL — when the agent summarized earlier context to free up window space. Each event: `{turn_index, tokens_before, tokens_after, timestamp}`. More compactions = longer/denser session. Captured in the pi-mono dataset as structured events. | ⚠ Compaction loses information. A session with many compaction events may have the agent working from a degraded summary of earlier turns — relevant for understanding late-session drift and instruction adherence failure. |
| `model_changes`<br>`user.model_changes[]` | Claude | list[obj] | Model switches mid-session: `{from_model, to_model, turn_index}`. From session JSONL `model_change` events. A user switching models during a session = something about the current model wasn't working for this task. | ✓ Structured event in session JSONL. Zero inference needed. A model switch is a factual event — what it means is analysis. |
| `tool_error_rate` ★<br>`user.tool_error_rate`<br>hermes-agent-reasoning-traces tool_response · TRAIL execution errors | Claude | float | Fraction of tool calls that returned an error result: bash commands that failed, files that didn't exist, edits that were rejected. Also collect as `tool_errors_by_type[]` per tool. High error rate = agent executing bad commands. | ⚠ Some tool errors are expected (agent probing for a file that might not exist). High consecutive errors on the same tool/args = stuck, not probing. |
| `tool_result_sizes`<br>`user.tool_result_sizes[]` | Claude | list[int] | Token length of each tool call result, in order. Large results (e.g., massive grep output, full file reads) contribute disproportionately to context growth. Collect raw per call — useful for tracing context window explosion. | ✓ Direct from session JSONL tool result content. Combined with `context_growth_curve`, explains exactly which tool calls caused context bloat. |
| **Multi-Session Context** | | | | |
| `prior_sessions_same_repo`<br>`user.prior_sessions_same_repo` | Structural | int | Count of prior sessions on the same repository. The first session on a codebase is a cold start — a different baseline for localization difficulty. A returning user has context the agent doesn't. | ✓ Derivable from session metadata (repo path + user ID). No analysis needed at collection time. |
| `prior_sessions_same_region` ★<br>`user.prior_sessions_same_region`<br>OpenTraces cross-session attribution · pi-mono session metadata | Derived | int | Count of prior sessions that touched the same file+line-range as this session. Persistent revisits to the same code region across sessions = either a genuinely hard area or a recurring failure to fix it properly. Requires `region_edit_history` across sessions. | ✓ Requires a cross-session join on `region_edit_history`. Worth collecting — a region touched in 5 separate sessions is a signal nothing else surfaces. |
| `session_reopened_within`<br>`user.session_reopened_within_s` | Structural | float | Seconds until the same user opened a new session on the same repo after this one ended. Null if no new session within 24h. Raw number — do not label as "failed" at collection time. A quick restart could be scope continuation, not failure. | ⚠ Collect raw. Whether a quick reopen means the first session failed is analysis, not collection. |
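`region_re_edit_count` reduces to line-range overlap detection over `region_edit_history` records. A sketch, assuming the `{file, start_line, end_line}` record shape given above (other fields ignored here):

```python
def region_re_edit_count(history):
    """Count distinct regions (file + line range) edited more than once.

    A later edit counts as a re-edit of an earlier region when its line
    range overlaps a previously seen range in the same file.
    """
    seen = {}        # file -> list of (start, end) ranges already edited
    re_edited = set()
    for e in history:
        ranges = seen.setdefault(e["file"], [])
        for (s, t) in ranges:
            # Two inclusive ranges overlap iff each starts before the other ends.
            if e["start_line"] <= t and e["end_line"] >= s:
                re_edited.add((e["file"], s, t))
        ranges.append((e["start_line"], e["end_line"]))
    return len(re_edited)
```

A third overlapping edit does not inflate the count: the region is tracked by its first-seen range, matching "distinct code regions edited more than once."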