telemetry.sessions: true is set, so the agent writes per-turn JSONL to /root/.subq/agent/sessions/ inside each container — but this directory is never collected as a Modal artifact, so <think> tokens and the per-turn token breakdown are unavailable from eval runs. plannotator-phase events, by contrast, emit to events/stdout in real time during eval runs; they are not blocked by the JSONL gap and can be parsed from stdout.txt today. Do not conflate these two issues.

The artifact entry that would close the JSONL gap:

```json
{
  "source": "/root/.subq/agent/sessions",
  "destination": "sessions",
  "type": "directory"
}
```
Treatment variables (attempt1ToolCap, retryToolCap) differ between attempts — without recording them per attempt, keyed by attempt_index, fair attribution is impossible. A recording sketch follows the table below.
| Signal | Source | Type | Status | Description | Why it matters |
|---|---|---|---|---|---|
| attempt1_tool_cap ★ treatment.attempt1_tool_cap | Harness | int | Not recorded | Maximum tool calls allowed for the first attempt. From harness config attempt1ToolCap. This is the primary knob for controlling agent exploration depth. | An agent that resolved a task with 50 tool calls allowed cannot be fairly compared to one with 20. Must be recorded. |
| retry_tool_cap ★ treatment.retry_tool_cap | Harness | int | Not recorded | Maximum tool calls allowed for retry attempts. Different from attempt1 — retries typically get more budget. From harness config retryToolCap. | Retry performance is meaningless without knowing the retry budget. Two agents with different retry caps cannot be compared. |
| agent_type ★ treatment.agent_type | Harness | str | Not recorded | Which agent scaffold was used: one-shot, subq, shotgun, mini-swe-agent, etc. From the --agent CLI flag. Already in eval-history.jsonl but not in per-attempt result.json. | Must be in the attempt-span schema, not just the run-level log. |
| model_id ★ treatment.model_id | Harness | str | Not recorded | Exact model identifier used for this attempt. From AGENT_MODEL env or config. Include the full string, not just the provider — anthropic/claude-sonnet-4 vs. claude-sonnet-4-20250514 matters. | Without this, you can't separate model quality from scaffold quality. |
| dataset treatment.dataset | Harness | str | Not recorded | Dataset and version: swebench-verified@1.0, swesmith@1.0, etc. Different datasets have different difficulty distributions. | Same agent, same model — different resolve rate on different datasets. Must be recorded for any cross-run comparison. |
| concurrency treatment.concurrency | Harness | int | Not recorded | Number of parallel containers for this run. Affects inference server load, which affects wall_time_s. From the --concurrency flag. | Explains variance in wall_time_s across runs with different concurrency levels. |
| timeout_s treatment.timeout_s | Harness | int | Not recorded | Container timeout in seconds. Different timeout limits change the feasible set of tasks — Sphinx tasks need 50+ min. | A run with a 30 min timeout will have a structurally lower resolve rate on Sphinx tasks than one with 60 min. |
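A sketch of what recording these could look like: assemble the treatment block from harness config at attempt start and merge it into the attempt's result.json. The dotted field names follow the table; the config key names beyond attempt1ToolCap / retryToolCap and the write hook itself are assumptions about the harness, not its actual API.

```python
import json
import os

def treatment_block(config: dict) -> dict:
    """Per-attempt treatment block. Sources mirror the 'From ...' notes
    in the Description column; exact config keys are assumptions."""
    return {
        "attempt1_tool_cap": config.get("attempt1ToolCap"),
        "retry_tool_cap": config.get("retryToolCap"),
        "agent_type": config.get("agent"),           # from the --agent flag
        "model_id": os.environ.get("AGENT_MODEL") or config.get("model"),
        "dataset": config.get("dataset"),            # e.g. swebench-verified@1.0
        "concurrency": config.get("concurrency"),    # from the --concurrency flag
        "timeout_s": config.get("timeoutS"),         # hypothetical key name
    }

def record_treatment(result_path: str, config: dict) -> None:
    """Hypothetical write hook: merge the block into result.json."""
    with open(result_path) as f:
        result = json.load(f)
    result["treatment"] = treatment_block(config)
    with open(result_path, "w") as f:
        json.dump(result, f, indent=2)
```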
Sources: stdout.txt [subq-eval] lines and result.json. Highest reliability — the source of truth from the harness.
| Signal | Source | Type | Status | Description | Validity / Pitfalls |
|---|---|---|---|---|---|
| turn_count behavior.turn_count | Harness | int | Extracted | Number of agent turns before termination. Parsed from [subq-eval] Turns: N. | ⊕ Poor proxy alone. Turn count conflates task difficulty with agent efficiency. A 3-turn success on a hard task ≠ a 3-turn failure on an easy one. Use conditioned on task difficulty. |
| tool_call_count behavior.tool_call_count | Harness | int | Extracted | Total tool calls made across the attempt. From [subq-eval] Tool calls: N. | ⊕ Counts bash and read equally. A 40-call success with 38 reads looks the same as 40 edits. Decompose by tool type for signal quality. |
| edit_calls behavior.edit_calls | Harness | int | Extracted | Calls to the edit / str_replace_editor tool specifically. From [subq-eval] Edit calls: N. | ✓ More targeted than total tool calls. But incomplete — does not capture bash-mediated writes (see bash_mediated_edits). |
| bash_mediated_edits ★ behavior.bash_mediated_edits pi-mono · swival · OpenTraces | Harness | bool | Extracted | True when edit_calls = 0 but touched_files ≠ ∅. Agent modified files via bash (sed -i, heredocs) — invisible to the edit counter. | ⊕ Common failure pattern in evaluated traces. These patches are structurally different (no diff capture at call time). Worth flagging as a separate behavioral cluster. |
| touched_files behavior.touched_files[] | Harness | list[str] | Extracted | File paths modified during the attempt. From [subq-eval] Touched files: path (N diff lines). Includes bash-modified files. | ✓ Reliable. Derived from a diff against the pre-attempt snapshot, not from tool call logging. Gold standard for "did the agent change anything." |
| diff_lines behavior.diff_lines | Harness | int | Extracted | Total lines changed across all touched files. Sum of per-file diff line counts from stdout. | ⊕ Large diffs aren't always better. Reformatting or whitespace changes inflate this. Correlates weakly with correctness alone. |
| tool_sequence ★ behavior.tool_sequence[] hermes-agent-reasoning-traces · pi-mono · OpenTraces TAO loop | Harness | list[str] | Not extracted | Ordered list of tool names called. Enables bigram/trigram analysis: read→edit, bash→bash→bash (stuck loop), edit→bash(test) (verify-after-edit). See the sketch after this table. | ⊕ Requires parsing the full stdout interleave, not just summary lines. High value for behavioral clustering. Not yet implemented. |
| localization_mode behavior.localization_mode | Harness | enum | Extracted | Controller classification of the attempt's localization mode: attempt1Mode, trialRecoveryMode, finalFailureMode. From LocalizationSignal on HarborResult. | ✓ Structural signal from the harness controller, not the agent. Clean separation of concerns. |
| Localization Quality — wrong file = no chance (NEW) | | | | | |
| localized_correct_file ★ behavior.localized_correct_file | Derived | bool | Not extracted | Did the agent's first edit target a file that appears in the gold patch? touched_files[0] ∈ gold_patch_files. Wrong file on first edit = localization failure, distinct from patch quality failure. | ✓ Separates "found the right place but wrote the wrong fix" from "never found the right place." These have completely different root causes: the first is a reasoning failure, the second is a search/codebase-nav failure. Derivable from existing data + gold patch. |
| files_searched_before_edit ★ behavior.files_searched_before_edit | Harness | int | Not extracted | How many distinct files the agent read/grepped before making its first edit. Proxy for localization thoroughness. | ⊕ Requires stdout parsing. Over-exploration (high reads, delayed edit) correlates with uncertainty and lower patch quality in SWE-bench settings. But 0 reads before edit = blind patching, also bad. |
| files_read_before_edit behavior.files_read_before_edit | Harness | int | Not extracted | How many distinct files the agent read (Read tool only) before making its first edit. Narrower than files_searched_before_edit. | ⊕ Requires stdout parsing. |
| edit_to_read_ratio behavior.edit_to_read_ratio (derived) | Derived | float | Not extracted | Ratio of edit calls (including bash-mediated) to read calls. High ratio = decisive agent; low ratio = over-exploratory or stuck. | ⊕ Needs tool_sequence first. Good discriminator between behavioral clusters once sequence data exists. |
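Once tool_sequence is extracted, the bigram analysis and the bash-mediated-edit flag from the rows above are each a few lines. A sketch, assuming tool_sequence is an ordered list of tool names and edit_calls / touched_files come from the already-extracted summary lines:

```python
from collections import Counter

def bigrams(tool_sequence: list[str]) -> Counter:
    """Count adjacent tool-name pairs, e.g. ('read', 'edit')."""
    return Counter(zip(tool_sequence, tool_sequence[1:]))

def stuck_loop(tool_sequence: list[str], n: int = 3) -> bool:
    """True when the same tool was called n or more times in a row
    (the bash→bash→bash pattern)."""
    run = 1
    for prev, cur in zip(tool_sequence, tool_sequence[1:]):
        run = run + 1 if cur == prev else 1
        if run >= n:
            return True
    return False

def bash_mediated_edits(edit_calls: int, touched_files: list[str]) -> bool:
    """Files changed but the edit tool was never called: the edits
    happened via bash (sed -i, heredocs), invisible to edit_calls."""
    return edit_calls == 0 and len(touched_files) > 0
```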
Sources: stdout plannotator events (available now) and the per-turn session JSONL (~/.subq/agent/sessions/*.jsonl), which is not collected from Modal eval runs.
| Signal | Source | Type | Status | Description | Validity / Pitfalls |
|---|---|---|---|---|---|
| Available now — from stdout events | | | | | |
| plannotator_phases ★ reasoning.plannotator_phases[] pi-mono custom_message events · hermes-agent-reasoning-traces | Stdout | list[obj] | Available | Plannotator event objects from stdout: {"event":"plannotator-phase","phase":"implement"}. Ground-truth phase timeline. Emits in real time — does NOT require JSONL collection. | ✓ Structured, agent-emitted, no inference required. Best leading indicator of agent trajectory quality. Parse from stdout.txt today. |
| phase_transitions ★ reasoning.phase_transitions[] | Stdout | list[str] | Available | Sequence of detected phase labels: explore → localize → implement → verify. Derived from plannotator_phases. Already available from stdout. | ✓ Zero parsing heuristics needed — purely structural once plannotator events are collected. |
| Blocked — requires session JSONL from Modal | | | | | |
| has_reasoning_blocks reasoning.has_reasoning_blocks | Claude | bool | Needs artifact | Whether the model emitted extended thinking blocks (<think> tokens) in any turn. In session JSONL: message events with type: "thinking" content blocks. | ⊕ Binary presence is weak; the length distribution is stronger. |
| reasoning_tokens_total reasoning.reasoning_tokens_total | Claude | int | Needs artifact | Sum of output tokens attributed to thinking blocks across all turns. Requires summing content[].text.length for thinking blocks from JSONL. | ⊕ Reasoning length has diminishing returns and can anti-correlate with performance on straightforward tasks (overthinking). |
| plan_steps_detected reasoning.plan_steps_detected | Claude | int | Needs artifact | Heuristic count of numbered/bulleted plan steps in the first assistant turn's text. Proxy for whether the agent formed an explicit plan before acting. | ⊕ Regex-based, fragile. Agents can reason implicitly without numbered steps. Use as a soft signal, not a hard feature (regex sketch after this table). |
| self_correction_count reasoning.self_correction_count | Claude | int | Needs artifact | Times the agent explicitly revisited or abandoned a prior approach ("actually…", "wait, that won't work…"). Detectable from reasoning text in session JSONL. | ⊕ Direction is ambiguous in a coding-eval context. Correction cycles often correlate with confusion and looping, not capability. In SWE-bench data, high self-correction on failed tasks = a stuck agent revisiting the same wrong approach. Must be paired with reward before interpreting. Useful signal, but not unconditionally positive. |
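Both heuristics in the blocked half of this table reduce to regexes over session JSONL text. A sketch; the phrase lists follow the descriptions above and are deliberately conservative:

```python
import re

# Numbered ("1." / "2)") or bulleted ("-", "*") lines at line start.
PLAN_STEP = re.compile(r"^\s*(?:\d+[.)]|[-*])\s+\S", re.MULTILINE)

# Correction-signaling phrases from the self_correction_count row.
SELF_CORRECTION = re.compile(
    r"\b(actually|wait|that won'?t work|let me reconsider|on second thought)\b",
    re.IGNORECASE,
)

def plan_steps_detected(first_assistant_text: str) -> int:
    """Count numbered/bulleted plan lines in the first assistant turn.
    Fragile by design: a soft signal, not a hard feature."""
    return len(PLAN_STEP.findall(first_assistant_text))

def self_correction_count(reasoning_texts: list[str]) -> int:
    """Count correction phrases across all reasoning blocks. Pair with
    reward before interpreting (see the pitfalls column)."""
    return sum(len(SELF_CORRECTION.findall(t)) for t in reasoning_texts)
```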
Sources: verifier/report.json and result.json.
| Signal | Source | Type | Status | Description | Validity / Pitfalls |
|---|---|---|---|---|---|
| reward ★ verification.reward TRAIL · swival · SWE-bench | Harness | float | Extracted | Binary reward: 1.0 = all FAIL_TO_PASS tests pass with no regressions; 0.0 otherwise. From result.json. | ✓ Ground truth for the autoresearch metric. But binary — misses partial progress. Use alongside patch_outcome_label for richer signal. |
| patch_outcome_label verification.patch_outcome_label | Derived | enum | Not extracted | 5-value enum derived from verifier test counts. See the enum block above. Computed from verifier/report.json FAIL_TO_PASS / PASS_TO_FAIL counts (derivation sketch after this table). | ✓ Discriminates partial progress and regressions that reward=0 collapses together. Critical for failure analysis and cluster labeling. |
| fail_to_pass_count ★ verification.fail_to_pass_count TRAIL · SWE-bench verifier | Harness | int | Extracted | Tests that were failing before the patch and pass after. The primary "fixed" signal. From verifier/report.json → FAIL_TO_PASS. | ✓ Gold signal. Task-level ground truth. Use as the numerator for partial fix rate. |
| pass_to_fail_count ★ verification.pass_to_fail_count TRAIL · SWE-bench verifier | Harness | int | Extracted | Tests that were passing before and fail after — regressions introduced by the patch. From verifier/report.json → PASS_TO_FAIL. | ✓ Any non-zero value is a quality signal. Regressions should be weighted heavily in any scoring function. |
| pass_to_pass_count verification.pass_to_pass_count | Harness | int | Extracted | Tests that passed before and pass after — preserved behavior. From verifier/report.json → PASS_TO_PASS. | ⊕ High counts are expected on every run. Useful only as the denominator for regression rate, not as a quality signal by itself. |
| fail_to_fail_count verification.fail_to_fail_count | Harness | int | Extracted | Tests that failed before and still fail after — no progress on these. From verifier/report.json → FAIL_TO_FAIL. | ⊕ A high count means a partial fix or a complete miss. Good for diagnosing the scope of remaining failures. |
| patch_applies verification.patch_applies | Harness | bool | Extracted | Whether the generated patch applied cleanly to the codebase. False → unverifiable outcome. From result.json. | ✓ Precondition for all other verification signals. Must gate the entire verification layer. |
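A sketch of the patch_outcome_label derivation. The five label names here are illustrative stand-ins (the canonical enum block is defined earlier in this doc), and the argument shapes assume the verifier report exposes per-category pass counts:

```python
def patch_outcome_label(patch_applies: bool,
                        f2p_passed: int, f2p_total: int,
                        p2f_count: int) -> str:
    """Collapse verifier counts into a 5-way outcome label.

    The branching is the point: regressions and partial progress must
    not collapse into the single reward=0 bucket.
    """
    if not patch_applies:
        return "not_applied"      # unverifiable; gates everything else
    if p2f_count > 0:
        return "regression"       # broke previously-passing tests
    if f2p_total > 0 and f2p_passed == f2p_total:
        return "resolved"         # all target tests fixed, none broken
    if f2p_passed > 0:
        return "partial_fix"      # some target tests fixed
    return "no_progress"          # nothing fixed, nothing broken
```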
| Signal | Source | Type | Status | Description | Validity / Pitfalls |
|---|---|---|---|---|---|
| wall_time_s ★ infra.wall_time_s | Harness | float | Extracted | Wall-clock time for the attempt in seconds. From result.json → duration. | ⊕ Affected by container cold start, inference server load, and concurrency. Not pure agent time. Use as a relative comparator within a run, not across runs. |
| timed_out ★ infra.timed_out | Harness | bool | Extracted | Whether the container hit its timeout limit before completing. Sphinx tasks routinely time out at 50 min. From result.json. | ⊕ Timeout rate is task-dependent (Sphinx >> Django on verified-mini). Must segment by task set before using as an agent quality signal. |
| attempt_index infra.attempt_index | Harness | int | Extracted | Which attempt within the trial this was (0-indexed). Combined with localization_mode, indicates whether recovery was triggered. | ✓ Essential for decomposing trial-level signals into attempt-span signals. Required for the attempt-span schema to be meaningful. |
| input_tokens_total ★ tokens.input_tokens_total | Claude | int | Needs artifact | Total input tokens consumed across all turns. In session JSONL: sum of usage.input per message event. | ⊕ Grows monotonically (context accumulation). The context growth curve is more informative than the total. |
| cache_read_ratio ★ tokens.cache_read_ratio | Derived | float | Needs artifact | cacheRead / (input + cacheRead) per turn. From session JSONL usage.cacheRead. High ratio = efficient context reuse. | ✓ Direct cost-efficiency proxy. Cache reads are ~10x cheaper than input tokens on Claude. A low ratio on long attempts suggests cache invalidation or poor prompt structure. |
| cost_usd ★ tokens.cost_usd | Claude | float | Needs artifact | Estimated total USD cost of the attempt. From the session JSONL usage.cost object. Aggregated across all turns. | ⊕ Cost per resolved task (cost / reward) is the key metric for optimization. Raw cost without outcome is incomplete. |
| context_growth_curve ★ tokens.context_growth_curve[] OpenTraces tokens schema · pi-mono session traces | Derived | list[int] | Needs artifact | Input token count at each turn. From session JSONL: [usage.input for each message event in order]. Shows the context window growth trajectory (extraction sketch after this table). | ⊕ Confirmed in local sessions: 9k → 45k input tokens over ~15 turns. Fast growth = agent accumulating large context or codebase dumps. Plateau = efficient summarization. |
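Once the session JSONL is collected, the token signals above fall out of one pass over the file. A sketch, assuming one JSON object per line with usage.input and usage.cacheRead on message events (field paths per the table; the exact event schema is an assumption):

```python
import json
from pathlib import Path

def token_signals(session_jsonl: str) -> dict:
    """Aggregate token signals from a session JSONL file. The per-turn
    cache_read_ratio uses the same fields turn by turn; this computes
    the session-level aggregate."""
    inputs: list[int] = []
    cache_reads: list[int] = []
    for line in Path(session_jsonl).read_text().splitlines():
        try:
            evt = json.loads(line)
        except json.JSONDecodeError:
            continue
        usage = evt.get("usage") or {}
        if "input" in usage:
            inputs.append(usage["input"])
            cache_reads.append(usage.get("cacheRead", 0))
    total_in, total_cache = sum(inputs), sum(cache_reads)
    denom = total_in + total_cache
    return {
        "input_tokens_total": total_in,
        "cache_read_ratio": total_cache / denom if denom else None,
        "context_growth_curve": inputs,  # per-turn input tokens, in order
    }
```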
patch_outcome_label and fail_to_pass_count do not exist in product sessions (no test harness).
| Signal | Source | Type | Status | Description | Validity / Pitfalls |
|---|---|---|---|---|---|
| task_paraphrase_accuracy anticipation.task_paraphrase_accuracy | LLM judge | float | Not extracted | Semantic similarity between the user's original request and the agent's restatement of the task in its first reasoning block or response. Low score = agent misread intent from the start. | ⊕ Requires session JSONL + an LLM judge. The TRAIL dataset provides human-annotated instruction_adherence labels as a reference schema. The first-turn restatement is the earliest detectable signal of understanding failure. |
| plan_optimality ★ anticipation.plan_optimality TRAIL (human-annotated) · hermes-agent-reasoning-traces | LLM judge | enum | Not extracted | Was the initial plan the agent formed close to optimal for the task? Drawn from TRAIL's annotation schema. Values: optimal / suboptimal / wrong_direction. Compare the first reasoning block's plan to the eventual solution path. | ✓ The TRAIL benchmark shows the best models achieve only 11% accuracy detecting planning errors — this is genuinely hard. High-value signal precisely because it's hard to fake. |
| first_tool_precision anticipation.first_tool_precision | Harness | bool | Not extracted | Did the agent's first substantive tool call go to a file that ended up in touched_files? True = agent localized correctly from the start. | ✓ Derivable from existing harness data once tool_sequence is extracted. The clearest behavioral signal of genuine task understanding vs. exploratory guessing. |
| scope_drift anticipation.scope_drift | Derived | float | Not extracted | Ratio of files touched to files minimally required. Score of 1.0 = minimal footprint. Score of 3.0 = agent touched 3x more than needed. | ⊕ "Minimally required" needs an LLM judge in live sessions. For eval runs: compare touched files to the gold patch's file set (sketch after this table). |
| clarification_sought anticipation.clarification_sought | Claude | bool | Needs artifact | Did the agent ask a clarifying question before acting? First assistant turn contains a question directed at the user rather than a plan or tool call. | ⊕ Context-dependent. On a vague prompt it's correct behavior. On a clear prompt it signals the agent didn't read carefully. |
| assumption_count anticipation.assumption_count | Claude | int | Needs artifact | Heuristic count of assumption-signaling phrases in the first reasoning block: "I'll assume", "assuming", "I think you mean", "probably wants", "likely refers to". | ⊕ A high assumption count on an underspecified prompt is normal. High on a detailed prompt is a red flag. Attribution depends on the prompt quality score. |
| prompt_underspecification anticipation.prompt_underspecification | Derived | float | Not extracted | How vague was the original prompt? Composite of: prompt token length, absence of specific file/function references, absence of an expected-outcome description. Gates attribution — a low score means frustration signals are unreliable as agent quality labels. | ⊕ Critical gating signal. Without this, user frustration gets misattributed to agent failure. |
| conflation_turn anticipation.conflation_turn | LLM judge | int | Not extracted | The turn index at which the agent's scope started expanding beyond the original request. Null if scope stayed clean. | ⊕ Early conflation (turn 2-3) = agent misread scope from the start. Late conflation (turn 10+) = task boundaries dissolved during execution. Different failure modes. |
| instruction_adherence ★ anticipation.instruction_adherence TRAIL annotation schema · RECAP | LLM judge | enum | Not extracted | Did the agent follow the instruction as given? Values: full / partial / ignored / contradicted. Distinct from correctness. | ✓ TRAIL provides human-annotated reference labels across 148 traces. Gemini-2.5-Pro achieves only 11% on trace debugging — instruction adherence scoring needs careful judge design. |
| goal_alignment_at_close anticipation.goal_alignment_at_close | LLM judge | float | Not extracted | Semantic similarity between the user's original request and what the agent actually delivered at session end. Low score with high-quality output = solved something, but not what was asked. | ⊕ The "taste" signal. Agents can produce correct code that misses the product goal entirely. Senior engineers catch this; junior engineers don't. |
| Signal | Source | Type | Status | Description | Validity / Research Backing |
|---|---|---|---|---|---|
| context_pollution_curve ★ dynamics.cp_curve[] getmaxim.ai · arxiv 2602.07338 · pi-mono | Embedding | list[float] | Needs session JSONL | Per-turn CP = 1 − cosine_sim(turn_0_embed, turn_N_embed). Tracks semantic distance from the original task anchor across every turn (sketch after this table). | ✓ No LLM needed — any sentence-transformer works. CP > 0.45 is defined as severe misalignment in production systems. The shape of the curve distinguishes "started wrong" from "drifted mid-session." |
| max_context_pollution dynamics.max_cp | Embedding | float | Needs session JSONL | Peak context pollution score across all turns. > 0.45 = severe. > 0.2 = moderate. | ✓ Scalar summary of the full curve. Useful for session-level bucketing. |
| drift_onset_turn dynamics.drift_onset_turn | Embedding | int | Needs session JSONL | First turn where CP > 0.2. Turn 1–2 = agent misread from the start. Turn 5–10 = mid-session drift. Turn > 15 = late drift, possibly scope bleed after partial success. | ✓ Early vs. late onset separates "started wrong" from "drifted" — two different failure modes that demand different fixes. |
| correction_ratio ★ dynamics.correction_ratio COLING 2025 frustration detection · pi-mono · hermes | Regex | float | Needs session JSONL | Fraction of user turns containing correction signals: "no", "that's not", "wrong", "undo", "revert", "actually", "wait", "stop". Pure regex — no model needed. | ⊕ COLING 2025 confirms keyword-based approaches miss frustrated users who don't use explicit negations. Must be supplemented with the message length trend and context pollution. |
| assumption_lock_turn ★ dynamics.assumption_lock_turn arxiv 2602.07338 (Lost in Conversation) · hermes-agent-reasoning-traces | Embedding | int | Needs session JSONL | Turn where the agent stopped updating its working model despite user corrections. Signature: correction_ratio rises but the intent_coherence_curve flatlines. | ⊕ A ~30% performance drop in multi-turn chat agents (arxiv 2602.07338) is traced to this pattern. Caveat: that stat is from chat agents, not coding evals. In SWE-bench one-shot evals, assumption lock barely exists — the agent receives no user corrections. This signal is valid for product sessions only. |
| gd_inaction_score dynamics.gd_inaction | Derived | float | Needs session JSONL | From arxiv 2505.02709: measures the agent's failure to abandon a wrong approach even after evidence it isn't working. Files/functions in the initial plan that were never touched, divided by total initial plan items. | ✓ The paper tested on Claude 3.5 Sonnet — even top models show nonzero GD_inaction. Complementary to GD_actions. |
| gd_actions_score dynamics.gd_actions | Derived | float | Needs session JSONL | From arxiv 2505.02709: fraction of agent actions directed at areas outside the original task scope. High = agent actively investing effort in the wrong direction. | ⊕ Requires knowing the "correct" scope — in eval runs this comes from the gold patch; in live sessions it needs an LLM judge. |
| plan_graph_edit_distance dynamics.plan_ged | Derived | float | Needs session JSONL | Graph Edit Distance between the plan from the first reasoning block and the plan at session close. From RECAP (arxiv 2509.04472). High GED = agent rewrote its approach entirely. | ✓ No LLM judge needed — uses networkx GED + BERTScore for node label matching. Captures plan revision that correction_ratio misses. |
| repetition_rate dynamics.repetition_rate | Embedding | float | Needs session JSONL | Fraction of tool calls with cosine_sim > 0.95 to a prior call in the same session. The "stuck loop" signal. | ✓ Pure behavioral signal. Repetition is bad in all cases. Exact match on (tool_name, args) works as a lighter proxy. |
| user_msg_length_slope dynamics.user_msg_length_slope | Structural | float | Needs session JSONL | Linear-regression slope over user message token lengths across turns. Negative = user giving up. Positive = user re-explaining. Near-zero = healthy. No model needed. | ✓ Zero model dependency. A negative slope is the clearest signal of silent session abandonment, which correction_ratio cannot detect. |
| session_failure_mode dynamics.session_failure_mode | Derived | enum | Needs all above | Composite classification: clean / shifted_intent / assumption_locked / scope_bleed / stuck_loop / multi_intent_confusion. Decision tree over the CP curve shape, correction_ratio, repetition_rate, and drift_onset_turn. | ⊕ This is the label that makes all the other signals interpretable. The classification logic is deterministic once the input signals are computed. |
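A sketch of the embedding core behind cp_curve, max_cp, and drift_onset_turn, assuming sentence-transformers is installed and turn_texts is the ordered list of turn contents. The 0.2 threshold is the moderate cutoff quoted above; the model name is just a common lightweight default, per the table's "any sentence-transformer works":

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def cp_signals(turn_texts: list[str],
               model_name: str = "all-MiniLM-L6-v2") -> dict:
    """Context pollution per turn: 1 - cosine_sim(turn_0, turn_N)."""
    model = SentenceTransformer(model_name)
    # Normalized embeddings make the dot product equal to cosine similarity.
    embs = model.encode(turn_texts, normalize_embeddings=True)
    anchor = embs[0]
    curve = [float(1.0 - np.dot(anchor, e)) for e in embs]
    onset = next((i for i, cp in enumerate(curve) if cp > 0.2), None)
    return {
        "cp_curve": curve,
        "max_cp": max(curve),
        "drift_onset_turn": onset,  # first turn with CP > 0.2, else None
    }
```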
| Signal | Source | Type | Description | Notes |
|---|---|---|---|---|
| Outcome & Acceptance | | | | |
| session_outcome ★ user.session_outcome OpenTraces outcome block · pi-mono session events | Structural | enum | How the session ended. See the enum above. Gating label — without this, no satisfaction signal is interpretable. | ⊕ Hardest single signal to get right. "Abandoned" is ambiguous by definition. Collect the raw end-state and derive the label later. |
| output_applied ★ user.output_applied OpenTraces git attribution · pi-mono | Structural | bool | Did the user apply or commit the agent's output? Detectable from git state at session close. Strongest quality proxy without asking the user. | ✓ Completely objective. A git commit after the session = the output was accepted. |
| changes_reverted user.changes_reverted | Structural | bool | Were agent-made changes reverted before session end? Detected via git diff between the mid-session peak and the session-close state. | ✓ Objective. Complementary to output_applied. |
| Turn Structure | | | | |
| user_agent_turn_ratio ★ user.turn_ratio hermes-agent-reasoning-traces · pi-mono · TRAIL | Structural | float | User turns divided by agent turns. Ratio near 1.0 = back-and-forth. Ratio < 0.5 = agent dominated. Ratio > 1.5 = user steering. | ✓ Pure count from session JSONL. |
| user_turn_count user.user_turn_count | Structural | int | Raw count of user turns. Baseline denominator for all ratio signals. | ✓ Trivially extracted from session JSONL message roles. |
| user_msg_lengths user.user_msg_lengths[] | Structural | list[int] | Token length of each user message, in order. Raw list — do not interpret at collection. | ✓ Collect raw. Analysis can later derive slope, variance, and drop-off patterns. |
| all_turn_timestamps user.turn_timestamps[] | Structural | list[str] | ISO timestamp for every turn (both user and agent), in order. Raw — collect everything. | ⊕ Do not label gaps as "gave up" at collection time. Timestamps are facts. |
| Code Region Signals | | | | |
| region_edit_history ★ user.region_edit_history[] OpenTraces attribution (line ranges) · pi-mono branch summaries | Structural | list[obj] | Per-edit record: {file, start_line, end_line, turn_index, timestamp}. Foundation for all "same area" analysis. | ✓ The most important missing signal. File-level is too coarse; function/line-range is the right unit. |
| region_re_edit_count ★ user.region_re_edit_count derived from OpenTraces line-range attribution | Derived | int | Count of distinct code regions edited more than once. Derived from region_edit_history using overlap detection (sketch after this table). | ✓ Direct answer to "multiple edits over the same areas." Requires region_edit_history first. |
| region_edit_convergence ★ user.region_edit_convergence[] derived from OpenTraces line-range attribution | Derived | list[str] | For each multiply-edited region: was the diff size shrinking (converging) or growing/oscillating (diverging)? Values: converging / oscillating / expanding. | ⊕ Converging re-edits are fine — iterative refinement. Oscillating is the failure signal. Without this distinction, re-edit count alone is misleading. |
| Agent Behavior Toward User | | | | |
| agent_hedging_rate user.agent_hedging_rate | Derived | float | Frequency of hedging phrases: "I think", "might be", "probably", "I'm not sure". Rate per 100 tokens. | ✓ Pure regex. A rising trajectory is more informative than the average. |
| agent_hedging_curve ★ user.agent_hedging_curve[] hermes-agent-reasoning-traces · pi-mono assistant messages | Derived | list[float] | Hedging rate per agent turn, in order. Collect raw — the trajectory shape is the signal. | ✓ Collect the raw list. Same philosophy as user_msg_lengths. |
| confirmation_requests user.confirmation_requests | Structural | int | Times the agent asked the user to confirm before proceeding: "Should I...", "Do you want me to...". | ⊕ High on complex operations is correct behavior. High on trivial operations = agent not confident. |
| time_to_first_edit_s ★ user.time_to_first_edit_s OpenTraces attempt-span schema · pi-mono timestamps | Structural | float | Seconds from session start to first file modification. Raw number — fast is not always better. | ✓ Derivable from timestamps. No interpretation baked in. |
| per_turn_agent_latency_s ★ user.per_turn_latency_s[] swival OpenTraces-compatible traces · pi-mono turn metadata | Structural | list[float] | Wall-clock seconds per agent turn. Collect the full series. | ✓ Directly from turn timestamps in the session JSONL. |
| Session Infrastructure | | | | |
| compaction_events ★ user.compaction_events[] pi-mono compaction summaries · OpenTraces session lifecycle | Claude | list[obj] | Context window compaction events. Each event: {turn_index, tokens_before, tokens_after, timestamp}. | ⊕ Compaction loses information. Relevant for understanding late-session drift and instruction-adherence failure. |
| model_changes user.model_changes[] | Claude | list[obj] | Model switches mid-session: {from_model, to_model, turn_index}. | ✓ Structured event in the session JSONL. Zero inference needed. |
| tool_error_rate ★ user.tool_error_rate hermes-agent-reasoning-traces tool_response · TRAIL execution errors | Claude | float | Fraction of tool calls that returned an error result. Also collect tool_errors_by_type[] per tool. | ⊕ Some tool errors are expected (probing). High consecutive errors on the same tool/args = stuck. |
| tool_result_sizes user.tool_result_sizes[] | Claude | list[int] | Token length of each tool-call result, in order. Large results contribute disproportionately to context growth. | ✓ Combined with context_growth_curve, explains which tool calls caused context bloat. |
| Multi-Session Context | | | | |
| prior_sessions_same_repo user.prior_sessions_same_repo | Structural | int | Count of prior sessions on the same repository. | ✓ Derivable from session metadata. |
| prior_sessions_same_region ★ user.prior_sessions_same_region OpenTraces cross-session attribution · pi-mono session metadata | Derived | int | Count of prior sessions that touched the same file + line range as this session. A region touched in 5 separate sessions is a signal nothing else surfaces. | ✓ Requires a cross-session join on region_edit_history. |
| session_reopened_within user.session_reopened_within_s | Structural | float | Seconds until the same user opened a new session on the same repo after this one ended. Null if no new session within 24h. | ⊕ Collect raw. Whether a quick reopen means the first session failed is analysis, not collection. |
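A sketch of the overlap detection behind region_re_edit_count, assuming region_edit_history records carry file / start_line / end_line as described above:

```python
from collections import defaultdict

def region_re_edit_count(region_edit_history: list[dict]) -> int:
    """Count distinct code regions edited more than once.

    Two edits belong to the same region when they touch the same file
    and their line ranges overlap. Overlapping edits are merged as we
    sweep, so a region counts once however often it was revisited.
    """
    by_file: dict[str, list[tuple[int, int]]] = defaultdict(list)
    for e in region_edit_history:
        by_file[e["file"]].append((e["start_line"], e["end_line"]))

    re_edited = 0
    for ranges in by_file.values():
        ranges.sort()
        merged: list[list[int]] = []  # each entry: [start, end, edit_count]
        for start, end in ranges:
            if merged and start <= merged[-1][1]:  # overlaps the previous region
                merged[-1][1] = max(merged[-1][1], end)
                merged[-1][2] += 1
            else:
                merged.append([start, end, 1])
        re_edited += sum(1 for m in merged if m[2] > 1)
    return re_edited
```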