# UltraCode Goal (UCG) Documentation (Full)

> Complete documentation for AI consumption
> Generated: 2026-06-04
> Repository: https://github.com/armelhbobdad/bmad-module-ultracode-goal

<document path="404.md">
## 404 — this URL does not pass

No page resolves to that address. In the spirit of the rest of this project, a claim that cannot be verified does not get to advance — and that includes a URL the docs site made to itself.

Use the **search bar** at the top of the page, or jump to one of these pages that *do* resolve, grouped the way the sidebar groups them:

**Why** — [Why UltraCode Goal](./why-ultracode-goal.md)

**Try** — [Getting Started](./getting-started.md) · [How It Works](./how-it-works.md) · [Parallel Mode](./parallel-mode.md)

**Reference** — [Architecture](./architecture.md) · [Gate Model](./gate-model.md) · [Health Check](./health-check.md) · [Troubleshooting](./troubleshooting.md)

---

### Think this page should exist?

A missing page is drift — a link the docs made to themselves that no longer reads green. [Open an issue](https://github.com/armelhbobdad/bmad-module-ultracode-goal/issues/new/choose) with the URL you tried. The same instinct that makes UltraCode Goal re-loop on a failing gate applies here: if a broken path slipped through, it is a defect worth reporting.
</document>

<document path="architecture.md">
UltraCode Goal is a conductor. It orchestrates the installed BMAD epic toolbox and the TEA gates, composing Claude Code primitives — `/goal`, Auto Mode, Auto Memory, hooks, git/worktree isolation — and replaces none of them. This page covers the conductor model, the three enforcement layers in depth, the file layout, customization resolution, and why the hooks live where they do.

## The conductor model

The skill owns no implementation logic of its own for building features or running tests. What it owns is the *order*, the *gates*, and the *enforcement*. It delegates:

- **Epic toolbox** — `bmad-sprint-planning`, `bmad-create-story`, `bmad-check-implementation-readiness`, `bmad-dev-story`, `bmad-code-review`, `bmad-correct-course`, `bmad-sprint-status`, `bmad-retrospective`.
- **TEA gates** — `bmad-testarch-framework`, `-ci`, `-test-design`, `-atdd`, `-automate`, `-test-review`, `-nfr`, `-trace`.
- **Claude Code primitives** — the `/goal` loop drives execution; Auto Mode and ultracode session effort make the unattended run possible; Auto Memory carries learnings forward; hooks enforce invariants; git branches and worktrees provide isolation and rollback.

Because it is a conductor, the truth of "is this done" lives in the artifacts its delegates produce, not in the conductor's own reasoning. That is the whole design: the model arranges the work, but a script reads the verdict.

The conductor sits between three sets of things it does not own — the BMAD epic toolbox it orchestrates, the TEA gates it sequences, and the Claude Code primitives it composes:

```mermaid
flowchart TD
    UCG["UltraCode Goal conductor - owns order, gates, enforcement"]
    subgraph toolbox["BMAD epic toolbox"]
        SP["sprint-planning"]
        CS["create-story"]
        DS["dev-story"]
        CR["code-review"]
        CC["correct-course"]
    end
    subgraph tea["TEA gates"]
        TD["test-design"]
        ATDD["atdd"]
        TR["trace writes gate-decision.json"]
        NFR["nfr"]
    end
    subgraph cc["Claude Code primitives"]
        GOAL["/goal loop"]
        AUTO["Auto Mode"]
        MEM["Auto Memory"]
        HOOKS["PreToolUse + Stop hooks"]
        GIT["git branch + worktrees"]
    end
    UCG -->|"delegates building"| toolbox
    UCG -->|"sequences"| tea
    UCG -->|"composes"| cc
    classDef accent fill:#6366F1,stroke:#4F46E5,color:#fff
    class UCG accent
```

## The three enforcement layers

These are the module's non-negotiables. Each exists because the documented mechanics make the intuitive shortcut wrong (see [why](why-ultracode-goal.md)).

### 1. Deterministic gate truth

`scripts/gate_eval.py` reads TEA's `gate-decision.json` and maps its gate status to a routing verdict. It never re-derives TEA's thresholds and never reads the transcript. The `/goal` evaluator that drives execution can only see what the run surfaces — it cannot open the gate file — so it is structurally incapable of being the completion authority. The script is. See the [gate model](gate-model.md) for the full mapping, thresholds, and the fail-closed contract.

The mapping is fixed, and in production two extra signals can only downgrade an `advance`, never lift a lower verdict:

```mermaid
flowchart TD
    READ["gate_eval.py reads gate-decision.json"]
    READ --> ST{"gate_status"}
    ST -->|"PASS or WAIVED"| ADV["advance"]
    ST -->|"CONCERNS"| DEF["defer - park to ledger, keep moving"]
    ST -->|"FAIL"| REL["reloop - correct-course, re-run in budget"]
    ST -->|"NOT_EVALUATED"| ESC["escalate - stop"]
    ADV --> PROD{"production profile"}
    PROD -->|"nfr FAIL, review under 80, Block, or unreadable signal"| REL
    PROD -->|"both signals read and pass"| ADVOK["advance confirmed"]
    classDef verdict fill:#4F46E5,stroke:#3730A3,color:#fff
    class ADV,DEF,REL,ESC,ADVOK verdict
```

### 2. Hooks as invariants

Two invariants must hold throughout the run, and neither can live in memory, because memory is context the model may or may not weigh:

- **`scripts/hooks/guard_pretooluse.py`** (PreToolUse) — inspects each `git commit`/`git push`. It denies the command on a protected branch, and denies a `git commit` when no tests-ran marker (`<impl-artifacts>/.tests-ran-<story_id>`) exists for the current story. It returns a `deny` decision in the hook JSON and also exits 2 with the reason on stderr so older clients that ignore the JSON still block.
- **`scripts/hooks/budget_stop.py`** (Stop) — counts turns and accumulated tokens for the current story against `max_turns_per_story` / `story_token_budget`. On overrun it writes an escalation marker and surfaces a message, then **lets the stop proceed**. Its documented limitation: a Stop hook fires only when Claude is already trying to stop, so it cannot interrupt a `/goal` condition mid-turn — the in-condition "stop after N turns" clause and the gate's re-loop budget are the real bounds; this hook is the third, defensive layer.
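
The commit guard's decision logic can be sketched as below. This is a minimal illustration, not the contents of `guard_pretooluse.py`: the function name `guard_decision`, its signature, and the regex-based command matching are assumptions, and the sketch covers only the allow/deny decision, not the hook JSON or the exit code.

```python
import re
from pathlib import Path

# Assumed defaults, taken from the fallback values this page lists.
PROTECTED_BRANCHES = ("main", "master")


def guard_decision(command: str, branch: str, impl_artifacts: Path, story_id: str) -> tuple[str, str]:
    """Return (decision, reason) for one Bash command. Hypothetical helper."""
    is_commit = re.search(r"\bgit\s+commit\b", command) is not None
    is_push = re.search(r"\bgit\s+push\b", command) is not None
    if not (is_commit or is_push):
        return "allow", "not a commit or push"
    # Invariant 1: never commit or push on a protected branch.
    if branch in PROTECTED_BRANCHES:
        return "deny", f"{branch} is a protected branch"
    # Invariant 2: a commit needs the tests-ran marker for the current story.
    marker = impl_artifacts / f".tests-ran-{story_id}"
    if is_commit and not marker.exists():
        return "deny", f"no tests-ran marker for story {story_id}"
    return "allow", "invariants hold"
```

A real hook would then translate a `deny` into the hook JSON decision and the exit-2 fallback described above.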

Both hooks read their config from env first (so the conductor injects per-run values) and fall back to hardcoded defaults (`main`/`master`, `25`, `1_500_000`, `ultracode/epic-`). Because of that fallback, a `customize.toml` override **silently no-ops at the enforcement layer** unless the conductor passes it through the hook env — so preflight injects `ULTRACODE_PROTECTED_BRANCHES`, `ULTRACODE_IMPL_ARTIFACTS`, `ULTRACODE_MAX_TURNS`, `ULTRACODE_TOKEN_BUDGET`, and `ULTRACODE_EPIC_BRANCH_PREFIX`.
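
The env-first lookup with a hardcoded fallback might look like the sketch below. The variable names and default values come from this page; the helper `hook_config` and the comma-separated encoding of the branch list are assumptions. `ULTRACODE_IMPL_ARTIFACTS` is left out because no default is documented for it.

```python
import os

# Fallback defaults as listed on this page; the string encoding is an assumption.
DEFAULTS = {
    "ULTRACODE_PROTECTED_BRANCHES": "main,master",
    "ULTRACODE_MAX_TURNS": "25",
    "ULTRACODE_TOKEN_BUDGET": "1500000",
    "ULTRACODE_EPIC_BRANCH_PREFIX": "ultracode/epic-",
}


def hook_config(env=None) -> dict:
    """Resolve hook settings from env first, then the hardcoded defaults."""
    env = os.environ if env is None else env
    raw = {name: env.get(name, default) for name, default in DEFAULTS.items()}
    branches = raw["ULTRACODE_PROTECTED_BRANCHES"].split(",")
    return {
        "protected_branches": [b.strip() for b in branches if b.strip()],
        "max_turns": int(raw["ULTRACODE_MAX_TURNS"]),
        "token_budget": int(raw["ULTRACODE_TOKEN_BUDGET"]),
        "epic_branch_prefix": raw["ULTRACODE_EPIC_BRANCH_PREFIX"],
    }
```

With an empty environment the sketch returns the defaults, which is exactly the silent no-op case: a `customize.toml` override that was never injected as env is invisible here.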

### 3. Budget enforcement

A runaway story is bounded by three layers in order of authority: the **in-condition** "…or stop after N turns" clause inside the `/goal` condition (the real in-loop bound), the **gate re-loop budget** (a `reloop` that would exceed `max_turns_per_story` or `story_token_budget` becomes `escalate`), and the **Stop hook** as the defensive backstop described above. Rollback is git, not `/rewind` — an Epic branch off a protected branch, one commit per green story, worktree isolation under `--parallel` — because `/rewind` checkpoints miss the Bash-driven changes that make up the run.
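
The gate re-loop budget, the second of those layers, can be sketched as a small guard on the verdict. The function name `bound_reloop` is hypothetical, and treating a budget as exhausted once usage reaches the limit is an assumption about what "would exceed" means in practice.

```python
def bound_reloop(
    verdict: str,
    turns_used: int,
    tokens_used: int,
    max_turns_per_story: int = 25,
    story_token_budget: int = 1_500_000,
) -> str:
    """Turn a reloop into an escalate when the story budget is spent."""
    if verdict != "reloop":
        return verdict  # only a reloop consumes more budget
    if turns_used >= max_turns_per_story or tokens_used >= story_token_budget:
        return "escalate"
    return "reloop"
```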

## File layout

The skill routes from a thin entry point down to just-in-time stage files, deterministic scripts, and an experimental asset:

```
skills/ultracode-goal/
├── SKILL.md                       # Entry point: overview, conventions, run modes,
│                                  #   non-negotiables, the 6-stage table, headless contract
├── customize.toml                 # Config base layer (the [workflow] block)
├── references/                    # One file per stage, loaded just-in-time
│   ├── ingest-and-scope.md        #   Stage 1
│   ├── preflight.md               #   Stage 2 (the autonomy gate)
│   ├── define-done.md             #   Stage 3
│   ├── execute.md                 #   Stage 4
│   ├── gate.md                    #   Stage 5
│   └── finalize.md                #   Stage 6
├── scripts/                       # Deterministic truth (run via `uv`)
│   ├── preflight_check.py         #   mechanical preflight facts + blocker budget
│   ├── gate_eval.py               #   gate status -> verdict (the completion authority)
│   ├── health_check_fp.py         #   health-check fingerprint + seen-cache plumbing
│   └── hooks/
│       ├── guard_pretooluse.py    #   commit invariants (PreToolUse)
│       └── budget_stop.py         #   turn/token budget (Stop)
└── assets/
    └── execute-epic.workflow.js   # EXPERIMENTAL --parallel worktree fan-out
```

`SKILL.md` carries the routing and the contract; the `references/*.md` stage files carry the procedure and the testable routing conditions; the `scripts/*.py` files carry the deterministic facts the model cannot fudge; the `assets/*.js` workflow is the opt-in experimental execution path. See [how it works](how-it-works.md) for the stages and [parallel mode](parallel-mode.md) for the asset.

## Customization resolution

Configuration resolves in three layers, base → team → user, via `resolve_customization.py`:

1. **Base** — `customize.toml` in the skill root (the shipped `[workflow]` block).
2. **Team** — `{project-root}/_bmad/custom/ultracode-goal.toml`.
3. **User** — `{project-root}/_bmad/custom/ultracode-goal.user.toml`.

Merge semantics: **scalars override**, **tables deep-merge**, **arrays append**. At activation the skill runs `resolve_customization.py --skill {skill-root} --key workflow`; if that fails, it resolves the three files itself in the same order. The shipped base layer defines the run's knobs — the TEA/artifact paths (`tea_config_path`, `trace_output_dir`, `implementation_artifacts`, `deferred_work_path`), the git guardrails (`epic_branch_prefix`, `protected_branches`), the budgets (`max_turns_per_story`, `story_token_budget`), the experimental `parallel_max_concurrency`, the `allowlist_commands`, and the `on_epic_complete` hook. Teams and users override without editing the shipped file. Remember that a budget or branch override only reaches the *enforcement* layer because preflight threads it into the hook env (see layer 2 above).
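
The three merge rules can be illustrated on plain dictionaries, as a TOML parser would produce them. This is a sketch of the stated semantics, not the code of `resolve_customization.py`; the helper name `merge_layers` is an assumption.

```python
def merge_layers(base: dict, override: dict) -> dict:
    """Merge one layer over another: scalars override, tables deep-merge, arrays append."""
    merged = dict(base)
    for key, value in override.items():
        current = merged.get(key)
        if isinstance(current, dict) and isinstance(value, dict):
            merged[key] = merge_layers(current, value)  # tables deep-merge
        elif isinstance(current, list) and isinstance(value, list):
            merged[key] = current + value  # arrays append
        else:
            merged[key] = value  # scalars (and type changes) override
    return merged


base = {"workflow": {"max_turns_per_story": 25, "protected_branches": ["main", "master"]}}
team = {"workflow": {"max_turns_per_story": 40, "protected_branches": ["release"]}}
user = {"workflow": {"story_token_budget": 2_000_000}}

workflow = merge_layers(merge_layers(base, team), user)["workflow"]
```

Here `workflow` ends up with `max_turns_per_story` of 40, the user's token budget, and a `protected_branches` list that still contains `main` and `master` plus the team's `release`. Appending arrays means a later layer can add protected branches but cannot drop one.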

The three TOML layers merge once, but a branch or budget value then travels two ways — the conductor reads it directly, while the hooks only see it if preflight re-injects it as env:

```mermaid
flowchart LR
    BASE["Base - customize.toml"]
    TEAM["Team - ultracode-goal.toml"]
    USER["User - ultracode-goal.user.toml"]
    BASE --> RES["resolve_customization.py merges base then team then user"]
    TEAM --> RES
    USER --> RES
    RES --> WF["resolved workflow block"]
    WF -->|"conductor reads scalars directly"| COND["conductor stages"]
    WF -->|"preflight injects ULTRACODE_* env"| HOOKS["PreToolUse + Stop hooks"]
    HOOKS -. "no env injected, falls back to defaults" .-> DROP["override no-ops at enforcement"]
    classDef accent fill:#6366F1,stroke:#4F46E5,color:#fff
    class WF accent
```

## Why the hooks live in settings.local.json (decision D6)

The PreToolUse and Stop hooks are auto-merged into `{project-root}/.claude/settings.local.json` — machine-local, gitignored, honored after the workspace trust dialog — not into a committed settings file or memory. The reasoning: these hooks are **enforcement, not context**. A committed hook would impose this module's commit guard on every contributor and every unrelated session in the repo; a hook in memory would not block a commit at all. The machine-local file scopes enforcement to the machine actually running the unattended Epic, and the gitignore keeps it out of shared history. The skill re-merges them every run (idempotently) and asserts they are active before the run goes unattended — it does not assume a prior run left them in place. Because the file is machine-local and executes on your machine, review what is merged; see [SECURITY.md](../SECURITY.md).
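
An idempotent merge of one hook entry could be sketched as follows. The helper `merge_hook` is hypothetical, the command string is a placeholder, and the settings shape (a `hooks` object keyed by event name, each holding a list of matcher entries) is an assumption based on Claude Code's hook settings layout; check it against the current hooks reference before relying on it.

```python
def merge_hook(settings: dict, event: str, entry: dict) -> dict:
    """Add `entry` under settings['hooks'][event] unless an equal entry already exists."""
    hooks = settings.setdefault("hooks", {})
    entries = hooks.setdefault(event, [])
    if entry not in entries:
        entries.append(entry)
    return settings


# Placeholder entry; the real command the skill merges is not shown on this page.
guard = {
    "matcher": "Bash",
    "hooks": [{"type": "command", "command": "uv run scripts/hooks/guard_pretooluse.py"}],
}

settings = {"permissions": {"allow": []}}  # pre-existing local settings stay untouched
merge_hook(settings, "PreToolUse", guard)
merge_hook(settings, "PreToolUse", guard)  # the second merge is a no-op
```

Running the merge on every launch is then safe: it never duplicates an entry and never removes what was already in the file.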
</document>

<document path="cross-session-recall.md">
> **Optional, opt-in, off by default.** Cross-Session Recall is an *advisory* read of your own past runs. It informs the model; it never decides anything. It is wired so it can **never** sit in the gate or completion path, it **fails closed** during a run, and when it is off — or when claude-mem is absent — the run is byte-for-byte the same as a run without it.

## What it does

Cross-Session Recall lets UCG learn from its own history in *this* codebase. If you run [claude-mem](https://github.com/thedotmack/claude-mem), UCG consults your past runs here before it scopes a new Epic and before its preflight gate, so it surfaces what bit you last time instead of starting every Epic amnesiac. You stay in the driver's seat; it informs, it doesn't auto-decide.

The benefit compounds — your first Epic teaches it, your third Epic consults two.

## What you need

You need [claude-mem](https://github.com/thedotmack/claude-mem) installed and the `cross_session_recall` knob set to `on`. That is the whole dependency: UCG reads and writes through claude-mem's MCP tools when they are present, and does nothing recall-related when they are not.

> claude-mem is a third-party plugin maintained independently of this module. We don't bundle, endorse, or install it; Cross-Session Recall simply uses it when you already have it.

## The touchpoints

Recall touches the run in exactly three places, and a machine latch (`.mem-state.json`) gates every one of them. The latch is written **once** at Ingest and removed at Finalize close-out; while it is present a `PreToolUse` hook allows claude-mem calls only when recall is green, and denies them otherwise — fail closed. The Execute and Gate stages are never touched.

```mermaid
flowchart TD
    I["Stage 1 Ingest — write latch, advisory scope-recall read"] --> P{"Stage 2 Preflight — prior-failure recall reuses the same fetch"}
    P -->|"advisory only, voice never vote"| X["Stage 4 Execute — untouched"]
    X --> G["Stage 5 Gate — untouched, gate_eval.py only"]
    G --> F["Stage 6 Finalize — one structured write, then remove latch"]

    LATCH{".mem-state.json latch — present and green gates reads and writes; absent or not green denies, fail closed"}
    LATCH -.->|"gates"| I
    LATCH -.->|"gates"| P
    LATCH -.->|"gates"| F

    classDef gate fill:#6366F1,stroke:#4F46E5,color:#fff
    class P,LATCH gate
```

- **Ingest (read).** Before UCG scopes the Epic, one advisory read pulls prior run summaries for this repo, sanitizes them, and surfaces what recurred.
- **Preflight (read, reused).** The prior-failure recall reuses that same fetch — no second round-trip — so the preflight reasoning can see failures that bit you here before. It is advisory context, never a gate input.
- **Finalize (write).** At close-out UCG records exactly one structured observation summarizing the run's outcome and signatures. The optional retrospective reuses the recurrence counts already computed during the reads.

Outside a run — when the latch file is absent — the hook never touches your claude-mem usage at all. Recall is scoped strictly to an active UCG run.
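
The latch behaviour described above can be sketched as a single decision function. Everything named here is an assumption made for illustration: the helper `recall_hook_decision`, the `mcp__claude-mem__` tool-name prefix, and a latch schema with a `recall` field whose passing value is `"green"`. Only the file name `.mem-state.json` and the allow/deny rules come from this page.

```python
import json
from pathlib import Path

RECALL_TOOL_PREFIX = "mcp__claude-mem__"  # hypothetical tool-name prefix


def recall_hook_decision(latch_path: Path, tool_name: str) -> str:
    """Return 'allow' or 'deny' for one tool call, failing closed during a run."""
    if not tool_name.startswith(RECALL_TOOL_PREFIX):
        return "allow"  # not a claude-mem call, not this hook's concern
    if not latch_path.exists():
        return "allow"  # no active run: leave the user's claude-mem usage alone
    try:
        latch = json.loads(latch_path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        return "deny"  # unreadable latch during a run: fail closed
    if isinstance(latch, dict) and latch.get("recall") == "green":
        return "allow"
    return "deny"  # latch present but recall is not green
```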

## The trust model

The rule is **data, never directive**: recalled content is treated as facts to consider, never as instructions to follow, and it never reaches the gate.

What the sanitizer **does** before any advisory is surfaced:

- **Scopes to this repo.** A repository fingerprint pins recall to the same origin and root commit, so another project's history cannot leak in.
- **Redacts secrets.** High-precision patterns — AWS keys, `ghp_`/`gho_` tokens, `sk-` keys, bearer tokens, PEM headers, `password=`/`token=`/`api_key=` values — are replaced with `[redacted]`.
- **Neutralizes shape.** Bidi controls are stripped, backticks and code fences removed, newlines collapsed, and each surfaced title clamped to 80 codepoints — so an advisory cannot carry instruction-shaped or prompt-injection payloads.
- **Drops the stale and the foreign.** Records past the recency horizon are dropped (signatures that recurred across two or more distinct runs earn per-signal horizon immunity), foreign-project and cross-schema records are filtered out, and malformed records are discarded.
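
The redaction and shape-neutralizing steps can be sketched for a single title. The regular expressions below are illustrative stand-ins, not the sanitizer's actual patterns, and the helper name `sanitize_title` is an assumption; repo scoping and the recency horizon are out of scope for this sketch.

```python
import re

# Illustrative patterns only; the real sanitizer's patterns are not shown on this page.
SECRET_PATTERNS = [
    (re.compile(r"AKIA[0-9A-Z]{16}"), "[redacted]"),                      # AWS access key id
    (re.compile(r"\bgh[po]_[A-Za-z0-9]{20,}"), "[redacted]"),             # GitHub tokens
    (re.compile(r"\bsk-[A-Za-z0-9_-]{16,}"), "[redacted]"),               # sk- style keys
    (re.compile(r"(?i)\bbearer\s+[A-Za-z0-9._~+/=-]{8,}"), "[redacted]"),  # bearer tokens
    (re.compile(r"-----BEGIN [A-Z ]+-----"), "[redacted]"),               # PEM headers
    (re.compile(r"(?i)\b(password|token|api_key)=\S+"), r"\1=[redacted]"),
]
BIDI_CONTROLS = re.compile("[\u202a-\u202e\u2066-\u2069\u200e\u200f]")


def sanitize_title(text: str, limit: int = 80) -> str:
    """Redact secrets, strip bidi controls and backticks, collapse whitespace, clamp."""
    for pattern, replacement in SECRET_PATTERNS:
        text = pattern.sub(replacement, text)
    text = BIDI_CONTROLS.sub("", text)
    text = text.replace("`", "")   # removes inline code marks and fences
    text = " ".join(text.split())  # collapses newlines and runs of spaces
    return text[:limit]            # Python slices by codepoint
```

Redaction runs first so that a secret wrapped in backticks or split by a control character is matched before the shape is flattened.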

What it honestly does **not** do:

- It is **not** a complete secret scrubber — redaction is high-precision pattern matching, not a guarantee that nothing sensitive ever survives.
- It does **not** vote. Advisories carry only a mechanical `recurred` field (`yes`/`no`/`unknown`) — there are no LLM self-grades, and nothing recall surfaces is ever an input to `gate_eval.py`. The gate reads TEA's `gate-decision.json` and only that. See the [gate model](gate-model.md).
- It does **not** keep working when claude-mem looks wrong. If the capability contract fails — a missing tool, a malformed probe, a breaking schema change — the latch records recall as absent and the hook denies claude-mem calls for the run. Fail closed, never fail open.

## Turning it on

Set the knob in your project's `_bmad/custom/ultracode-goal.toml` (the same file the other knobs use):

```toml
[workflow]
# Cross-Session Recall — consult and record prior runs of this repo via claude-mem.
# Requires claude-mem installed; advisory only, never part of the gate. Off by default.
cross_session_recall = "on"
```

The `[workflow]` table header matters: the resolver extracts the `workflow` block from the merged files, so a bare top-level `cross_session_recall` line is silently discarded and the feature stays off.

With it `on` and claude-mem installed, the next run reads at Ingest and Preflight and writes one observation at Finalize. Nothing else about the run changes.

## Turning it off

Set the knob back to `off` (its default):

```toml
[workflow]
cross_session_recall = "off"
```

> **OFF-coherence disclosure.** Setting `cross_session_recall` to `off` disables **UCG's own** recall and write only. A separately-installed claude-mem still injects its session-start index into your Claude Code sessions — that is claude-mem's behavior, not UCG's. To stop that, configure or uninstall claude-mem itself.

## What happens when claude-mem is absent

Nothing — the run is identical. There is no fallback to emulate, no degraded path, no warning. UCG's control flow is the same whether claude-mem is installed or not; absent claude-mem, the recall touchpoints are simply no-ops and Finalize skips the write. A run with the knob `on` and claude-mem missing behaves exactly like a run with the knob `off`.

## Known limits — be honest

Recall is deliberately bounded. Treat these as the residuals you are signing up for:

- **A factual advisory still informs reasoning.** Sanitized advisories carry no instruction-shaped, stale, or foreign content and never reach the gate — but a well-formed factual advisory can still inform the model's reasoning. That is the feature, bounded; it is not a side channel into the verdict.
- **UCG serializes only its own writes.** It writes its one observation per run safely, but it cannot control other processes writing the same claude-mem store concurrently.
- **One structured write per run, in every mode.** UCG contributes exactly one structured observation at Finalize regardless of mode. Any *additional* auto-capture you see is claude-mem's own behavior and may differ in headless runs.
- **Redaction is high-precision, not complete.** Secret redaction is pattern matching tuned for precision; it is not a complete data-loss-prevention layer.
- **A breaking claude-mem change latches loudly absent.** If claude-mem ships a schema change that breaks the capability pin, recall latches as absent — loudly off, never silently wrong — until the pin is updated. Run the recall selftest to check the pin against your installed claude-mem.

## The default, and what would flip it

Cross-Session Recall ships **off**. That is the honest default: an advisory that reads your history should be enabled by default only once opt-in usage proves it earns its keep without ever touching a verdict.

The default flips to `on` only if, across sustained real-world opt-in use, recalled advisories are corroborated by run outcomes **and** there are zero gate-influence incidents. Until that bar is met, off is the honest default — and "stays off" is a perfectly valid outcome. The criterion is falsifiable on purpose: it is a claim that can fail, not a roadmap promise.
</document>

<document path="gate-model.md">
Completion in UltraCode Goal is decided by a deterministic artifact read, not by judgment. `scripts/gate_eval.py` reads TEA's `gate-decision.json` and returns a routing verdict the skill executes. This page documents the verdict mapping, the production AND, the thresholds, the fail-closed contract, and why the `/goal` evaluator alone is insufficient — all traced to [`../skills/ultracode-goal/scripts/gate_eval.py`](../skills/ultracode-goal/scripts/gate_eval.py).

## What the gate reads

The script resolves the gate artifact from the trace output directory:

1. It looks for a trace-report markdown whose frontmatter records the slim gate file (keys `gateDecisionFile` / `gateDecisionPath` / `gate_decision_path`), defaulting to `<trace-output>/gate-decision.json`.
2. If that slim file is absent, it falls back to the always-written `e2e-trace-summary.json` and lifts the gate fields from it. **The slim file's absence is normal, not a failure** — TEA only writes it when the run is gate-eligible and the decision is PASS/CONCERNS/FAIL/WAIVED.
3. If neither file is present, or the run carries no gate fields, `gate_status` is `NOT_EVALUATED`.

The script never re-derives TEA's thresholds; it reads `gate_status` as given by the trace workflow.

How the script resolves an artifact into a `gate_status`:

```mermaid
flowchart TD
    A["Scan trace-output for a trace report"] --> B{"Frontmatter names a slim gate file?"}
    B -->|"yes"| C["Use the hinted path"]
    B -->|"no"| D["Default to gate-decision.json"]
    C --> E{"Slim file present?"}
    D --> E
    E -->|"yes"| F["Read gate_status from slim file"]
    E -->|"no"| G{"e2e-trace-summary.json present?"}
    G -->|"yes, has gate fields"| H["Lift gate_status from summary"]
    G -->|"yes, no gate fields"| I["gate_status = NOT_EVALUATED"]
    G -->|"no"| I
    F --> Z["gate_status to verdict mapping"]
    H --> Z
    I --> Z
    classDef verdict fill:#4F46E5,stroke:#3730A3,color:#fff
    class Z verdict
```
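
The resolution order in the diagram can be sketched as below. This is a simplified illustration, not `gate_eval.py` itself: the frontmatter hint is passed in as `hinted` rather than parsed from a trace report, the helper names are hypothetical, and reading the status from a top-level `gate_status` key in both files is an assumption about the on-disk schema.

```python
import json
from pathlib import Path
from typing import Optional


def _read_status(path: Path) -> str:
    """Read gate_status from one JSON file; anything unreadable is NOT_EVALUATED."""
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        return "NOT_EVALUATED"
    status = data.get("gate_status") if isinstance(data, dict) else None
    return status if isinstance(status, str) and status else "NOT_EVALUATED"


def resolve_gate_status(trace_output: Path, hinted: Optional[Path] = None) -> str:
    """Prefer the slim gate file, fall back to the trace summary, else NOT_EVALUATED."""
    slim = hinted if hinted is not None else trace_output / "gate-decision.json"
    if slim.is_file():
        return _read_status(slim)
    summary = trace_output / "e2e-trace-summary.json"
    if summary.is_file():
        return _read_status(summary)
    return "NOT_EVALUATED"
```

Every failure path returns `NOT_EVALUATED` rather than raising, so a caller can map the result straight to a verdict.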

## Verdict mapping

The gate status maps to a verdict (`GATE_VERDICT` in the script):

| `gate_status` | verdict | the skill does |
|---------------|---------|----------------|
| `PASS` | `advance` | story passes; move to the next story |
| `WAIVED` | `advance` | story passes; move to the next story |
| `CONCERNS` | `defer` | append non-blocking items to the ledger, then advance anyway |
| `FAIL` | `reloop` | run `bmad-correct-course`, re-run the story within budget |
| `NOT_EVALUATED` | `escalate` | stop — the gate could not be read |

Any unrecognized status escalates (the script's `GATE_VERDICT.get(gate_status, "escalate")` default), with a `reasons` entry noting it.
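
The table and its escalate-by-default rule reduce to a dictionary lookup. The mapping and the `.get(..., "escalate")` default are as documented above; the wrapper function `route` and the wording of the unrecognized-status reason are assumptions.

```python
GATE_VERDICT = {
    "PASS": "advance",
    "WAIVED": "advance",
    "CONCERNS": "defer",
    "FAIL": "reloop",
    "NOT_EVALUATED": "escalate",
}


def route(gate_status: str) -> dict:
    """Map a gate status to a verdict; anything unrecognized escalates."""
    verdict = GATE_VERDICT.get(gate_status, "escalate")
    reasons = [f"gate_status {gate_status} -> {verdict}"]
    if gate_status not in GATE_VERDICT:
        reasons.append(f"unrecognized gate_status {gate_status!r}")
    return {"verdict": verdict, "gate_status": gate_status, "reasons": reasons}
```

A typo such as `"PASSED"` therefore stops the run instead of advancing it.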

## The production AND

Under `--profile production`, an otherwise-`advance` verdict is additionally ANDed against two TEA signals, and any failure downgrades it to `reloop`. The downgrade floor is `reloop` — a `defer`/`reloop`/`escalate` is unchanged; only an `advance` moves:

- **NFR** (`nfr-assessment.md`): the audit's Overall Status must not be `FAIL`.
- **Test review** (`test-review.md`): the Quality Score must be `>= 80` **and** the Recommendation must not be `Block`.

How the AND folds the two signals in, with every unreadable path counting as a failure:

```mermaid
flowchart TD
    V{"Verdict is advance?"} -->|"no, defer/reloop/escalate"| K["Unchanged"]
    V -->|"yes"| N{"NFR Overall Status"}
    N -->|"FAIL"| F["Signal failed"]
    N -->|"file missing or unparsable"| F
    N -->|"parsed and not FAIL"| R{"Test review"}
    R -->|"score lt 80 or Block"| F
    R -->|"score unparsable, file missing"| F
    R -->|"score gte 80 and not Block"| P["Both signals passed"]
    F --> D["Downgrade advance to reloop"]
    P --> A["Stay advance"]
    classDef verdict fill:#4F46E5,stroke:#3730A3,color:#fff
    class A,D,K verdict
```

Under `--profile light` none of this applies — the trace gate is the whole decision.
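
The AND can be sketched as a function over already-parsed signals, with `None` standing for "missing or unparsable". The function name `production_and` and its signature are hypothetical, and counting an unreadable Recommendation as a failure is an assumption made in the fail-closed spirit of the page.

```python
from typing import Optional


def production_and(
    verdict: str,
    profile: str,
    nfr_status: Optional[str],
    review_score: Optional[int],
    recommendation: Optional[str],
) -> tuple[str, list[str]]:
    """Downgrade an advance to reloop when any production signal fails or is unreadable."""
    reasons: list[str] = []
    if profile != "production" or verdict != "advance":
        return verdict, reasons  # the floor is reloop; other verdicts never move
    if nfr_status is None:
        reasons.append("nfr status unreadable")
    elif nfr_status.upper() == "FAIL":
        reasons.append("nfr status FAIL")
    if review_score is None:
        reasons.append("test-review score unreadable")
    elif review_score < 80:
        reasons.append(f"test-review score {review_score} < 80")
    if recommendation is None:
        reasons.append("test-review recommendation unreadable")  # assumption: fail closed
    elif recommendation.lower() == "block":
        reasons.append("test-review recommendation Block")
    if reasons:
        reasons.append("production signal failed; advance downgraded to reloop")
        return "reloop", reasons
    return "advance", reasons
```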

## The thresholds

The P0/P1/overall percentage thresholds — **P0 = 100%, P1 >= 90%, overall >= 80%** — are decided **upstream by the TEA trace workflow** and written into the gate artifact; `gate_eval.py` reads the resulting `gate_status`, `p0_status`, `p1_status`, and `overall_status` rather than recomputing the percentages. The script's own production AND adds the two coarser signals above (NFR != FAIL, review score >= 80 and recommendation != Block). Do not restate or recompute the TEA percentages elsewhere — they are TEA-owned, and the test-design stage's job is only to assign the P0–P3 priorities honestly so those upstream thresholds key off real priorities.

## Fail-closed contract

The production AND is deliberately fail-closed (see the `apply_production_and` docstring in the script): a missing `nfr-assessment.md` or `test-review.md`, or a field the scanners cannot parse, is treated as a **failing** signal — not a neutral or absent one. So if TEA's prose format drifts and the Overall Status or Quality Score cannot be read, an otherwise-`advance` degrades to a conservative `reloop` rather than a silent false-advance. The direction is intentional: the module would rather re-loop a green story than advance a story whose evidence it could not actually read. Likewise, a missing or corrupt gate artifact yields `NOT_EVALUATED` → `escalate` — the gate is never assumed green.

This is reinforced by the routing invariant in Stage 5: **a P0/critical FAIL never defers.** Only non-gate-blocking work (CONCERNS, non-critical findings, parked decisions) reaches the deferred-work ledger; a FAIL or a P0/critical finding re-loops within budget or escalates.

## Output shape

The script prints one JSON object (`evaluate()` in the script):

```json
{
  "verdict": "advance|defer|reloop|escalate",
  "gate_status": "PASS|CONCERNS|FAIL|WAIVED|NOT_EVALUATED",
  "p0_status": "...",
  "p1_status": "...",
  "overall_status": "...",
  "nfr_status": "...",
  "review_score": 0,
  "reasons": ["..."]
}
```

### Example — a clean production advance

A story whose slim gate file reads `PASS`, with an NFR audit of `PASS` and a test review scoring 92 with an Approve recommendation, produces:

```json
{
  "verdict": "advance",
  "gate_status": "PASS",
  "p0_status": "PASS",
  "p1_status": "PASS",
  "overall_status": "PASS",
  "nfr_status": "PASS",
  "review_score": 92,
  "reasons": [
    "gate read from gate-decision.json",
    "gate_status PASS -> advance"
  ]
}
```

### Example — a production downgrade

The same `PASS` gate, but with a test review scoring 74, downgrades to `reloop` — the gate passed, but a production signal failed:

```json
{
  "verdict": "reloop",
  "gate_status": "PASS",
  "p0_status": "PASS",
  "p1_status": "PASS",
  "overall_status": "PASS",
  "nfr_status": "PASS",
  "review_score": 74,
  "reasons": [
    "gate read from gate-decision.json",
    "gate_status PASS -> advance",
    "test-review score 74 < 80",
    "production signal failed; advance downgraded to reloop"
  ]
}
```

## Why the `/goal` evaluator alone is insufficient

The `/goal` loop that drives Execute ends with an evaluator confirming the success condition — but that evaluator only sees the transcript. It cannot open `gate-decision.json`. So it can confirm "the tests I was shown printed green" but not "TEA's deterministic gate read PASS against the traceability matrix and the NFR thresholds." Letting it be the completion authority would let the run grade itself from its own notes. `gate_eval.py` reads the file the model cannot author, which is exactly why it — and not the transcript evaluator — decides. See [why](why-ultracode-goal.md) and the routing detail in [`references/gate.md`](../skills/ultracode-goal/references/gate.md).
</document>

<document path="getting-started.md">
Install UltraCode Goal into a BMAD project, point it at an Epic, and let it run that Epic to a gate-passed Definition-of-Done. This page covers prerequisites, install, the first-run walkthrough, and the run-mode flags.

## Prerequisites

UltraCode Goal conducts BMAD and TEA skills and runs deterministic Python under `uv`. You need:

| Tool | Required for | Install |
|------|--------------|---------|
| Claude Code | **The runtime — non-negotiable.** UCG composes `/goal`, Auto Mode, Auto Memory, and runtime hooks, which only exist in Claude Code; the autonomous run cannot execute anywhere else | <https://www.anthropic.com/claude-code> |
| Node.js >= 22 | Installation, `npx` commands | <https://nodejs.org> |
| Python >= 3.10 | The deterministic gate, preflight, and hook scripts (run via `uv`) | <https://www.python.org> |
| `uv` | Running the module's Python scripts with automatic dependency management | <https://docs.astral.sh/uv/> |
| `git` | Epic-branch isolation and per-story commits (the real rollback) | <https://git-scm.com> |
| `gh` (GitHub CLI) | Submitting or queuing [health-check](health-check.md) findings | <https://cli.github.com> |
| A BMAD project with an Epic | The unit of delivery — a `_bmad/` install, a `sprint-status.yaml`, and at least one Epic with stories | see [bmad-method.org](https://docs.bmad-method.org) |

The run also depends on recent Claude Code primitives: `/goal`, dynamic workflows, and Auto Memory. The preflight script version-gates these and reports a mechanical blocker if the installed Claude Code is below the minimum any of them needs (see [troubleshooting](troubleshooting.md)).

## Install

```bash
npx bmad-module-ultracode-goal install
```

The installer is interactive — it prompts for the project name and which IDEs to configure, then copies the skill into place. As an alternative, the module can be installed from the plugin marketplace entry (`.claude-plugin/marketplace.json`) the same way as other BMAD plugins.

## First run

Invoke the skill with one of its trigger phrases — "run an epic autonomously", "execute this epic", "ultracode goal", or "autonomously deliver the epic" — in a BMAD project.

1. **Name the Epic.** Stage 1 opens the floor: name the Epic, or drop any context (a story id, a branch, a paste of the Epic body). The skill fills the gaps from the BMAD artifacts. If `_bmad/` config, `sprint-status.yaml`, and any Epic are *all* absent, this is not a BMAD project — the skill says so and stops, pointing you at `bmad-bmb-setup` and `bmad-sprint-planning`.
2. **Preflight runs.** Stage 2 is the autonomy gate. It runs a mechanical check (`preflight_check.py`), auto-remediates the fixable ambers (scaffolding the test framework, generating missing acceptance criteria, pre-creating TEA output dirs, and so on), then adds a semantic scan for undecided product or architecture decisions the script cannot see. The run launches **only** when the post-remediation intervention budget is zero and the semantic scan found no red blocker. A single undecided architecture decision stops the run here rather than letting an unattended run guess it.
3. **The launch briefing.** On an attended run, before the first unattended action the skill prints a one-screen briefing: what is about to run, the worst-case turn envelope, the autonomy line ("from here I will not ask you anything"), the kill switch (Ctrl-C, or delete the Epic branch — `/rewind` will not help), and where to watch (the run's `.decision-log.md` and `run-status.json`). One soft confirm crosses the line.
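
The launch condition in step 2 is a strict conjunction, sketched below. The function name `may_launch` and the shape of its inputs are assumptions; the real check combines the output of `preflight_check.py` with the semantic scan.

```python
def may_launch(intervention_budget: int, red_blockers: list[str]) -> bool:
    """Launch only with a zero post-remediation budget and no red blocker."""
    return intervention_budget == 0 and not red_blockers
```

A single entry in `red_blockers`, such as one undecided architecture decision, is enough to keep the run from going unattended.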

From there the run is autonomous: it defines done with TEA, executes each story to a green commit, gates each one deterministically, and finalizes with a run report and the deferred-work ledger. See [how it works](how-it-works.md) for the full stage-by-stage narration.

The whole first run, from install to run report, with the two points where it can refuse to launch and the verdict that decides each story:

```mermaid
flowchart TD
    I["npx install"] --> A["Activate skill via trigger phrase"]
    A --> N["Stage 1: name the Epic"]
    N --> BMAD{"BMAD project?"}
    BMAD -->|"no"| STOP1["Stop: point at setup skills"]
    BMAD -->|"yes"| PF["Stage 2: preflight check then auto-remediate ambers"]
    PF --> GATE{"Budget 0 and no red blocker?"}
    GATE -->|"no"| STOP2["Stop: write blockers to decision log"]
    GATE -->|"yes"| BRIEF["Arm branch, hooks, allowlist; launch briefing; one soft confirm"]
    BRIEF --> RUN["Autonomous run: define done, execute each story to a green commit"]
    RUN --> EVAL["Stage 5: gate_eval.py reads TEA gate-decision.json"]
    EVAL --> V{"verdict"}
    V -->|"advance"| NEXT["Next story, or finalize when last"]
    V -->|"defer"| NEXT
    V -->|"reloop"| RUN
    V -->|"escalate"| STOP3["Stop: surface the blocker"]
    NEXT --> RPT["Stage 6: run report and deferred-work ledger"]
    class STOP1,STOP2,STOP3 stop
    class RPT verdict
    classDef stop fill:#9CA3AF,stroke:#6B7280,color:#fff
    classDef verdict fill:#4F46E5,stroke:#3730A3,color:#fff
```

## Run-mode flags

| Flag | Effect |
|------|--------|
| `--light` | Trace-only gate. Downscopes from the full TEA chain to `bmad-testarch-trace` plus `gate_eval.py --profile light`, with no NFR or test-review check ANDed into the verdict. |
| `--parallel` | Experimental worktree fan-out. Each story runs isolated in its own worktree; no mid-run input. The sequential `/goal` spine is the default and recommended path — see [parallel mode](parallel-mode.md). |
| `--yes` | Skips Stage 1's open-floor invite and the launch confirm. The launch briefing still prints. **Never** skips the hard preflight gate. |
| `-H` | Headless. Runs non-interactively, never prompts (an unresolvable secret becomes a red blocker, not a question), and emits one JSON object at every exit point. |
| `--retro` | Runs the close-out retrospective (`bmad-retrospective`). Interactive runs offer it at Epic close anyway; headless runs it only when `--retro` is passed. |

## Hook security

At preflight the skill auto-merges its **PreToolUse** guard and **Stop** budget hook into `.claude/settings.local.json` — a machine-local, gitignored file, honored after the workspace trust dialog. These hooks are the enforcement layer that blocks a commit on a protected branch and bounds a runaway story; they are not shared into the repo. Because they execute on your machine, review what gets merged: see [SECURITY.md](../SECURITY.md) for the hook-security model and what to check before granting trust.
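
One way to review what was merged is to list every hook command in the settings file. The sketch below assumes the file follows Claude Code's hook settings layout (an event name mapping to matcher groups, each holding a list of hooks with a `command`); the helper name `list_hook_commands` is hypothetical and not part of the module.

```python
import json
from pathlib import Path


def list_hook_commands(settings_path):
    """Return (event, matcher, command) tuples found in a settings file.

    Assumed layout:
    {"hooks": {event: [{"matcher": ..., "hooks": [{"command": ...}]}]}}
    """
    path = Path(settings_path)
    if not path.exists():
        return []
    settings = json.loads(path.read_text(encoding="utf-8"))
    found = []
    for event, groups in settings.get("hooks", {}).items():
        for group in groups:
            for hook in group.get("hooks", []):
                found.append((event, group.get("matcher", ""), hook.get("command", "")))
    return found


# Print one line per hook so each command can be read before granting trust.
for event, matcher, command in list_hook_commands(".claude/settings.local.json"):
    print(f"{event} [{matcher or '*'}]: {command}")
```

Running it from the project root prints nothing if the file does not exist yet, which is the expected state before the first preflight.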
</document>

<document path="health-check.md">
Every UltraCode Goal run that reaches Finalize ends with a health check: a brief self-improvement reflection that audits the run for friction, gaps, or bugs in *this module* and, when it finds something, offers to file a structured GitHub issue. The expected outcome is **zero findings** — a clean run exits in a line. This page covers exactly what it sends, the privacy model, how it dedups, and how to turn it off. The deterministic fingerprint and seen-cache plumbing is [`../skills/ultracode-goal/scripts/health_check_fp.py`](../skills/ultracode-goal/scripts/health_check_fp.py).

## What it is, and when it runs

The health check is a reflection step that runs **after every run that reaches Finalize** — a completed Epic, or one that ended in a story escalation. It does **not** run on an early block: a Stage 1 or Stage 2 stop never executed any of the module's run machinery, so there is nothing to audit. Findings are graded into three severities: `bug`, `friction`, and `gap`.

## Exactly what gets sent

A finding is a structured issue with a fixed set of fields plus an Environment table. The Environment table carries:

- Date
- OS
- AI Editor
- Model
- Profile (production / light)
- Run mode (attended / headless)
- Module Version

Explicitly **not** sent: no source code, no Epic content, no secrets. The evidence in a finding is `file:line` citations into **this module's own files** — the skill, its reference stages, and its scripts — never your project's code. A finding is a claim that *the module* could be better, backed by a pointer to the module file that proves it.

## Privacy

Issues are filed publicly on `armelhbobdad/bmad-module-ultracode-goal`. On an **attended** run nothing is sent silently: the health check always **HALTS at a `[Y]` / `[N]` / `[E]` gate** before submitting — yes to file, no to skip, edit to adjust first. You see the finding and approve it before it leaves your machine.

The boundary below shows the disable path, everything computed on your machine, and the two points where anything crosses to GitHub — both behind a gate:

```mermaid
flowchart TD
    Start["Reaches Finalize"] --> Enabled{"health_check_repo set"}
    Enabled -->|"empty"| Off["Log one line and exit. Nothing computed, nothing sent"]
    Enabled -->|"set"| Reflect["Reflect on this run. Grade bug, friction, gap"]
    Reflect --> Clean{"Any findings"}
    Clean -->|"no"| Done["Clean run. Exit in one line"]
    Clean -->|"yes"| Gate{"Approved to submit"}
    Gate -->|"no, queue or skip"| Queue["Write finding to local queue. Stays on disk"]
    Gate -->|"yes"| Fp["Compute fingerprint and check seen-cache"]
    Fp --> Search["Remote dedup search on GitHub"]
    Search --> Create["Create or react on public issue"]

    subgraph local["On your machine"]
        Enabled
        Off
        Reflect
        Clean
        Done
        Gate
        Queue
        Fp
    end
    subgraph net["Leaves the machine"]
        Search
        Create
    end

    classDef accent fill:#6366F1,stroke:#4F46E5,color:#fff
    classDef verdict fill:#4F46E5,stroke:#3730A3,color:#fff
    class Gate accent
    class Create verdict
```

The gate that admits anything to the right of the boundary is the `[Y]` approval on an attended run, or the autosubmit opt-in restricted to `bug`-severity findings on a headless run; the Environment table — never source code, Epic content, or secrets — is all that travels with a filed issue.

## Unattended behavior

Headless runs do not have a human at the gate, so by default they **queue findings locally and never live-submit**. Two config knobs (set in `{project-root}/_bmad/custom/ultracode-goal.toml`) change this:

- `health_check_repo = ""` — disables the health check entirely (the target repo doubles as the on/off switch; no target, nothing to run).
- `health_check_autosubmit = true` — opts **bug-severity** findings into live submission on unattended runs. `friction` and `gap` findings are **never** auto-submitted; they always queue regardless of this setting.
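
The interaction of the two knobs on a headless run can be sketched as a small decision function. This is an illustration of the rules stated above, not the module's code: `headless_action` is a hypothetical name, the config is shown as an already-resolved dict, and the `owner/repo` value format is an assumption.

```python
def headless_action(config, severity):
    """Decide what a headless run does with one finding."""
    if config["health_check_repo"] == "":
        return "disabled"  # no target repo, so the health check does not run
    if severity == "bug" and config["health_check_autosubmit"]:
        return "submit"    # only bug-severity findings can opt in to live submission
    return "queue"         # friction and gap always queue locally


config = {
    "health_check_repo": "armelhbobdad/bmad-module-ultracode-goal",  # assumed value format
    "health_check_autosubmit": True,
}
for severity in ("bug", "friction", "gap"):
    print(severity, "->", headless_action(config, severity))
# bug -> submit
# friction -> queue
# gap -> queue
```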

The local queue lives at the configured `health_check_queue_path` (by default `{project-root}/_bmad-output/ultracode-goal/improvement-queue/`); findings are written one file each, named `hc-ultracode-goal-{stage}-{YYYYMMDD-HHmmss}.md`.
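
As a worked example of the naming pattern, the sketch below builds a queue path for one finding. The helper `queue_file_path` is hypothetical, and whether the real timestamp is local time or UTC is not stated here, so the sketch simply formats whatever `datetime` it is given.

```python
from datetime import datetime
from pathlib import Path


def queue_file_path(queue_dir, stage, when):
    """Build the per-finding file path from the documented name pattern."""
    stamp = when.strftime("%Y%m%d-%H%M%S")
    return Path(queue_dir) / f"hc-ultracode-goal-{stage}-{stamp}.md"


path = queue_file_path(
    "_bmad-output/ultracode-goal/improvement-queue",
    "gate",
    datetime(2026, 6, 4, 9, 30, 5),
)
print(path.name)  # hc-ultracode-goal-gate-20260604-093005.md
```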

## Deduplication

Findings are deduped by a deterministic fingerprint so the same defect does not file twice. The fingerprint (`health_check_fp.py fingerprint`) is:

```
fp-XXXXXXX = "fp-" + sha1("{severity}|ultracode-goal/{stage}|"
                          "skills/ultracode-goal/references/{stage}.md|{section-slug}")[:7]
```

The `step_file` component is always the source-repo form `skills/ultracode-goal/references/{stage}.md` regardless of where the skill is installed, so the same defect dedups to the same key across a dev checkout and an installed `_bmad/` tree. `severity` is one of `bug`/`friction`/`gap`; `stage` is one of the six stage names (`ingest-and-scope`, `preflight`, `define-done`, `execute`, `gate`, `finalize`); `section-slug` is validated kebab-case.
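
The formula translates directly into a few lines of Python. This is a minimal sketch rather than the contents of `health_check_fp.py`: the UTF-8 encoding, the use of the hex digest, and the exact kebab-case pattern are assumptions.

```python
import hashlib
import re

SEVERITIES = {"bug", "friction", "gap"}
STAGES = {"ingest-and-scope", "preflight", "define-done", "execute", "gate", "finalize"}
KEBAB = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*$")  # assumed kebab-case rule


def fingerprint(severity, stage, section_slug):
    """Return 'fp-' plus the first 7 hex chars of the SHA-1 of the dedup key."""
    if severity not in SEVERITIES:
        raise ValueError(f"unknown severity: {severity!r}")
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage!r}")
    if not KEBAB.match(section_slug):
        raise ValueError(f"section slug is not kebab-case: {section_slug!r}")
    # step_file always uses the source-repo form, independent of install location.
    step_file = f"skills/ultracode-goal/references/{stage}.md"
    key = f"{severity}|ultracode-goal/{stage}|{step_file}|{section_slug}"
    return "fp-" + hashlib.sha1(key.encode("utf-8")).hexdigest()[:7]


print(fingerprint("bug", "gate", "verdict-mapping"))
```

Because every input is validated and the path component is fixed, two machines that describe the same defect with the same severity, stage, and slug produce the same key.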

Dedup runs at three levels:

1. **Machine-global seen-cache** at the configured `health_check_seen_cache` (by default `~/.ultracode-goal/health-check-seen.json`) — the `seen` / `record` subcommands check and atomically merge-write this cache (a missing, empty, or corrupt cache is treated as empty, never a crash). Each record carries the issue URL, the action taken (`created` / `reacted` / `commented` / `queued`), and the date.
2. **Remote search** — before filing, the existing issues are searched so a finding already filed by another machine is not duplicated.
3. **Server-side** — a repository Action closes duplicates and upvotes the canonical issue, so even a race between two machines converges on one issue.
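
The first level, a cache that tolerates a missing or damaged file and is written atomically, can be sketched as follows. The function names and the on-disk shape (a JSON object keyed by fingerprint) are assumptions for illustration; only the record fields and the tolerant-read behavior come from the description above.

```python
import json
import os
import tempfile
from pathlib import Path


def load_seen(cache_path):
    """Return the cache as a dict; a missing, empty, or corrupt file reads as empty."""
    try:
        data = json.loads(Path(cache_path).read_text(encoding="utf-8"))
    except (OSError, ValueError):  # ValueError covers JSON and decoding errors
        return {}
    return data if isinstance(data, dict) else {}


def record_seen(cache_path, fp, url, action, date):
    """Merge one record into the cache and write it back atomically."""
    path = Path(cache_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    cache = load_seen(path)
    cache[fp] = {"url": url, "action": action, "date": date}
    # Write to a temp file in the same directory, then rename over the target.
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(cache, handle, indent=2)
    os.replace(tmp, path)
    return cache
```

Writing to a sibling temp file and renaming means a reader sees either the old cache or the new one, never a half-written file.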

## Submitting a queued finding manually

A queued finding sits as a body file under the queue directory. It carries YAML frontmatter (workflow, step_file, severity, fingerprint, date) followed by the same structured body the attended gate would have shown you, including the Environment table and the `file:line` evidence. Submit one when you are ready with:

```bash
gh issue create --repo <health-check-repo> --title "<title>" --body-file <path-to-queued-finding>
```

Nothing is sent until you run this command. You can also open one through the repository's issue chooser at <https://github.com/armelhbobdad/bmad-module-ultracode-goal/issues/new/choose>.
</document>

<document path="how-it-works.md">
UltraCode Goal runs an Epic through six stages, in order. Each stage routes to the next by testable conditions stated in its reference file under [`../skills/ultracode-goal/references/`](../skills/ultracode-goal/references/). This page narrates the stages faithfully, the conditions that move between them, and the headless contract. For the design behind it, see [architecture](architecture.md); for the gate specifically, see the [gate model](gate-model.md).

## The six stages

| # | Stage | Routes by |
|---|-------|-----------|
| 1 | Ingest & Scope | one resolved Epic id, or stop |
| 2 | Preflight | post-remediation budget == 0 and no red, or stop |
| 3 | Define Done | every in-scope story has a red-phase atdd-checklist |
| 4 | Execute | every story committed at green, or a turn-bound escalation |
| 5 | Gate | the `gate_eval.py` verdict: advance / defer / reloop / escalate |
| 6 | Finalize | terminal — report, ledger, memory capture |

The stages run in order, but the edges are conditional — each one only advances on a testable condition, and two of them loop backward on failure. This shows the real routing, including the preflight remediation loop and the gate re-loop:

```mermaid
flowchart TD
    S1["Stage 1 Ingest and Scope"]
    NotBmad["STOP — not a BMAD project"]
    S2["Stage 2 Preflight"]
    Remediate["Auto-remediate then re-run check"]
    Blocked["STOP or blocked — RED or budget gt 0"]
    S3["Stage 3 Define Done"]
    S4["Stage 4 Execute"]
    S5["Stage 5 Gate via gate_eval.py"]
    Correct["bmad-correct-course"]
    S6["Stage 6 Finalize"]

    S1 -->|"config + sprint + epic all absent"| NotBmad
    S1 -->|"one resolved epic id"| S2
    S2 -->|"remediable blocker"| Remediate
    Remediate --> S2
    S2 -->|"RED found or budget gt 0"| Blocked
    S2 -->|"budget == 0 and no RED and ultracode plus Auto Mode on"| S3
    S3 -->|"ATDD hard-halt on vague ACs"| S3
    S3 -->|"every story has red-phase atdd-checklist"| S4
    S4 -->|"every story committed at green"| S5
    S4 -->|"turn-bound escalation"| S5
    S5 -->|"advance or defer"| S6
    S5 -->|"reloop — gate FAIL within budget"| Correct
    Correct --> S4
    S5 -->|"escalate — NOT_EVALUATED or budget exhausted"| S6

    classDef accent fill:#6366F1,stroke:#4F46E5,color:#fff
    classDef verdict fill:#4F46E5,stroke:#3730A3,color:#fff
    classDef stop fill:#9CA3AF,stroke:#6B7280,color:#fff
    class S5 verdict
    class S2 accent
    class NotBmad,Blocked stop
```

A `defer` verdict appends non-blocking items to the ledger and advances anyway; an `escalate` ends the run as `blocked` at Stage 6 rather than `complete`. The reloop edge re-runs the story only while turn and token budget remain — once exhausted, a FAIL becomes an escalate.

### Stage 1 — Ingest & Scope

Resolve **which** Epic this run delivers and lock the profile. The operator names the Epic (or the skill picks the obvious in-flight one from `sprint-status.yaml`); the skill locates the Epic/story files, the PRD, and the ADR/architecture, and records the paths to the run's `.decision-log.md`. This is the cheap stage that prevents an expensive run from targeting the wrong Epic.

The one absence that hard-stops here: if `_bmad/` config **and** `sprint-status.yaml` **and** any Epic are *all* absent, this is not a BMAD project — the skill points at `bmad-bmb-setup` and `bmad-sprint-planning` and stops. A title-only Epic with no stories does **not** stop here (Stage 2 generates the missing stories); an Epic whose stories are all already `done` triggers an "already complete — re-run anyway?" check. If the Epic cannot be resolved to exactly one id, the skill asks rather than guessing. See [`references/ingest-and-scope.md`](../skills/ultracode-goal/references/ingest-and-scope.md).

### Stage 2 — Preflight (the autonomy gate)

This is the load-bearing gate, because after it the run goes unattended. The posture is **hard gate with auto-remediation**:

1. **Mechanical check** — `preflight_check.py` parses tool versions, git state, and file existence and returns a `budget` count of mechanical blockers (test framework absent, dirty tree, on a protected branch, Claude Code below the minimum versions). It does **not** decide semantic intervention.
2. **Auto-remediation pass** — clear each remediable blocker, then re-run the check so `budget` reflects the fixes: scaffold the test framework (`bmad-testarch-framework`), scaffold the CI quality pipeline (`bmad-testarch-ci`, production only, strictly *after* the framework), generate missing acceptance criteria (`bmad-create-story`), pre-create the TEA output dirs, ensure exactly one `project-context.md`, ensure `sprint-status.yaml` is present, force TEA **Create** mode, and prompt once (interactively) for any secrets.
3. **Semantic intervention scan** — the part the script cannot do: read the PRD and ADR for undecided product/architecture decisions, contradictions, acceptance criteria that presuppose an unmade decision, or a story whose "done" is undefinable. Any such item is **RED** and cannot be auto-remediated, because the fix is a human decision.

The run launches **only** when all hold: post-remediation `budget == 0`, the semantic scan found no RED, and ultracode session effort plus Auto Mode are on. Then the skill arms the environment — creates the Epic branch off `epic_branch_prefix`, merges the PreToolUse and Stop hooks into `.claude/settings.local.json` (asserting they are active, and injecting the resolved config into their env), and pre-populates the allowlist. On an attended run it prints the launch briefing and takes one soft confirm. See [`references/preflight.md`](../skills/ultracode-goal/references/preflight.md).
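
The launch condition is a plain conjunction, which a short sketch makes explicit. The `PreflightResult` type and its field names are hypothetical; only the four conditions themselves come from the text above.

```python
from dataclasses import dataclass, field


@dataclass
class PreflightResult:
    budget: int                                        # post-remediation mechanical blockers
    red_findings: list = field(default_factory=list)   # semantic-scan REDs
    ultracode_effort_on: bool = False
    auto_mode_on: bool = False


def launch_blockers(result):
    """Return the reasons the run may not launch; an empty list means launch."""
    reasons = []
    if result.budget != 0:
        reasons.append(f"budget is {result.budget}, not 0")
    if result.red_findings:
        reasons.append(f"{len(result.red_findings)} RED finding(s) need a human decision")
    if not result.ultracode_effort_on:
        reasons.append("ultracode session effort is off")
    if not result.auto_mode_on:
        reasons.append("Auto Mode is off")
    return reasons


ready = PreflightResult(budget=0, ultracode_effort_on=True, auto_mode_on=True)
print(launch_blockers(ready))  # []
```

Returning the reasons rather than a bare boolean mirrors the stop path, where the blockers are what the operator needs to see.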

### Stage 3 — Define Done

Turn the Epic's acceptance criteria into **executable, red-phase acceptance tests** before any production code is written. Once per Epic, `bmad-testarch-test-design` (Epic-Level Mode) builds the risk-and-priority backbone: a risk matrix with scored, mitigated risks; P0–P3 priorities (the gate keys its thresholds to these); and NFR thresholds (unknowns are marked `UNKNOWN` and deferred, never guessed). Then, per in-scope story in sprint order: `bmad-create-story` sharpens the acceptance criteria, and `bmad-testarch-atdd` generates an `atdd-checklist-{story_key}.md` plus acceptance test files with **every test marked `test.skip()`** (TDD red phase). ATDD hard-halts if a story's ACs are vague or the framework is missing; that is the signal to loop back to `bmad-create-story` for that story. Stage 3 is done only when every in-scope story has a story file with clear ACs and a generated atdd-checklist with red-phase tests on disk. See [`references/define-done.md`](../skills/ultracode-goal/references/define-done.md).

### Stage 4 — Execute

Drive each in-scope story from its red-phase tests to a green, committed state. The default is the **sequential `/goal` spine**; per story, in sprint order: set the current story (so the PreToolUse hook can find its marker) → `bmad-dev-story` implements the feature and un-skips the story's ATDD tests → run tests/lint/build and **print the raw output** as evidence → (production) `bmad-testarch-test-review` then `bmad-code-review` → commit at green (one commit per green story). The loop is wrapped in a single `/goal` whose condition encodes the per-story Definition-of-Done and carries the literal "…or stop after N turns" escape clause. The printed evidence keeps the run judgeable mid-flight, but **passing the `/goal` condition is not completion** — the authoritative verdict is Stage 5. The experimental `--parallel` path fans the same per-story loop out across worktree-isolated agents; see [parallel mode](parallel-mode.md). As the spine advances it overwrites a `run-status.json` heartbeat for pollers. See [`references/execute.md`](../skills/ultracode-goal/references/execute.md).

### Stage 5 — Gate

Decide whether a story (or, after the last story, the Epic) advances — by a deterministic artifact read. In production, the skill first backfills the evidence in order — `bmad-testarch-automate`, `bmad-testarch-trace` (which writes the gate decision), `bmad-testarch-nfr` — then runs `gate_eval.py`. The script reads TEA's `gate-decision.json` and returns a verdict the skill executes: `advance` (move to the next story), `defer` (append non-blocking items to the ledger and advance anyway), `reloop` (run `bmad-correct-course`, re-run the story within the remaining budget), or `escalate` (stop). The invariant: **a P0/critical FAIL never defers** — it re-loops within budget or escalates. See the [gate model](gate-model.md) and [`references/gate.md`](../skills/ultracode-goal/references/gate.md).

This is how the verdict is read deterministically — the conductor never grades the work itself, it runs the script and routes on what comes back:

```mermaid
sequenceDiagram
    participant C as Conductor
    participant TEA as TEA trace
    participant G as gate_eval.py
    participant F as gate-decision.json
    C->>TEA: bmad-testarch-trace writes gate decision
    C->>G: run gate_eval.py --trace-output DIR
    G->>F: resolve and read slim file
    alt slim file absent
        G->>F: fall back to e2e-trace-summary.json
    end
    F-->>G: gate_status
    Note over G: PASS or WAIVED to advance, CONCERNS to defer, FAIL to reloop, NOT_EVALUATED to escalate
    Note over G: production only — NFR FAIL or review lt 80 or Block downgrades advance to reloop
    G-->>C: verdict + reasons JSON
    C->>C: route the verdict advance / defer / reloop / escalate
```

The production AND fails closed: a missing or unparseable `nfr-assessment.md` or `test-review.md` is treated as a failing signal, so an otherwise-`advance` story downgrades to `reloop` rather than advancing on evidence the script could not read.
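
The routing rules in this section can be summarized in a small function. This is a sketch of the mapping as described here, not the logic of `gate_eval.py`: the parameter names are hypothetical, `None` stands in for a missing or unparseable artifact, and applying the budget check to every `reloop` (including a downgraded one) is an assumption.

```python
BASE_VERDICT = {
    "PASS": "advance",
    "WAIVED": "advance",
    "CONCERNS": "defer",
    "FAIL": "reloop",
    "NOT_EVALUATED": "escalate",
}
REVIEW_MIN_SCORE = 80  # threshold named in the diagram above


def route_verdict(gate_status, profile="production", nfr_status=None,
                  review_score=None, review_blocked=False, budget_left=True):
    """Map a gate_status plus the production signals to a verdict."""
    if gate_status not in BASE_VERDICT:
        raise ValueError(f"unknown gate_status: {gate_status!r}")
    verdict = BASE_VERDICT[gate_status]
    if profile == "production" and verdict == "advance":
        # Fail closed: an unreadable artifact (None) counts as a failing signal.
        nfr_failed = nfr_status is None or nfr_status == "FAIL"
        review_failed = (review_score is None
                         or review_score < REVIEW_MIN_SCORE
                         or review_blocked)
        if nfr_failed or review_failed:
            verdict = "reloop"
    if verdict == "reloop" and not budget_left:
        verdict = "escalate"  # no budget left to re-run the story
    return verdict


print(route_verdict("PASS", nfr_status="PASS", review_score=92))  # advance
print(route_verdict("PASS", nfr_status="PASS", review_score=None))  # reloop
```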

### Stage 6 — Finalize

Make the run pay off for the next one. Capture learnings deliberately — machine-local quirks to Auto Memory (`remember X`), team standards to the project's CLAUDE.md or `.claude/rules`. Optionally run the retrospective (`--retro`). Audit every `.decision-log.md` entry into the report, the addendum, or explicit process-noise. Produce a `run-report.md` (Epic, profile, per-story outcomes, the Epic-level gate, budget consumed, learnings, a pointer to the ledger), write the terminal `run-status.json`, surface this Epic's deferred-work ledger heading to the user, and fire the `on_epic_complete` hook **only** when the Epic actually advanced. See [`references/finalize.md`](../skills/ultracode-goal/references/finalize.md).

## Production vs. `--light`

The **production** profile wires the full TEA chain as gates: test-design, atdd, automate, test-review, nfr, trace, ci. **`--light`** downscopes to the trace gate only: Stage 5 skips the automate/nfr/test-review backfill and runs only `bmad-testarch-trace`, then `gate_eval.py --profile light`, with no NFR or test-review check ANDed into the verdict. The profile is locked in Stage 1 and read (not re-derived) by Stages 3 and 5.

## The decision log

The run's `.decision-log.md` — held in the skill's run folder — is canonical memory. Compaction can drop everything else; the log recovers full state. It records scope, the preflight verdict, every gate outcome, every deferral, and (in headless) every assumption. **Resume** reads it: on a resumed run, Execute re-enters at the first story whose last logged gate verdict is not `advance`; advanced stories are not re-run, and the Epic branch, hooks, and allowlist are re-asserted (not rebuilt) before continuing.
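
The resume rule reduces to a scan over the stories in sprint order. The sketch assumes the log has already been parsed into a mapping from story key to its last logged gate verdict; that parsed form and the name `resume_story` are hypothetical. It takes the rule literally, so any last verdict other than `advance` (or no verdict at all) is where Execute re-enters.

```python
def resume_story(stories_in_sprint_order, last_verdicts):
    """Return the first story whose last logged verdict is not 'advance'."""
    for story in stories_in_sprint_order:
        if last_verdicts.get(story) != "advance":
            return story
    return None  # every story advanced; nothing left to re-enter


stories = ["1.1", "1.2", "1.3"]
logged = {"1.1": "advance", "1.2": "reloop"}
print(resume_story(stories, logged))  # 1.2
```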

## The run report and deferred-work ledger

At the end, two durable outputs sit beside the decision log. The **run report** (`run-report.md`) is the human takeaway. The **deferred-work ledger** (at `deferred_work_path`) holds one heading per Epic with a row per parked item — only non-gate-blocking work lands here (CONCERNS, non-critical findings, parked decisions); a P0/critical FAIL is never deferred. Finalize surfaces this run's Epic heading so nothing parked is invisible at handoff.

## Headless contract

With `-H`, the run is non-interactive: infer scope, default to production (unless `--light`), never prompt. Every exit point emits **one** object, whether it is a complete run at Stage 6, an early block at Stage 1 (not a BMAD project / Epic unresolved / already complete) or Stage 2 (preflight), or a Stage 6 story escalation. Five keys are always present (`status`, `skill`, `decision_log`, `report`, `deferred_work`), with `null` when an artifact was not produced; the sixth, `reason`, carries a one-line cause and appears only when blocked:

```json
{"status": "complete|blocked",
 "skill": "ultracode-goal",
 "decision_log": "<path to this run's .decision-log.md>",
 "report": "<path to run-report.md, or null>",
 "deferred_work": "<path to deferred-work.md, or null>",
 "reason": "<one line, present only when blocked>"}
```

An automator parses one schema regardless of where the run stopped; a blocked-before-report exit returns `report` and `deferred_work` as `null` rather than omitting them.
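
On the consuming side, an automator might validate the object like this. The sketch is illustrative: `parse_exit_object` is a hypothetical helper, and the sample path and reason text are made up for the example.

```python
import json

ALWAYS_PRESENT = ("status", "skill", "decision_log", "report", "deferred_work")


def parse_exit_object(raw):
    """Parse the headless exit object and check the always-present keys."""
    obj = json.loads(raw)
    missing = [key for key in ALWAYS_PRESENT if key not in obj]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    if obj["status"] not in ("complete", "blocked"):
        raise ValueError(f"unexpected status: {obj['status']!r}")
    return obj


# Illustrative blocked-before-report exit; the values are invented for the example.
raw = ('{"status": "blocked", "skill": "ultracode-goal", '
       '"decision_log": "run/.decision-log.md", "report": null, '
       '"deferred_work": null, "reason": "preflight budget above zero"}')
result = parse_exit_object(raw)
if result["status"] == "blocked":
    print("blocked:", result.get("reason"))
print(result["report"] is None)  # True
```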
</document>

<document path="index.md">
## The problem

You hand an agent an epic and tell it to build until done. It runs, it commits, it declares victory. At review time you learn that "done" meant the model felt done — a story it wrote *about* the work, not a verdict *on* the work.

Autonomous runs that look done are not done. The thing deciding completion only ever sees the transcript; it cannot open the gate file written to disk. A model grading its own output is the weakest possible signal for a release gate, and by default it is the only signal you get.

## The fix

UltraCode Goal does not trust the transcript. It hard-gates the epic *before* launch and reads completion from a file *after* the work — three enforcement layers between "the agent stopped" and "the epic shipped":

- **A preflight gate that fails closed.** The run launches only when `preflight_check.py` returns green *after* its remediation pass, with the intervention budget at zero. A red blocker stops the run; it does not become a question for later.
- **TEA red-phase tests as the Definition-of-Done.** The Test Architect turns each story's acceptance criteria into executable, failing tests *first*, so "done" is a measurable transition from red to green — not prose.
- **A deterministic gate verdict.** A story advances only when `gate_eval.py` reads `PASS` from TEA's `gate-decision.json`. It never re-derives the thresholds and never asks the model. The verdict JSON is the truth, and you can read it yourself.

<div class="verdict-sample"><span class="verdict-sample__label">The completion verdict</span><code class="verdict-sample__chip">gate-decision.json → PASS</code><span class="verdict-sample__check" aria-label="machine-checked">✓</span></div>

If the gate file is missing or unparseable, the contract counts it as a *failing* signal — prose drift degrades to a conservative re-loop, never a silent false-advance.

<p class="cta-pill"><a href="./getting-started/">Install and run your first epic →</a></p>

## What you get

Completion stops being a feeling in the transcript and becomes a fact on disk. Every green story is one git commit on an isolated epic branch — rollback you can actually trust, not a checkpoint that misses Bash changes. The run ends with a delivered, gate-passed epic, a run report, and a deferred-work ledger of anything safely parked for later.

## Read the rest

The docs split into three buckets — **Why** (start here), **Try** (do stuff), and **Reference** (look things up).

**Why**

- [Why UltraCode Goal](./why-ultracode-goal.md) — the problem in depth, the three enforcement layers, and when not to use it.

**Try**

- [Getting Started](./getting-started.md) — prerequisites, install, the flags, and your first autonomous run.
- [How It Works](./how-it-works.md) — the six stages, their routing conditions, and the headless emit shape.
- [Parallel Mode](./parallel-mode.md) — the experimental worktree fan-out and its known limits.

**Reference**

- [Architecture](./architecture.md) — the conductor model, the enforcement layers in depth, and customization resolution.
- [Gate Model](./gate-model.md) — how `gate_eval.py` maps `gate_status` to a verdict, the thresholds, and the fail-closed contract.
- [Health Check](./health-check.md) — the terminal self-improvement reflection: what it sends, the privacy model, and how to disable it.
- [Cross-Session Recall](./cross-session-recall.md) — the optional claude-mem integration and its trust model.
- [Troubleshooting](./troubleshooting.md) — real failure modes and their remediations.
</document>

<document path="parallel-mode.md">
> **Experimental, opt-in.** `--parallel` is an additive execution path. The sequential `/goal` spine ([how it works](how-it-works.md), Stage 4) is the **default and recommended** path. Use `--parallel` only when you understand the known limits below, and expect to fall back to the spine.

When the operator passes `--parallel`, Stage 4 fans the Epic out across worktree-isolated per-story agents instead of driving them one at a time on the spine. This page covers what it does, how concurrency is bounded, and — honestly — where it is not yet validated. It is sourced from [`../skills/ultracode-goal/assets/execute-epic.workflow.js`](../skills/ultracode-goal/assets/execute-epic.workflow.js) and [`references/execute.md`](../skills/ultracode-goal/references/execute.md).

## What it does

`--parallel` invokes the saved dynamic workflow `execute-epic.workflow.js` (registered as `/ultracode-goal-execute`). Each in-scope story runs in its **own git worktree** on its own branch, so concurrent stories never overwrite each other's working tree. Within a worktree the steps are the same as the spine and strictly ordered: `bmad-create-story` (Create mode) → `bmad-dev-story` un-skipping the story's red-phase ATDD tests → run and print tests/lint/build, then write the tests-ran marker → (production) `bmad-testarch-test-review` then `bmad-code-review` → commit at green → per-story `gate_eval.py`. After every story lands, the workflow runs one epic-level trace gate and returns `{ perStory: [{story, verdict, gate_status}], epicGate, deferred: [...] }`, which feeds Stages 5 and 6.

The Epic branch is the trunk every story forks from, and the epic-level gate is where they converge:

```mermaid
flowchart TD
    EB["Epic branch"]
    EB --> WA["Worktree story A"]
    EB --> WB["Worktree story B"]
    EB --> WC["Worktree story N"]
    WA --> GA["Per-story gate_eval.py"]
    WB --> GB["Per-story gate_eval.py"]
    WC --> GC["Per-story gate_eval.py"]
    GA --> EG["Epic-level trace gate"]
    GB --> EG
    GC --> EG
    EG --> RET["Merged result -> Stages 5 and 6"]

    subgraph STEPS["Strict order inside each worktree"]
        direction LR
        S1["create-story"] --> S2["dev-story un-skip ATDD"]
        S2 --> S3["tests, lint, build then marker"]
        S3 --> S4["production review"]
        S4 --> S5["commit at green on story branch"]
    end

    classDef accent fill:#6366F1,stroke:#4F46E5,color:#fff
    classDef verdict fill:#4F46E5,stroke:#3730A3,color:#fff
    class EB accent
    class EG,RET verdict
```

Stories are batched to `parallel_max_concurrency` (default 8): each batch fans out in parallel, batches run sequentially, and a worktree commit on its own story branch is the per-story unit of work. The "merge back" is the epic-level trace gate consolidating the landed branches into one verdict object — not a mid-run interactive step, since the fan-out takes no input once launched.

Critically, this path **shares the sequential spine's truth sources**: the same `gate_eval.py` reading TEA's `gate-decision.json` (never the model, never the transcript-only `/goal` evaluator), and the same PreToolUse + Stop hooks merged at preflight enforce the invariants. The verdict mapping is owned by `gate_eval.py`; the spawned agents return its stdout fields verbatim and must not recompute TEA thresholds. See the [gate model](gate-model.md).

## Concurrency

The cap on simultaneous worktree agents is `parallel_max_concurrency` — **default 8** in `customize.toml`, chosen under the platform's 16-concurrent ceiling. Stories are batched: each concurrency-sized batch runs in parallel; batches run sequentially. The literal `4` inside the `.js` is only a fallback when the skill invokes the workflow without supplying `max_concurrency`; the governing value is the passed `parallel_max_concurrency`.
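
The batching rule is simple enough to show directly. The real workflow is a JavaScript asset; this Python sketch only illustrates the slicing, and the function name `batch_stories` is hypothetical.

```python
def batch_stories(stories, max_concurrency=8):
    """Split stories into concurrency-sized batches, preserving sprint order."""
    if max_concurrency < 1:
        raise ValueError("max_concurrency must be at least 1")
    return [stories[i:i + max_concurrency]
            for i in range(0, len(stories), max_concurrency)]


stories = [f"story-{n}" for n in range(1, 21)]
for number, batch in enumerate(batch_stories(stories), start=1):
    # Each batch would fan out in parallel; batches themselves run one after another.
    print(f"batch {number}: {len(batch)} stories")
# batch 1: 8 stories
# batch 2: 8 stories
# batch 3: 4 stories
```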

## No mid-run input

The fan-out takes **no interactive input once launched** — every gate and every blocker must be resolved at preflight or not at all. This is exactly why the [preflight](how-it-works.md) hard gate requires a post-remediation budget of zero before launch: there is no opportunity to answer a question mid-run, so a run that would have needed an answer must refuse to launch instead.

## Known limits — be honest

This path leans on workflow↔skill interplay the platform docs leave under-specified. Treat its behavior as empirically validated, not guaranteed:

- **Shared Auto Memory across worktrees.** All worktrees of one git repo share a single Auto Memory directory — there is no per-worktree isolation of learned facts. Concurrent writers can collide and interleave; expect interleaving rather than clean per-story memory.
- **Under-documented workflow↔skill interplay.** How args bind, and how the spawned subagents inherit the allowlist and the hooks, is not fully specified by the platform docs. The skill threads the resolved `skill_root` into the workflow args so the spawned agents get an absolute `gate_eval.py` path (the runtime has no `{skill-root}` resolver), but the broader handoff is treated as empirically validated, not guaranteed end to end.
- **No `run-status.json` heartbeat.** Worktree agents each see their own copy of `implementation_artifacts`, so this path cannot reliably write one shared snapshot — it does not write `run-status.json`. Watch progress via the workflow progress view (`/workflows`) and its run log instead; the launch briefing says so.

## Graceful degradation

If dynamic workflows are unavailable — wrong Claude Code version, the workflows feature off, or the saved command does not resolve — the skill **falls back to the sequential `/goal` spine** and logs a one-line note in `.decision-log.md` recording why `--parallel` degraded. The Epic still ships; it just ships sequentially. This is the safety net behind the recommendation to treat the spine as the default: choosing `--parallel` never risks the Epic, only the mode.

See [troubleshooting](troubleshooting.md) for what to check when `--parallel` does not behave, and [architecture](architecture.md) for how the workflow asset fits the conductor model.
</document>

<document path="troubleshooting.md">
Real failure modes, sourced from the skill's stage files and scripts, with what the run does about each and what you do. For the design behind these behaviors see [how it works](how-it-works.md), [the gate model](gate-model.md), and [architecture](architecture.md).

Start from the symptom you observed and follow it to the section that explains it:

```mermaid
flowchart TD
    S["What did you observe"] --> P{"Run never launched"}
    P -->|"yes"| PRE["See: Preflight can't reach budget-zero"]
    P -->|"no"| G{"Stopped at a gate, status NOT_EVALUATED or escalate"}
    G -->|"yes"| GATE["See: gate_eval reports blocked / escalate"]
    G -->|"no"| H{"A commit slipped past a guard"}
    H -->|"yes"| HOOK["See: Hooks not firing"]
    H -->|"no"| B{"A story stopped re-looping, escalation marker appeared"}
    B -->|"yes"| BUD["See: Budget exhausted mid-story"]
    B -->|"no"| R{"Run was interrupted, want to continue"}
    R -->|"yes"| RES["See: Resume after an interruption"]
    R -->|"no"| PAR["See: --parallel issues"]
    classDef accent fill:#6366F1,stroke:#4F46E5,color:#fff
    class PRE,GATE,HOOK,BUD,RES,PAR accent
```

## Preflight can't reach budget-zero

**Symptom.** The run stops at Stage 2 (or, headless, emits `{"status":"blocked", ...}` with a one-line `reason`) and never launches.

**What happened.** Preflight is a hard gate: the run launches only when the post-remediation mechanical `budget == 0` **and** the semantic scan found no RED. The auto-remediation pass clears the *fixable* mechanical blockers — it scaffolds the test framework, generates missing acceptance criteria, pre-creates the TEA output dirs, ensures one `project-context.md`, ensures `sprint-status.yaml`, and prompts once (interactively) for secrets — then re-runs the check.

**What stays red.** Some things the remediation pass cannot fix, by design:

- **Undecided product or architecture decisions.** An open question, a "TBD"/"TODO: decide" on a load-bearing requirement, a PRD↔ADR contradiction, or a story whose "done" is undefinable. The fix is a human decision; an unattended run guessing it produces confidently wrong work. Resolve the decision in the artifacts, then re-run.
- **Missing secrets in headless.** Headless never prompts, so a secret that cannot be resolved becomes a RED blocker rather than a question. Provide the secret (out of git) before the headless run, or run attended so preflight can prompt once.
- **Claude Code below the minimum versions.** The primitive-version blocker is marked non-remediable — the script can't upgrade the host. Update Claude Code.

The decision log carries the full blocker list with what each needs to clear. Read it, clear the items, re-run.
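
A minimal sketch of the launch condition, assuming the mechanical `budget` is simply the count of blockers left after remediation (the text does not define it further). `Blocker`, `remediate`, and `can_launch` are hypothetical names used for illustration, not the preflight script's API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Blocker:
    name: str
    remediable: bool  # False for e.g. an undecided product decision


def remediate(blockers: List[Blocker]) -> List[Blocker]:
    """Stand-in for the auto-remediation pass: fixable blockers clear, the rest stay."""
    return [b for b in blockers if not b.remediable]


def can_launch(mechanical_blockers: List[Blocker], semantic_red: List[str]) -> bool:
    """Launch only when the mechanical budget is zero AND the semantic scan found no RED."""
    budget = len(mechanical_blockers)  # assumption: budget == number of open blockers
    return budget == 0 and not semantic_red


before = [
    Blocker("test framework missing", remediable=True),
    Blocker("Claude Code below minimum version", remediable=False),
]
after = remediate(before)
print(can_launch(after, semantic_red=[]))  # False: one non-remediable blocker remains
```

Both conditions are required: a clean mechanical pass with one RED semantic finding still blocks the launch.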

## gate_eval reports blocked / escalate on a missing gate-decision.json

**Symptom.** Stage 5 returns `gate_status: NOT_EVALUATED` and verdict `escalate`, with a `reasons` entry like `neither gate-decision.json nor e2e-trace-summary.json present in <dir>`.

**What happened.** `gate_eval.py` reads TEA's gate artifact from the trace output directory. `NOT_EVALUATED` means neither the slim `gate-decision.json` nor the fallback `e2e-trace-summary.json` was found there, or the run carried no gate fields. Almost always this is one of:

- **The TEA trace gate did not run.** In production, Stage 5 must backfill evidence first (`bmad-testarch-automate` → `bmad-testarch-trace` → `bmad-testarch-nfr`) before the gate; `bmad-testarch-trace` is what writes the gate decision. If it didn't run, there is nothing to read.
- **Wrong `trace_output_dir`.** The script reads the directory passed as `--trace-output` (resolved from `{workflow.trace_output_dir}`). If TEA wrote elsewhere — or the output dirs were never pre-created at preflight — the artifact is real but in a different place. Confirm `trace_output_dir` matches where TEA actually wrote.

Note this is fail-closed on purpose: a missing or unreadable gate artifact escalates rather than being assumed green. The slim file's *absence alone* is not the problem — the script falls back to the summary, and that fallback is explicitly not a failure.

How `gate_eval.py` resolves the artifact and why only the dead-end branch escalates:

```mermaid
flowchart TD
    A["Read trace-output dir"] --> B{"gate-decision.json present"}
    B -->|"yes"| OK["Read gate_status"]
    B -->|"no, normal"| C{"e2e-trace-summary.json present"}
    C -->|"yes, fallback not a failure"| OK
    C -->|"no"| NE["gate_status NOT_EVALUATED"]
    OK --> V{"gate_status value"}
    V -->|"PASS or WAIVED"| ADV["advance"]
    V -->|"CONCERNS"| DEF["defer"]
    V -->|"FAIL"| REL["reloop"]
    V -->|"NOT_EVALUATED"| ESC["escalate"]
    NE --> ESC
    classDef verdict fill:#4F46E5,stroke:#3730A3,color:#fff
    class ADV,DEF,REL,ESC verdict
```
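
The same resolution, sketched in Python. The file names and the status-to-verdict mapping come from this page; the assumption that both files expose a top-level `gate_status` key, and the function names `read_gate_status` and `route`, are illustrative and may not match `gate_eval.py`.

```python
import json
from pathlib import Path
from typing import Tuple

# Status-to-verdict mapping as described on this page.
VERDICTS = {
    "PASS": "advance",
    "WAIVED": "advance",
    "CONCERNS": "defer",
    "FAIL": "reloop",
    "NOT_EVALUATED": "escalate",
}


def read_gate_status(trace_output_dir: Path) -> Tuple[str, str]:
    """Return (gate_status, reason). Fail closed: anything unreadable is NOT_EVALUATED."""
    for name in ("gate-decision.json", "e2e-trace-summary.json"):
        path = trace_output_dir / name
        if not path.is_file():
            continue  # the slim file being absent is normal; try the fallback
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError) as exc:
            return "NOT_EVALUATED", f"{name} unreadable: {exc}"
        # Assumption: a top-level "gate_status" key in either file.
        status = data.get("gate_status") if isinstance(data, dict) else None
        if status in VERDICTS:
            return status, f"read from {name}"
        return "NOT_EVALUATED", f"{name} carries no recognised gate_status"
    return "NOT_EVALUATED", (
        f"neither gate-decision.json nor e2e-trace-summary.json present in {trace_output_dir}"
    )


def route(trace_output_dir: Path) -> Tuple[str, str, str]:
    """Return (gate_status, verdict, reason) for one trace output directory."""
    status, reason = read_gate_status(trace_output_dir)
    return status, VERDICTS[status], reason
```

Note that no branch defaults to `advance`: a wrong `trace_output_dir` and a gate that never ran both end in `escalate`, which is why checking the directory is the first troubleshooting step.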

## Hooks not firing

**Symptom.** A commit lands on a protected branch, or a commit lands before a story's tests ran — the invariants the PreToolUse hook should enforce did not block.

**What to check.**

- **Older Claude Code.** The hook returns a `deny` decision in the hook JSON and *also* exits 2 with the reason on stderr precisely so older clients that ignore the JSON still block. If neither path fired, the client may not be honoring PreToolUse hooks at all — update Claude Code.
- **`settings.local.json` not merged.** The hooks are merged into `{project-root}/.claude/settings.local.json` at preflight, and the skill asserts they are active before going unattended. If the file wasn't merged (or the workspace trust dialog wasn't accepted), the hooks aren't loaded. Re-run preflight; verify the two hook entries are present in the resolved settings.
- **A `customize.toml` override that silently no-ops.** Both hooks read config from env first and fall back to hardcoded defaults (`main`/`master`, `25`, `1_500_000`, `ultracode/epic-`). A `protected_branches` or budget override in `customize.toml` only reaches the hook if preflight injected it into the hook env (`ULTRACODE_PROTECTED_BRANCHES`, etc.). If your custom protected branch isn't being guarded, the override didn't reach the enforcement layer — confirm preflight passed it through.
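
The env-first resolution in the last bullet explains why an override can silently no-op. The sketch below assumes `ULTRACODE_PROTECTED_BRANCHES` is a comma-separated list (the format is not stated on this page); `protected_branches` and `deny_reason` are hypothetical helpers, not the hook's real functions.

```python
import os
from typing import Mapping, Optional, Tuple

DEFAULT_PROTECTED = ("main", "master")  # the hardcoded fallback named above


def protected_branches(env: Mapping[str, str] = os.environ) -> Tuple[str, ...]:
    """Env first, hardcoded defaults second. Comma-separated format is an assumption."""
    raw = env.get("ULTRACODE_PROTECTED_BRANCHES", "")
    names = tuple(part.strip() for part in raw.split(",") if part.strip())
    return names or DEFAULT_PROTECTED


def deny_reason(branch: str, tests_ran: bool,
                env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return a reason to deny the commit, or None to allow it."""
    if branch in protected_branches(env):
        return f"commit on protected branch '{branch}'"
    if not tests_ran:
        return "story tests have not run"
    return None


# Override lives only in customize.toml and never reached the hook env:
print(deny_reason("release", tests_ran=True, env={}))  # None
# Override injected into the hook env by preflight:
print(deny_reason("release", tests_ran=True,
                  env={"ULTRACODE_PROTECTED_BRANCHES": "main,release"}))
```

In the first call the hook falls back to `main`/`master` and lets the commit through, which is exactly the symptom to look for when a custom protected branch is not being guarded.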

## Budget exhausted mid-story

**Symptom.** A story stops re-looping; an escalation marker (`<impl-artifacts>/.escalation-<story>.md`) appears; the run surfaces a budget message.

**What happened.** A runaway story is bounded three ways. The real in-loop bound is the literal "…or stop after N turns" clause inside the `/goal` condition. The gate's re-loop budget is deterministic: a `reloop` that would exceed `max_turns_per_story` or `story_token_budget` becomes an `escalate` instead. The **Stop** hook (`budget_stop.py`) is the defensive third layer — it counts turns and tokens and, on overrun, writes the escalation marker and lets the stop proceed.

**Its documented limitation.** A Stop hook fires only when Claude is *already* trying to stop — it **cannot interrupt a `/goal` condition mid-turn**. So at this layer the ceiling is advisory; the hard bounds are the in-condition turn clause and the gate re-loop budget. If a story keeps consuming budget, that is the signal to re-scope, split, or hand it off — not to raise the budget and hope.
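
The deterministic gate bound can be sketched as follows. Two things here are assumptions: that the hardcoded defaults `25` and `1_500_000` correspond to `max_turns_per_story` and `story_token_budget`, and that "would exceed" means no headroom is left at the time of the verdict. `apply_reloop_budget` is a hypothetical name.

```python
def apply_reloop_budget(verdict: str, turns_used: int, tokens_used: int,
                        max_turns_per_story: int = 25,
                        story_token_budget: int = 1_500_000) -> str:
    """Downgrade a reloop to escalate once either budget has no headroom left."""
    if verdict != "reloop":
        return verdict  # advance, defer, and escalate pass through unchanged
    # Assumption: reaching the ceiling counts as "would exceed" on the next attempt.
    if turns_used >= max_turns_per_story or tokens_used >= story_token_budget:
        return "escalate"
    return "reloop"


print(apply_reloop_budget("reloop", turns_used=10, tokens_used=200_000))  # reloop
print(apply_reloop_budget("reloop", turns_used=25, tokens_used=200_000))  # escalate
```

Because this check runs in the gate script rather than in a hook, it does not depend on Claude attempting to stop, which is what makes it a hard bound where the Stop hook is only advisory.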

## Resume after an interruption

**Symptom.** A run was interrupted (Ctrl-C, a crash, a compaction) and you want to continue rather than restart.

**What happens.** The run's `.decision-log.md` is canonical memory and recovers full state regardless of compaction. On resume the skill surfaces the existing log with its last session date and offers to resume. Execute re-enters at the **first story whose last logged gate verdict is not `advance`**; already-advanced stories are not re-run. The Epic branch, hooks, and allowlist are **re-asserted, not rebuilt**, before continuing. You do not need to reconstruct state by hand — point the skill at the same Epic and accept the resume offer.
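
The re-entry rule can be illustrated with a short sketch. The on-disk format of `.decision-log.md` is not described here, so the log is modeled as an already-parsed list of `(story, verdict)` pairs in chronological order; that representation and the name `resume_point` are assumptions.

```python
from typing import Dict, Iterable, List, Optional, Tuple


def resume_point(stories: List[str], log: Iterable[Tuple[str, str]]) -> Optional[str]:
    """Return the first story whose last logged verdict is not 'advance', else None."""
    last: Dict[str, str] = {}
    for story, verdict in log:
        last[story] = verdict  # later entries overwrite earlier ones
    for story in stories:  # Epic order, not log order
        if last.get(story) != "advance":
            return story
    return None  # every story advanced; nothing left to run


stories = ["1.1", "1.2", "1.3"]
log = [("1.1", "reloop"), ("1.1", "advance"), ("1.2", "reloop")]
print(resume_point(stories, log))  # 1.2
```

Story `1.1` re-looped once but its last verdict is `advance`, so it is skipped; `1.2` was interrupted after a `reloop`, so execution re-enters there.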

## `--parallel` issues

`--parallel` is experimental and opt-in; the sequential spine is the default. If dynamic workflows are unavailable (wrong Claude Code version, the feature off, or the saved command doesn't resolve), the skill **automatically falls back to the spine** and logs why in `.decision-log.md` — the Epic still ships. For the known limits (shared Auto Memory across worktrees, the under-documented workflow↔skill interplay, no `run-status.json` heartbeat), see [parallel mode](parallel-mode.md).
</document>

<document path="why-ultracode-goal.md">
Autonomous coding agents have one failure mode that dwarfs the rest: a run that **looks** done and isn't. The agent reports green, the transcript reads clean, and the Epic ships with a P0 acceptance criterion silently unmet. UltraCode Goal exists to make that specific failure impossible — by deciding completion from a deterministic artifact a script reads, not from anything the model itself produces. This page explains the problem, the three enforcement layers that answer it, and when the module is the wrong tool.

## The problem

Three intuitive shortcuts all push an unattended Epic toward false completion, and each one is wrong for a documented mechanical reason.

**The evaluator only sees the transcript.** Claude Code's `/goal` mode drives a loop until a success condition is met, then asks an evaluator to confirm. But that evaluator reads the transcript — it cannot open a file on disk. It cannot read TEA's `gate-decision.json`. So if you let the `/goal` condition be the final word on "is this story done," you have asked the model to grade its own homework from its own notes. The model cannot be the judge of its own completion: the thing that decides "done" has to be something it cannot author.

The contrast is what the two judges read — the transcript-only evaluator versus the script that opens the artifact on disk:

```mermaid
flowchart TD
    work["Story implementation"] --> tx["Transcript: printed evidence the model authored"]
    work --> art["gate-decision.json: artifact TEA wrote to disk"]
    tx --> ev["/goal evaluator"]
    ev --> feels["Verdict: looks done"]
    art --> ge["gate_eval.py reads the artifact"]
    ge --> checked["Verdict: machine-checked done"]
    feels -.->|"can be talked into green"| risk["False completion"]
    classDef accent fill:#6366F1,stroke:#4F46E5,color:#fff
    classDef verdict fill:#4F46E5,stroke:#3730A3,color:#fff
    class art,ge accent
    class checked verdict
```

**`/rewind` checkpoints miss Bash changes.** The obvious undo for a runaway agent is `/rewind`. But its checkpoints do not capture changes made through Bash — and an autonomous Epic run is mostly Bash: test commands, lint, build, and the git commits themselves. Rolling back to a checkpoint leaves the working tree's Bash-driven mutations in place. The real undo for this kind of run is git: an Epic branch off a protected branch, one commit per green story, worktree isolation. See the [gate model](gate-model.md) and [architecture](architecture.md) for how this is wired.

**Memory is context, not enforcement.** It is tempting to encode the run's invariants ("never commit on `main`", "never commit before tests pass") into Auto Memory or a CLAUDE.md rule and trust the model to honor them. But memory is context the model may or may not weigh — it does not *block* a `git commit`. An invariant that lives only in memory is a suggestion. An invariant that must hold has to live somewhere the model cannot talk its way past.

## The thesis: three enforcement layers

UltraCode Goal answers each shortcut with a layer the model cannot override.

**1. Deterministic gate truth.** Completion is decided by `scripts/gate_eval.py`, which reads TEA's `gate-decision.json` and maps its gate status to a routing verdict — `PASS`/`WAIVED` advance, `CONCERNS` defers, `FAIL` re-loops, `NOT_EVALUATED` escalates. The script never re-derives TEA's thresholds and never consults the transcript; it reads the artifact as given. The model produces evidence; the script reads the verdict. They are different things on purpose. See the [gate model](gate-model.md).

**2. Hooks as invariants.** The two invariants that must hold — no commit on a protected branch, no commit before a story's tests have actually run green — live in a **PreToolUse** hook, not in memory. The hook inspects each `git commit`/`git push` and denies it when the branch is protected or the story's tests-ran marker is absent. It is merged into `.claude/settings.local.json` at preflight and asserted active before the run goes unattended. A denied commit is enforcement; a remembered rule is not. See [architecture](architecture.md).

**3. Budget enforcement.** A runaway story is bounded three ways. The `/goal` condition carries a literal "…or stop after N turns" escape clause, which is the real in-loop bound because a Stop hook cannot interrupt a `/goal` condition mid-turn. A **Stop** hook tracks turns and tokens against `max_turns_per_story` / `story_token_budget` and writes an escalation marker when either is breached. The gate's re-loop budget is the third, deterministic bound: a `reloop` that would exceed the turn or token budget becomes an `escalate` instead. See [how it works](how-it-works.md).

Each layer answers one shortcut, and each lives somewhere the model cannot author or override:

```mermaid
flowchart LR
    subgraph S1["Shortcut 1: evaluator judges from notes"]
        sc1["Transcript-only /goal verdict"]
    end
    subgraph S2["Shortcut 2: invariant in memory"]
        sc2["Remembered rule, never blocks"]
    end
    subgraph S3["Shortcut 3: trust the loop to stop"]
        sc3["Unbounded re-loop"]
    end
    sc1 --> L1["Layer 1: gate_eval.py reads gate-decision.json"]
    sc2 --> L2["Layer 2: PreToolUse hook in settings.local.json"]
    sc3 --> L3["Layer 3: turn cap, Stop hook, re-loop budget"]
    L1 --> R1["Status maps to verdict: advance, defer, reloop, escalate"]
    L2 --> R2["Denies commit on protected branch or before tests ran"]
    L3 --> R3["Over-budget reloop becomes escalate"]
    classDef accent fill:#6366F1,stroke:#4F46E5,color:#fff
    classDef verdict fill:#4F46E5,stroke:#3730A3,color:#fff
    class L1,L2,L3 accent
    class R1,R2,R3 verdict
```

These three are the module's non-negotiables. They exist because the documented mechanics make the intuitive shortcut wrong, so the skill does not optimize them away.

## When not to use it

UltraCode Goal is narrow on purpose. It is the wrong tool when:

- **There is no BMAD project.** It needs `_bmad/` config, a `sprint-status.yaml`, or an Epic to target. With none of the three present, Stage 1 hard-stops and points you at `bmad-bmb-setup` and `bmad-sprint-planning`. See [getting started](getting-started.md).
- **There are no Epics or stories.** The unit of delivery is one Epic with acceptance-bearing stories. There is nothing for the gate to read against a loose task list.
- **The work is exploratory.** If the Definition-of-Done is genuinely undecided — open product or architecture questions, "TBD" on a load-bearing requirement — preflight refuses to launch rather than let an unattended run guess. That refusal is correct; resolve the decisions first, then run.
- **You want interactive pairing.** The whole point is that the human leaves the loop after preflight. If you want to review each step, drive the underlying BMAD skills (`bmad-dev-story`, `bmad-code-review`, the TEA workflows) directly instead.
</document>