Skip to content

How It Works

UltraCode Goal runs an Epic through six stages, in order. Each stage routes to the next by testable conditions stated in its reference file under ../skills/ultracode-goal/references/. This page narrates the stages faithfully, the conditions that move between them, and the headless contract. For the design behind it, see architecture; for the gate specifically, see the gate model.

#StageRoutes by
1Ingest & Scopeone resolved Epic id, or stop
2Preflightpost-remediation budget == 0 and no red, or stop
3Define Doneevery in-scope story has a red-phase atdd-checklist
4Executeevery story committed at green, or a turn-bound escalation
5Gatethe gate_eval.py verdict: advance / defer / reloop / escalate
6Finalizeterminal — report, ledger, memory capture

The stages run in order, but the edges are conditional — each one only advances on a testable condition, and two of them loop backward on failure. This shows the real routing, including the preflight remediation loop and the gate re-loop:

flowchart TD
    S1["Stage 1 Ingest and Scope"]
    NotBmad["STOP — not a BMAD project"]
    S2["Stage 2 Preflight"]
    Remediate["Auto-remediate then re-run check"]
    Blocked["STOP or blocked — RED or budget gt 0"]
    S3["Stage 3 Define Done"]
    S4["Stage 4 Execute"]
    S5["Stage 5 Gate via gate_eval.py"]
    Correct["bmad-correct-course"]
    S6["Stage 6 Finalize"]

    S1 -->|"config + sprint + epic all absent"| NotBmad
    S1 -->|"one resolved epic id"| S2
    S2 -->|"remediable blocker"| Remediate
    Remediate --> S2
    S2 -->|"RED found or budget gt 0"| Blocked
    S2 -->|"budget == 0 and no RED and ultracode plus Auto Mode on"| S3
    S3 -->|"ATDD hard-halt on vague ACs"| S3
    S3 -->|"every story has red-phase atdd-checklist"| S4
    S4 -->|"every story committed at green"| S5
    S4 -->|"turn-bound escalation"| S5
    S5 -->|"advance or defer"| S6
    S5 -->|"reloop — gate FAIL within budget"| Correct
    Correct --> S4
    S5 -->|"escalate — NOT_EVALUATED or budget exhausted"| S6

    classDef accent fill:#6366F1,stroke:#4F46E5,color:#fff
    classDef verdict fill:#4F46E5,stroke:#3730A3,color:#fff
    classDef stop fill:#9CA3AF,stroke:#6B7280,color:#fff
    class S5 verdict
    class S2 accent
    class NotBmad,Blocked stop

A defer verdict appends non-blocking items to the ledger and advances anyway; an escalate ends the run as blocked at Stage 6 rather than complete. The reloop edge re-runs the story only while turn and token budget remain — once exhausted, a FAIL becomes an escalate.

Resolve which Epic this run delivers and lock the profile. The operator names the Epic (or the skill picks the obvious in-flight one from sprint-status.yaml); the skill locates the Epic/story files, the PRD, and the ADR/architecture, and records the paths to the run’s .decision-log.md. This is the cheap stage that prevents an expensive run from targeting the wrong Epic.

The one absence that hard-stops here: if _bmad/ config and sprint-status.yaml and any Epic are all absent, this is not a BMAD project — the skill points at bmad-bmb-setup and bmad-sprint-planning and stops. A title-only Epic with no stories does not stop here (Stage 2 generates the missing stories); an Epic whose stories are all already done triggers an “already complete — re-run anyway?” check. If the Epic cannot be resolved to exactly one id, the skill asks rather than guessing. See references/ingest-and-scope.md.

This is the load-bearing gate, because after it the run goes unattended. The posture is hard gate with auto-remediation:

  1. Mechanical checkpreflight_check.py parses tool versions, git state, and file existence and returns a budget count of mechanical blockers (test framework absent, dirty tree, on a protected branch, Claude Code below the minimum versions). It does not decide semantic intervention.
  2. Auto-remediation pass — clear each remediable blocker, then re-run the check so budget reflects the fixes: scaffold the test framework (bmad-testarch-framework), scaffold the CI quality pipeline (bmad-testarch-ci, production only, strictly after the framework), generate missing acceptance criteria (bmad-create-story), pre-create the TEA output dirs, ensure exactly one project-context.md, ensure sprint-status.yaml is present, force TEA Create mode, and prompt once (interactively) for any secrets.
  3. Semantic intervention scan — the part the script cannot do: read the PRD and ADR for undecided product/architecture decisions, contradictions, acceptance criteria that presuppose an unmade decision, or a story whose “done” is undefinable. Any such item is RED and cannot be auto-remediated, because the fix is a human decision.

The run launches only when all hold: post-remediation budget == 0, the semantic scan found no RED, and ultracode session effort plus Auto Mode are on. Then the skill arms the environment — creates the Epic branch off epic_branch_prefix, merges the PreToolUse and Stop hooks into .claude/settings.local.json (asserting they are active, and injecting the resolved config into their env), and pre-populates the allowlist. On an attended run it prints the launch briefing and takes one soft confirm. See references/preflight.md.

Turn the Epic’s acceptance criteria into executable, red-phase acceptance tests before any production code is written. Once per Epic, bmad-testarch-test-design (Epic-Level Mode) builds the risk-and-priority backbone: a risk matrix with scored, mitigated risks; P0–P3 priorities (the gate keys its thresholds to these); and NFR thresholds (unknowns are marked UNKNOWN and deferred, never guessed). Then, per in-scope story in sprint order: bmad-create-story sharpens the acceptance criteria, and bmad-testarch-atdd generates an atdd-checklist-{story_key}.md plus acceptance test files every test marked test.skip() (TDD red phase). ATDD hard-halts if a story’s ACs are vague or the framework is missing — that is the signal to loop back to bmad-create-story for that story. Stage 3 is done only when every in-scope story has a story file with clear ACs and a generated atdd-checklist with red-phase tests on disk. See references/define-done.md.

Drive each in-scope story from its red-phase tests to a green, committed state. The default is the sequential /goal spine; per story, in sprint order: set the current story (so the PreToolUse hook can find its marker) → bmad-dev-story implements the feature and un-skips the story’s ATDD tests → run tests/lint/build and print the raw output as evidence → (production) bmad-testarch-test-review then bmad-code-review → commit at green (one commit per green story). The loop is wrapped in a single /goal whose condition encodes the per-story Definition-of-Done and carries the literal “…or stop after N turns” escape clause. The printed evidence keeps the run judgeable mid-flight, but passing the /goal condition is not completion — the authoritative verdict is Stage 5. The experimental --parallel path fans the same per-story loop out across worktree-isolated agents; see parallel mode. As the spine advances it overwrites a run-status.json heartbeat for pollers. See references/execute.md.

Decide whether a story (or, after the last story, the Epic) advances — by a deterministic artifact read. In production, the skill first backfills the evidence in order — bmad-testarch-automate, bmad-testarch-trace (which writes the gate decision), bmad-testarch-nfr — then runs gate_eval.py. The script reads TEA’s gate-decision.json and returns a verdict the skill executes: advance (move to the next story), defer (append non-blocking items to the ledger and advance anyway), reloop (run bmad-correct-course, re-run the story within the remaining budget), or escalate (stop). The invariant: a P0/critical FAIL never defers — it re-loops within budget or escalates. See the gate model and references/gate.md.

This is how the verdict is read deterministically — the conductor never grades the work itself, it runs the script and routes on what comes back:

sequenceDiagram
    participant C as Conductor
    participant TEA as TEA trace
    participant G as gate_eval.py
    participant F as gate-decision.json
    C->>TEA: bmad-testarch-trace writes gate decision
    C->>G: run gate_eval.py --trace-output DIR
    G->>F: resolve and read slim file
    alt slim file absent
        G->>F: fall back to e2e-trace-summary.json
    end
    F-->>G: gate_status
    Note over G: PASS or WAIVED to advance, CONCERNS to defer, FAIL to reloop, NOT_EVALUATED to escalate
    Note over G: production only — NFR FAIL or review lt 80 or Block downgrades advance to reloop
    G-->>C: verdict + reasons JSON
    C->>C: route the verdict advance / defer / reloop / escalate

The production AND fails closed: a missing or unparseable nfr-assessment.md or test-review.md is treated as a failing signal, so an otherwise-advance story downgrades to reloop rather than advancing on evidence the script could not read.

Make the run pay off for the next one. Capture learnings deliberately — machine-local quirks to Auto Memory (remember X), team standards to the project’s CLAUDE.md or .claude/rules. Optionally run the retrospective (--retro). Audit every .decision-log.md entry into the report, the addendum, or explicit process-noise. Produce a run-report.md (Epic, profile, per-story outcomes, the Epic-level gate, budget consumed, learnings, a pointer to the ledger), write the terminal run-status.json, surface this Epic’s deferred-work ledger heading to the user, and fire the on_epic_complete hook only when the Epic actually advanced. See references/finalize.md.

The production profile wires the full TEA chain as gates: test-design, atdd, automate, test-review, nfr, trace, ci. --light downscopes to the trace gate only — Stage 5 skips automate/nfr/test-review backfill and runs only bmad-testarch-trace, then gate_eval.py --profile light, with no NFR/review AND. The profile is locked in Stage 1 and read (not re-derived) by Stages 3 and 5.

The run’s .decision-log.md — held in the skill’s run folder — is canonical memory. Compaction can drop everything else; the log recovers full state. It records scope, the preflight verdict, every gate outcome, every deferral, and (in headless) every assumption. Resume reads it: on a resumed run, Execute re-enters at the first story whose last logged gate verdict is not advance; advanced stories are not re-run, and the Epic branch, hooks, and allowlist are re-asserted (not rebuilt) before continuing.

At the end, two durable outputs sit beside the decision log. The run report (run-report.md) is the human takeaway. The deferred-work ledger (at deferred_work_path) holds one heading per Epic with a row per parked item — only non-gate-blocking work lands here (CONCERNS, non-critical findings, parked decisions); a P0/critical FAIL is never deferred. Finalize surfaces this run’s Epic heading so nothing parked is invisible at handoff.

With -H, the run is non-interactive: infer scope, default to production (unless --light), never prompt. Every exit point — a complete run at Stage 6, or an early block at Stage 1 (not a BMAD project / Epic unresolved / already complete), Stage 2 (preflight), or a Stage 6 story escalation — emits one object with all five keys always present, null when an artifact was not produced, and reason carrying a one-line cause only when blocked:

{"status": "complete|blocked",
"skill": "ultracode-goal",
"decision_log": "<path to this run's .decision-log.md>",
"report": "<path to run-report.md, or null>",
"deferred_work": "<path to deferred-work.md, or null>",
"reason": "<one line, present only when blocked>"}

An automator parses one schema regardless of where the run stopped; a blocked-before-report exit returns report and deferred_work as null rather than omitting them.