Verifying a Skill
Nothing is made up. Every instruction in every skill traces back to a specific file, a specific line, and a specific commit in the upstream source. If a skill claims a function exists, you can open the real source tree at the pinned commit and see it with your own eyes. If the claim and the source disagree — that’s a bug, and SKF treats it as one.
The three-step audit
Pick any symbol in any SKF-compiled skill. You can trace it to the exact line of upstream source in under 60 seconds.
1. Open the skill’s metadata.json
Every skill ships a metadata.json next to its SKILL.md. Note two fields:
- source_commit: the exact commit SHA the skill was compiled from
- source_repo: the upstream repository
This is the anchor. Everything else traces back to this commit.
2. Open the skill’s provenance-map.json
Provenance maps live in forge-data/{skill}/{version}/provenance-map.json alongside each compiled skill. Find your symbol. Every entry carries its own source_file and source_line:
```json
{
  "export_name": "search",
  "export_type": "function",
  "params": [
    "query_text: str",
    "query_type: SearchType = GRAPH_COMPLETION",
    "top_k: int = 10"
  ],
  "return_type": "List[SearchResult]",
  "source_file": "cognee/api/v1/search/search.py",
  "source_line": 27,
  "confidence": "T1",
  "extraction_method": "ast-grep"
}
```

The snippet above is a real entry from forge-data/oms-cognee/1.0.0/provenance-map.json. The line number is not rounded. The confidence tier is explicit. The extraction method is named. Nothing is paraphrased.
3. Visit the upstream repo at the pinned commit
Open {source_repo} at {source_commit}, jump to {source_file} line {source_line}. The signature in SKILL.md should match what you see in the source.
If it doesn’t, that’s a bug. Open an issue. SKF will republish the skill with a new commit SHA and a new provenance map. Falsifiability isn’t a feature — it’s the whole deal.
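Steps 1 to 3 can be sketched as a small helper that turns the two JSON files into a clickable link. This is a minimal sketch, not part of SKF: `permalink` is a hypothetical helper, it assumes `source_repo` is a GitHub-style HTTPS URL, and the `blob/{commit}/{path}#L{line}` layout is GitHub's convention rather than anything the skill format mandates. The repository URL and commit in the usage are placeholders.

```python
def permalink(metadata: dict, entry: dict) -> str:
    """Build a line-level link to the upstream source for one provenance entry.

    Assumes a GitHub-style host; other forges use different URL layouts.
    """
    repo = metadata["source_repo"].rstrip("/")
    if repo.endswith(".git"):
        repo = repo[: -len(".git")]
    return (
        f"{repo}/blob/{metadata['source_commit']}/"
        f"{entry['source_file']}#L{entry['source_line']}"
    )


# Placeholder metadata: the repo URL and commit are illustrative, not real values.
metadata = {
    "source_repo": "https://github.com/example-org/example-lib.git",
    "source_commit": "0123abc",
}
# Fields copied from the provenance entry shown in step 2.
entry = {
    "export_name": "search",
    "source_file": "cognee/api/v1/search/search.py",
    "source_line": 27,
}

print(permalink(metadata, entry))
```

Opening the printed link in a browser is the whole of step 3: the signature at that line should match what SKILL.md documents.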
Workflow-time enforcement
The same anchor is enforced automatically by skf-test-skill and by gap-driven skf-update-skill. Before either workflow reads source at a recorded source_line, it runs git rev-parse HEAD on the local workspace and compares it to metadata.source_commit. If the workspace has drifted, the workflow halts with a halted-for-workspace-drift status and tells you the exact git checkout {source_ref} to re-sync, so spot-checks can never silently verify against the wrong tree. Pass --allow-workspace-drift to opt in to reading the current HEAD anyway; the override is recorded in the final report rather than hidden.
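The drift check described above can be approximated in a few lines. This is a sketch under stated assumptions, not the workflows' actual code: `current_head` and `drift_status` are hypothetical names, both SHAs are assumed to be full-length, and the `drift-overridden` label is invented here to stand in for however the real report records the override.

```python
import subprocess


def current_head(workspace: str) -> str:
    """Return the commit SHA that HEAD points at in a local git workspace."""
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        cwd=workspace,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def drift_status(head: str, source_commit: str, allow_drift: bool = False) -> str:
    """Compare the workspace HEAD with the commit pinned in metadata.json."""
    if head.strip().lower() == source_commit.strip().lower():
        return "ok"
    if allow_drift:
        # Hypothetical label: the source only says the override is recorded.
        return "drift-overridden"
    return "halted-for-workspace-drift"
```

A caller would combine the two, for example `drift_status(current_head("."), metadata["source_commit"])`, and refuse to read any source_line unless the result is `ok` or the override was requested explicitly.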
Where to look for what
Every file in the per-skill output carries a specific job. Here’s the lookup table for the really skeptical:
| Question | File |
|---|---|
| What commit was the source pinned to? | skills/{name}/{version}/{name}/metadata.json → source_commit |
| Which symbols are documented and where did each come from? | forge-data/{name}/{version}/provenance-map.json |
| What AST patterns were used for extraction? | forge-data/{name}/{version}/extraction-rules.yaml |
| What signatures, types, and examples did the extractor actually capture? | forge-data/{name}/{version}/evidence-report.md |
| How was the skill scored? Show me the math. | forge-data/{name}/{version}/test-report-{name}.md |
| How was the skill scoped, and what was deliberately left out? | forge-data/{name}/skill-brief.yaml |
Everything a reader needs to reconstruct the compilation is in the two sibling directories: skills/ ships to consumers, forge-data/ is the audit trail.
The scores, including the ones we lose
Completeness scoring is never 100%. The scoring formula is deterministic and the pass threshold is 80%, but every test report also logs the specific edges where a skill falls short, so the numbers aren’t marketing.
Take oh-my-skills’ four reference skills as an example. Their scores range from 99.0% to 99.49% — none are perfect, and every test report names the specific drift it found:
| Skill | Score | What the report discloses |
|---|---|---|
| oms-cocoindex | 99.0% | 114/114 provenance entries; 55 public-API denominator from __init__.py __all__; 20/20 sampled signatures matched. Two denominators (barrel vs. full surface) both disclosed with rationale. |
| oms-cognee | 99.0% | 34/34 exports documented; denominator is the cognee/__init__.py barrel (61 lines, 34 public re-exports) at pinned commit 3c048aa4 (v1.0.0). |
| oms-storybook-react-vite | 99.49% | 215/216 documented — the missing 1 entry is logged openly as GAP-004, a canonical surface count drift from the stated denominator. |
| oms-uitripled | 99.45% | 34-entry denominator (not 11, not 25) with the full reconciliation reasoning in the report. |
Perfection is suspicious. Visible fallibility is trustworthy. SKF writes down the edges it can’t score cleanly — so you can read them and decide for yourself whether the remaining coverage is enough for your use case.
GAP-004: a worked example of the 1% that fails
The oms-storybook-react-vite test report scores 215/216, not 216/216. The missing 1 entry is logged as GAP-004: a canonical export surface count (via the provenance map) diverges from the stated denominator in metadata.json. The report names the gap, shows the math, and leaves the drift visible for the next recompilation pass. Nothing was hidden.
That’s the pattern SKF asks you to trust: when scoring can’t reach 100%, the report says so, cites the line, and leaves a fingerprint for the next audit.
How the Score Is Computed
The Test Skill workflow (@Ferris TS) calculates the completeness score, a weighted measure of how thoroughly and accurately a skill documents its target. This score is the quality gate: pass and the skill is ready for export; fail and it routes to update-skill for remediation.
Categories and weights
The score is the weighted sum of five categories:
| Category | Weight | What it measures |
|---|---|---|
| Export Coverage | 36% | Percentage of source exports documented in SKILL.md |
| Signature Accuracy | 22% | Documented function signatures match actual source signatures (parameter names, types, order, return types) |
| Type Coverage | 14% | Types and interfaces referenced in exports are fully documented |
| Coherence | 18% | Cross-references resolve, integration patterns are complete (contextual mode only) |
| External Validation | 10% | Average of skill-check quality score (0–100) and tessl content score (0–100%) |
Formula
```
total_score = sum(category_weight × category_score)
```

Each category score is a percentage: (items_passing / items_total) × 100.
Coherence (contextual mode) combines two sub-scores:
```
coherence = (reference_validity × 0.6) + (integration_completeness × 0.4)
```

If no integration patterns exist, coherence equals reference validity alone.
External validation averages the two tools when both are available. When only one tool is available, that tool’s score is used. When neither is available, the 10% weight is redistributed proportionally to the other active categories.
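The two sub-score rules above translate directly into code. The following is a minimal sketch, not the contents of compute-score.py: the function names and the use of `None` to mean "not available" are assumptions made for illustration, and all inputs are percentages on a 0 to 100 scale.

```python
from typing import Optional


def coherence(reference_validity: float,
              integration_completeness: Optional[float] = None) -> float:
    """Contextual-mode coherence; falls back to reference validity alone."""
    if integration_completeness is None:
        return reference_validity
    return reference_validity * 0.6 + integration_completeness * 0.4


def external_validation(skill_check: Optional[float] = None,
                        tessl: Optional[float] = None) -> Optional[float]:
    """Average the available tool scores; None means the category is inactive."""
    available = [s for s in (skill_check, tessl) if s is not None]
    if not available:
        return None
    return sum(available) / len(available)
```

With made-up inputs, `coherence(90, 70)` works out to 82 and `external_validation(80, 76)` to 78. A `None` result from `external_validation` is the signal that the 10% weight has to be redistributed to the other categories.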
Deterministic scoring
The weight redistribution and score aggregation are computed by a deterministic Python script (compute-score.py). The LLM extracts category scores from the test report, constructs a JSON input, invokes the script, and uses its output for the final score. Same inputs always produce the same score. If the script is unavailable, the LLM falls back to manual calculation using the same formulas.
Naive vs contextual mode
Test Skill runs in one of two modes, detected automatically:
- Contextual mode (stack skills) — all five categories scored with the default weights above.
- Naive mode (individual skills) — Coherence is not scored. Its 18% weight is redistributed:
| Category | Naive Weight |
|---|---|
| Export Coverage | 45% |
| Signature Accuracy | 25% |
| Type Coverage | 20% |
| External Validation | 10% |
Tier adjustments
Your forge tier determines which categories can be scored:
| Tier | Skipped Categories | Reason |
|---|---|---|
| Quick | Signature Accuracy, Type Coverage | No AST parsing available |
| Docs-only | Signature Accuracy, Type Coverage | No source code to compare against |
| Provenance-map (State 2) | Signature Accuracy, Type Coverage | String comparison only, no semantic AST verification |
| Forge / Forge+ / Deep | None | Full AST-backed scoring |
When categories are skipped, their combined weight is redistributed proportionally to the remaining active categories. A Quick-tier skill and a Deep-tier skill both pass at the same 80% threshold — the score reflects what your tier can actually measure.
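Proportional redistribution can be illustrated with a short sketch. It assumes "proportionally" means each remaining weight is divided by the sum of the remaining weights; `redistribute` is a hypothetical helper, and pairing the Quick tier with the contextual weights is only an example input, not a claim about which combinations occur in practice.

```python
def redistribute(weights: dict, skipped: set) -> dict:
    """Rescale the active categories so their weights sum to 100 again."""
    active = {name: w for name, w in weights.items() if name not in skipped}
    total = sum(active.values())
    if total == 0:
        raise ValueError("no active categories left to score")
    return {name: w * 100 / total for name, w in active.items()}


CONTEXTUAL_WEIGHTS = {
    "Export Coverage": 36,
    "Signature Accuracy": 22,
    "Type Coverage": 14,
    "Coherence": 18,
    "External Validation": 10,
}

# Quick tier skips the two AST-dependent categories.
quick = redistribute(CONTEXTUAL_WEIGHTS, {"Signature Accuracy", "Type Coverage"})
for name, weight in quick.items():
    print(f"{name}: {weight:.3f}%")
```

Under that assumption the three remaining categories carry 56.25%, 28.125%, and 15.625%, which keeps their relative importance (36 : 18 : 10) unchanged while the total returns to 100%.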
Hard gate
Before scoring, a hard gate scans all findings for Critical and High severity. If any are present, the pipeline blocks immediately: no score is computed and the skill cannot pass regardless of coverage percentage. Medium, Low, and Info findings pass through to scoring.
Pass/fail
```
threshold = pipeline_default OR custom_threshold OR 80% (default)

score >= threshold → PASS → Recommend export-skill
score < threshold  → FAIL → Recommend update-skill
```

The default threshold is 80%. Pipeline aliases declare their own defaults: deepwiki targets 90%, forge and forge-quick target 80%. You can override any default with a custom threshold (e.g., TS[min:85]).
When a skill scores between 80% and its target threshold (e.g., 82% against a 90% target), the soft gate falls back to the 80% floor — the skill passes, but an evidence report is written to forge-data/{skill}/{version}/evidence-report-fallback.md documenting the quality compromise.
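Taken together, the hard gate, the threshold, and the soft-gate fallback form a small decision procedure. The sketch below is one reading of the rules above, not SKF's implementation: `decide` and the `BLOCKED` and `PASS_WITH_FALLBACK` labels are hypothetical, and findings are modeled as a plain list of severity strings.

```python
def decide(score: float, findings: list, target: float = 80.0,
           floor: float = 80.0) -> str:
    """Apply the hard gate, then the threshold with the soft-gate fallback."""
    if any(severity in ("Critical", "High") for severity in findings):
        return "BLOCKED"  # hard gate: the score is never considered
    if score >= target:
        return "PASS"
    if score >= floor:
        # Passes at the 80% floor; a fallback evidence report would be written.
        return "PASS_WITH_FALLBACK"
    return "FAIL"


print(decide(82, ["Medium", "Low"], target=90))
```

The example call mirrors the 82%-against-90% case from the text and lands in the fallback branch. A single High finding would block the same skill regardless of its score.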
Gap severities
When the score is calculated, each finding is classified by severity to guide remediation:
| Severity | Examples |
|---|---|
| Critical | Missing exported function/class documentation |
| High | Signature mismatch between source and SKILL.md |
| Medium | Missing type/interface documentation; scripts/assets directory inconsistencies |
| Low | Missing optional metadata or examples; description optimization opportunities |
| Info | Style suggestions; discovery testing recommendations |
Score report output
The test report includes a score breakdown table showing each category’s raw score, weight, and weighted contribution:
| Category | Score | Weight | Weighted |
|---|---|---|---|
| Export Coverage | 92% | 36% | 33.1% |
| Signature Accuracy | 85% | 22% | 18.7% |
| Type Coverage | 100% | 14% | 14.0% |
| Coherence | 80% | 18% | 14.4% |
| External Validation | 78% | 10% | 7.8% |
| Total | | 100% | 88.0% |
The report also records analysisConfidence (full, provenance-map, metadata-only, remote-only, or docs-only) and includes a degradation notice when source access was limited.
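The breakdown table above can be re-derived by hand or with a few lines of code. This sketch only reproduces the arithmetic of the formula on the table's own numbers; `total_score` is a hypothetical helper, not the interface of compute-score.py.

```python
rows = [
    # (category, score %, weight %)
    ("Export Coverage", 92, 36),
    ("Signature Accuracy", 85, 22),
    ("Type Coverage", 100, 14),
    ("Coherence", 80, 18),
    ("External Validation", 78, 10),
]


def total_score(rows) -> float:
    """Weighted sum: each category contributes score * weight / 100 points."""
    return sum(score * weight / 100 for _, score, weight in rows)


total = total_score(rows)
print(f"{total:.1f}%")
```

The unrounded contributions are 33.12, 18.7, 14.0, 14.4, and 7.8, which sum to 88.02 and display as 88.0% at one decimal place, matching the Total row.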
Build-time drift detection (for docs themselves)
The SKF docs you’re reading right now are themselves verified against oh-my-skills. A docs/_data/pinned.yaml anchor file records the exact version, commit SHA, and confidence tier of every reference skill. A Node validator (tools/validate-docs-drift.js) runs as part of npm run quality and:
- Confirms canonical truth: every anchor in pinned.yaml is cross-checked against the actual metadata.json in oh-my-skills. Version, commit, tier, and authority must match.
- Scans docs for stale prose: every .md file is grepped for <library> v?<x.y.z> patterns, and any version that disagrees with pinned.yaml is flagged with file + line number.
If the validator flags drift, the CI fails before the docs get merged. It’s the same “nothing is made up” contract SKF applies to skills, applied to the docs that describe SKF. When the anchor file is updated to reflect a new oh-my-skills release, the prose must update too — otherwise npm run docs:validate-drift blocks the merge.
You can run it yourself from the SKF repo root:
```
npm run docs:validate-drift
```

Or point it at a different local copy of oh-my-skills:

```
OMS=/path/to/your/oh-my-skills npm run docs:validate-drift
```

Clean output looks like this:

```
OK: 4 skills checked against /home/you/oh-my-skills, no drift.
```

Dirty output cites exact file:line locations so the fix is mechanical.
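The stale-prose scan is essentially a regular-expression match against pinned versions. The real validator is the Node script named above; the Python sketch below only illustrates the `<library> v?<x.y.z>` idea. `find_stale_versions` is a hypothetical function, the pinned data is passed as a plain dict because the layout of pinned.yaml is not shown here, and the 0.9.2 version in the sample text is made up.

```python
import re


def find_stale_versions(text: str, pinned: dict) -> list:
    """Return (line_number, library, found_version) for mismatched versions."""
    hits = []
    for library, version in pinned.items():
        pattern = re.compile(
            rf"\b{re.escape(library)}\s+v?(\d+\.\d+\.\d+)\b", re.IGNORECASE
        )
        for line_no, line in enumerate(text.splitlines(), start=1):
            for match in pattern.finditer(line):
                if match.group(1) != version:
                    hits.append((line_no, library, match.group(1)))
    return hits


pinned = {"cognee": "1.0.0"}
sample = "Uses cognee v1.0.0 here.\nOlder docs mention cognee 0.9.2.\n"
print(find_stale_versions(sample, pinned))
```

On the sample text, only the second line is reported, since the first agrees with the pinned version. Returning the line number alongside the library name is what makes the fix mechanical.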
Reference output: oh-my-skills
Every example in this page points at oh-my-skills, the SKF reference portfolio. It contains four Deep-tier skills (cocoindex, cognee, Storybook v10, uitripled), each shipping its full audit trail alongside the compiled skill. The portfolio is both the worked example for this page and the continuing proof that the pipeline does what it says. If you want to see what SKF produces when you run it on real libraries, that’s the answer.