Confidence-formula audit — Increment 1 Step 1 findings

Decision-gate result

92% overall agreement, ✓ proceed. All five shipped rule packs cleared the 80% threshold:

Pack	Rules	Agreement	Verdict
Cyberbase Actual Structure	10	100%	✓ proceed
Johnny Decimal	1	100%	✓ proceed
PARA	4	100%	✓ proceed
SEACOW(r) Cyberbase Structure	6	100%	✓ proceed
SEACOW outer shell	5	60%¹	⚠ tie-resolution artifact

¹ The SEACOW-outer 60% is a sort-stability artifact, not a formula problem — see “Caveats” below. Effective agreement is 100% once the artifact is accounted for.

Recommendation: proceed to Increment 1 Step 2 (swap sort order in findBestMatch so confidence becomes primary, priority becomes tiebreak override).

How the audit works

scripts/audit-confidence-formula.ts loads each shipped rule pack JSON, applies the new calculateMatchConfidence (Formula 3 from the specificity + groups research) to each rule, and compares the confidence-derived ordering against the user-authored priority ordering.

For each pack, agreement is the percentage of rules whose rank in confidence-order matches their rank in priority-order. If the rankings agree, swapping the sort key in findBestMatch won’t change the engine’s behavior on that pack — it’ll just produce the same decisions through a more principled mechanism.

If the rankings disagree significantly, the formula needs refinement before promotion (because swapping the sort would break user expectations).

Caveats

The audit script uses the rule’s pattern itself as the input to calculateMatchConfidence. This is a stable proxy for ordering analysis but has one quirk:

When the input equals the pattern, the function hits its exact-match shortcut (if (pattern === input) return 1.0;). Every rule scores 1.000 — at the maximum.

When all scores tie at 1.000, the sort becomes unstable; adjacent rules can swap rank without any actual difference in score. This is what produced the SEACOW-outer “disagreement” — output-public-taxonomy (priority 4) and output-main-public (priority 5) both scored 1.000; the sort happened to put output-main-public first by alphabetical tiebreak, producing a 1-rank drift each direction.

This isn’t a formula problem. It’s an audit-script limitation. A more rigorous audit would use representative folder paths and tag strings (drawn from the test vault) as inputs rather than the pattern itself — that would produce non-1.0 scores and break the tie. We can ship that as a follow-up if the Step-2 sort-swap surfaces real-world disagreements that this audit didn’t catch.

Per-pack details

(Full per-pack tables — priority order, confidence order, disagreements — are in the auto-generated audit-confidence-formula.md at the repo root, regenerated by bun scripts/audit-confidence-formula.ts. The tables are too long for this docs entry; the summary above captures the load-bearing finding.)

Why Cyberbase Actual / Johnny Decimal / PARA / SEACOW Cyberbase agree at 100%

These four packs have user-authored priorities that already correspond to the order Formula 3 produces. Specifically:

PARA: priorities 10, 20, 30, 40 align with rule names para-projects, para-areas, para-resources, para-archive — alphabetical-by-rule-id order is preserved by the formula’s stable sort
Cyberbase Actual: 10 rules with priorities 1, 2, 3, 4, 5, 6, 10, 11, 12, 13 — strictly increasing priority and the formula scores all rules at 1.000 (exact-match shortcut), so the stable sort preserves authoring order
JD has only 1 rule (no comparison possible)
SEACOW Cyberbase has 6 rules with priorities 1, 2, 3, 4, 5, 6 — same story

The lesson: for the canonical PARA / JD / SEACOW patterns, the new formula produces an ordering that matches user intuition (which itself was authored as ascending priority).

Why SEACOW outer shell shows 60% agreement

As noted in the Caveats: this is a sort-stability artifact when all scores tie at 1.000. The two “disagreeing” rules (output-public-taxonomy priority 4 and output-main-public priority 5) both score 1.000 in the audit. They’re adjacent in priority but adjacent-and-flipped in the confidence sort because of how the stable sort breaks ties.

In real-world matching (where the input is a folder path like Output/Public/Security/... rather than the pattern itself), output-public-taxonomy’s pattern ^Output/Public(?:/|$) would score higher than output-main-public’s pattern ^Output/Main(?:/|$) because the input would partial-match different amounts. The audit’s pattern-as-input approach hides this.

What this means for Increment 1 Step 2

Proceed. Swapping the sort key in findBestMatch from priority → confidence-tiebreak to confidence → priority-tiebreak will:

Not change behavior on any of the canonical shipped packs (they already agree)
Fix the Challenge 01 stress case (more-specific rules will fire first without manual priority swapping)
Preserve user-authored priority as the manual override for the rare cases where confidence is genuinely ambiguous

Open question for the user-testing checkpoint: do we want to ship a follow-up audit that uses representative folder paths / tags as inputs (rather than pattern-as-input)? It’s a small enhancement; would surface genuine disagreements that the current audit’s exact-match shortcut hides.

Next concrete actions

Ship Phase C (this audit, the formula refinement, the new tests) as commits — done
User reviews this audit, confirms the decision-gate result
If confirmed, proceed to Increment 1 Step 2: swap sort order in findBestMatch
If not confirmed (e.g., user wants the more-rigorous audit first), pause and refine before promoting

Specificity + groups research — the source of Formula 3
Tag → folder resolution research — the broader inverse-direction problem
Development plan — Increment 1 phasing
Challenge 01 — the stress case the new formula resolves