Skip to content

Confidence-formula audit — Increment 1 Step 1 findings

92% overall agreement, ✓ proceed. All five shipped rule packs cleared the 80% threshold:

PackRulesAgreementVerdict
Cyberbase Actual Structure10100%✓ proceed
Johnny Decimal1100%✓ proceed
PARA4100%✓ proceed
SEACOW(r) Cyberbase Structure6100%✓ proceed
SEACOW outer shell560%¹⚠ tie-resolution artifact

¹ The SEACOW-outer 60% is a sort-stability artifact, not a formula problem — see “Caveats” below. Effective agreement is 100% once the artifact is accounted for.

Recommendation: proceed to Increment 1 Step 2 (swap sort order in findBestMatch so confidence becomes primary, priority becomes tiebreak override).

scripts/audit-confidence-formula.ts loads each shipped rule pack JSON, applies the new calculateMatchConfidence (Formula 3 from the specificity + groups research) to each rule, and compares the confidence-derived ordering against the user-authored priority ordering.

For each pack, agreement is the percentage of rules whose rank in confidence-order matches their rank in priority-order. If the rankings agree, swapping the sort key in findBestMatch won’t change the engine’s behavior on that pack — it’ll just produce the same decisions through a more principled mechanism.

If the rankings disagree significantly, the formula needs refinement before promotion (because swapping the sort would break user expectations).

The audit script uses the rule’s pattern itself as the input to calculateMatchConfidence. This is a stable proxy for ordering analysis but has one quirk:

When the input equals the pattern, the function hits its exact-match shortcut (if (pattern === input) return 1.0;). Every rule scores 1.000 — at the maximum.

When all scores tie at 1.000, the sort becomes unstable; adjacent rules can swap rank without any actual difference in score. This is what produced the SEACOW-outer “disagreement” — output-public-taxonomy (priority 4) and output-main-public (priority 5) both scored 1.000; the sort happened to put output-main-public first by alphabetical tiebreak, producing a 1-rank drift each direction.

This isn’t a formula problem. It’s an audit-script limitation. A more rigorous audit would use representative folder paths and tag strings (drawn from the test vault) as inputs rather than the pattern itself — that would produce non-1.0 scores and break the tie. We can ship that as a follow-up if the Step-2 sort-swap surfaces real-world disagreements that this audit didn’t catch.

(Full per-pack tables — priority order, confidence order, disagreements — are in the auto-generated audit-confidence-formula.md at the repo root, regenerated by bun scripts/audit-confidence-formula.ts. The tables are too long for this docs entry; the summary above captures the load-bearing finding.)

Why Cyberbase Actual / Johnny Decimal / PARA / SEACOW Cyberbase agree at 100%

Section titled “Why Cyberbase Actual / Johnny Decimal / PARA / SEACOW Cyberbase agree at 100%”

These four packs have user-authored priorities that already correspond to the order Formula 3 produces. Specifically:

  • PARA: priorities 10, 20, 30, 40 align with rule names para-projects, para-areas, para-resources, para-archive — alphabetical-by-rule-id order is preserved by the formula’s stable sort
  • Cyberbase Actual: 10 rules with priorities 1, 2, 3, 4, 5, 6, 10, 11, 12, 13 — strictly increasing priority and the formula scores all rules at 1.000 (exact-match shortcut), so the stable sort preserves authoring order
  • JD has only 1 rule (no comparison possible)
  • SEACOW Cyberbase has 6 rules with priorities 1, 2, 3, 4, 5, 6 — same story

The lesson: for the canonical PARA / JD / SEACOW patterns, the new formula produces an ordering that matches user intuition (which itself was authored as ascending priority).

Why SEACOW outer shell shows 60% agreement

Section titled “Why SEACOW outer shell shows 60% agreement”

As noted in the Caveats: this is a sort-stability artifact when all scores tie at 1.000. The two “disagreeing” rules (output-public-taxonomy priority 4 and output-main-public priority 5) both score 1.000 in the audit. They’re adjacent in priority but adjacent-and-flipped in the confidence sort because of how the stable sort breaks ties.

In real-world matching (where the input is a folder path like Output/Public/Security/... rather than the pattern itself), output-public-taxonomy’s pattern ^Output/Public(?:/|$) would score higher than output-main-public’s pattern ^Output/Main(?:/|$) because the input would partial-match different amounts. The audit’s pattern-as-input approach hides this.

Proceed. Swapping the sort key in findBestMatch from priority → confidence-tiebreak to confidence → priority-tiebreak will:

  • Not change behavior on any of the canonical shipped packs (they already agree)
  • Fix the Challenge 01 stress case (more-specific rules will fire first without manual priority swapping)
  • Preserve user-authored priority as the manual override for the rare cases where confidence is genuinely ambiguous

Open question for the user-testing checkpoint: do we want to ship a follow-up audit that uses representative folder paths / tags as inputs (rather than pattern-as-input)? It’s a small enhancement; would surface genuine disagreements that the current audit’s exact-match shortcut hides.

  1. Ship Phase C (this audit, the formula refinement, the new tests) as commits — done
  2. User reviews this audit, confirms the decision-gate result
  3. If confirmed, proceed to Increment 1 Step 2: swap sort order in findBestMatch
  4. If not confirmed (e.g., user wants the more-rigorous audit first), pause and refine before promoting