Challenge 37: Tier 2 scale model under ontology-web framing (RERUN of Ch 18)
Predecessor + what’s different now
Original Ch 18 (resolved 2026-05-02):
- Asked: scale ceiling for sqlite-wasm + sqlite-vec + simple-graph + recursive-CTE; rule expressivity matrix for SSSOM subset
- Verdict: ~100K mappings comfortably viable; rule expressivity matrix (4 ✅ / 4 ⚠️ / 1 ❌); migration trigger spec
- Outcome: Tier 2-Lite promoted to default-bundled v0.1 sidecar
What’s different now (2026-05-08 rerun):
- Scale assumption — original sized for GRC vault (~5K controls × 8 frameworks ≈ 40K junctions, well under 100K ceiling); rerun asks for ontology-web vaults
- Reference scales — BioPortal hosts 700+ ontologies; UMLS has 3.5M concepts; OBO Foundry has hundreds × thousands of terms each; NIST OLIR has thousands of crosswalks across many frameworks
- Substrate shifted — sqlite-vec deferred per WASM-A pivot; no vector layer in v0.1
- Framing shifted — original was SSSOM-rule-focused; rerun is general ontology-web query engine
- Sister challenges — Ch 33 (engine landscape) + Ch 34 (streaming) bracket this with broader context
Why this exists (under the new framing)
Crosswalker’s Tier 2 sidecar (v0.1.5 shipped) runs sqlite-wasm with recursive CTEs for closure queries. At GRC scale, this fits comfortably in browser memory and returns results in milliseconds to seconds. Ontology-web scale is fundamentally different:
| Scale tier | Concepts | Mappings | sqlite-wasm feasible? |
|---|---|---|---|
| GRC small | ~1K | ~5K | ✅ trivially |
| GRC medium | ~10K | ~50K | ✅ comfortable |
| GRC large (Ch 18 ceiling) | ~50K | ~100K | ✅ near ceiling |
| OLIR scale | ~10K | ~10K (across 100+ frameworks) | ✅ probably |
| OBO Foundry single ontology | ~100K | ~500K (with hierarchy) | ⚠️ needs validation |
| BioPortal (700 ontologies) | ~1M (ontology metadata) + ~100M (concepts across all) | (not estimated) | ❌ infeasible in-browser |
| UMLS (3.5M concepts + cross-ontology mappings) | ~3.5M | ~50M+ | ❌ infeasible in-browser |
Most users won’t load the entire BioPortal or UMLS into Obsidian. But the architecture must handle SOME ontology-web scale — at least medium (OBO Foundry single-ontology or OLIR-scale).
What we already have
| Asset | What it gives us |
|---|---|
| Ch 18 archived brief + deliverable | Original ~100K ceiling; rule expressivity matrix; engineering scale model |
| v0.1.5 Tier 2 sidecar shipped | Working sqlite-wasm sidecar; closure cache; lazy materialization |
| WASM-A pivot synthesis | sqlite-vec deferred; plain sqlite-wasm; 2026-11-06 revisit anchored |
| Ch 24 synthesis | sqlite-wasm + 5 migration triggers |
What to investigate
1. Re-run the scale model under ontology-web framing
Original Ch 18 scale model was sized for SSSOM mappings (rule-expressed graph). The rerun applies the same model to:
- OBO Foundry single ontology (gene-ontology = ~50K terms, ~250K relations). Does sqlite-wasm + recursive CTE handle closure queries within acceptable latency?
- OBO Foundry multi-ontology (5-10 ontologies federated) — 500K-1M concepts, 5M-10M relations
- NIST OLIR scale (1000s of crosswalks across 50+ frameworks) — what’s the realistic concept × mapping count?
- BioPortal subset (10-50 ontologies a user might import — not the full 700)
- UMLS subset (UMLS has Metathesaurus tables; user might import a subset for their domain)
For each: latency profile, memory profile, what queries break first.
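One way to start the rerun is a synthetic benchmark before loading any real ontology. The sketch below uses Python's stdlib `sqlite3` as a stand-in for sqlite-wasm. That is an assumption: absolute timings in the browser will differ, and only the growth trend is likely to carry over. The `edge` table, `build_tree`, and `time_closure` are hypothetical names, and a balanced tree is a simplification of real ontology shapes.

```python
import sqlite3
import time


def build_tree(conn, n_nodes, fanout=4):
    """Create a synthetic is-a hierarchy: node i's parent is (i - 1) // fanout."""
    conn.execute("CREATE TABLE edge (child INTEGER, parent INTEGER)")
    conn.executemany(
        "INSERT INTO edge VALUES (?, ?)",
        ((i, (i - 1) // fanout) for i in range(1, n_nodes)),
    )
    # Index both directions so ancestor and descendant walks avoid full scans.
    conn.execute("CREATE INDEX edge_parent ON edge(parent)")
    conn.execute("CREATE INDEX edge_child ON edge(child)")


# Recursive CTE that counts every descendant of a starting node.
DESCENDANTS = """
WITH RECURSIVE closure(id) AS (
    SELECT child FROM edge WHERE parent = ?
    UNION
    SELECT e.child FROM edge e JOIN closure c ON e.parent = c.id
)
SELECT count(*) FROM closure
"""


def time_closure(n_nodes, root=0):
    """Return (descendant_count, seconds) for a full closure from `root`."""
    conn = sqlite3.connect(":memory:")
    build_tree(conn, n_nodes)
    start = time.perf_counter()
    (count,) = conn.execute(DESCENDANTS, (root,)).fetchone()
    elapsed = time.perf_counter() - start
    conn.close()
    return count, elapsed


if __name__ == "__main__":
    for n in (1_000, 10_000, 100_000):
        count, elapsed = time_closure(n)
        print(f"{n:>7} nodes: {count} descendants in {elapsed * 1000:.1f} ms")
```

Swapping `build_tree` for a loader over a real ontology export would turn the same harness into the latency half of the per-scale profile. Memory profiling needs separate instrumentation.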
2. Identify scale-breaking query patterns
At what scale do specific query primitives break?
- filter — fast at any scale (indexed)
- project — fast at any scale
- traversal (1-hop) — fast at any scale (indexed by predicate)
- closure (recursive CTE) — works when depth-bounded but breaks at large depths; what’s the practical depth ceiling at OBO scale?
- anti-join — works against indexes; at what scale does it spill?
- pivot — full cross-product is N² in axis size; what’s feasible?
- aggregate — count/sum/avg are linear; usually fine
For each primitive: scale ceiling × failure mode.
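To make two of these primitives concrete, here is a minimal sketch of a depth-bounded closure and an anti-join, again using Python's stdlib `sqlite3` in place of sqlite-wasm. The `concept` and `mapping` tables, their columns, and the `broader` predicate are illustrative assumptions, not the shipped sidecar schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE concept (id TEXT PRIMARY KEY);
CREATE TABLE mapping (subject TEXT, predicate TEXT, object TEXT);
CREATE INDEX mapping_sp ON mapping(subject, predicate);
INSERT INTO concept VALUES ('a'), ('b'), ('c'), ('d'), ('e');
INSERT INTO mapping VALUES
    ('a', 'broader', 'b'),
    ('b', 'broader', 'c'),
    ('c', 'broader', 'd');
""")

# Closure with an explicit depth column; the depth bound also guarantees
# termination if the mapping graph contains cycles.
BOUNDED_CLOSURE = """
WITH RECURSIVE reach(id, depth) AS (
    SELECT object, 1 FROM mapping
    WHERE subject = :start AND predicate = :pred
    UNION
    SELECT m.object, r.depth + 1
    FROM mapping m JOIN reach r ON m.subject = r.id
    WHERE m.predicate = :pred AND r.depth < :max_depth
)
SELECT id, min(depth) FROM reach GROUP BY id ORDER BY min(depth), id
"""

# Anti-join: concepts that never appear as the subject of any mapping.
UNMAPPED = """
SELECT c.id FROM concept c
WHERE NOT EXISTS (SELECT 1 FROM mapping m WHERE m.subject = c.id)
ORDER BY c.id
"""


def bounded_closure(start, pred, max_depth):
    """Return (concept_id, shortest_depth) pairs reachable within max_depth hops."""
    params = {"start": start, "pred": pred, "max_depth": max_depth}
    return conn.execute(BOUNDED_CLOSURE, params).fetchall()


def unmapped_concepts():
    """Return ids of concepts with no outgoing mapping."""
    return [row[0] for row in conn.execute(UNMAPPED)]
```

With the sample rows, `bounded_closure("a", "broader", 2)` stops at `c` and never visits `d`. Sweeping `max_depth` on a large ontology is one way to locate the practical depth ceiling.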
3. Reconcile with shipped v0.1.5 closure cache behavior
The closure cache materializes results lazily on first query. At ontology-web scale:
- How big does the cache get? Does it fit in browser memory?
- How long does first-query take? (Cold cache materialization)
- Does invalidation (any mapping change → full recompute) make the cache useless at scale?
- Should the cache be partial (per-start-concept rather than full closure)?
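The partial-cache option in the last question can be sketched as follows. This is an illustration of the idea, not the v0.1.5 cache: `PartialClosureCache` and its methods are hypothetical, the graph is held in memory, not in sqlite-wasm, and only edge insertion is handled.

```python
from collections import deque


class PartialClosureCache:
    """Caches closures per start concept and invalidates only affected entries."""

    def __init__(self, edges):
        self.adj = {}
        for subject, obj in edges:
            self.adj.setdefault(subject, set()).add(obj)
        self.cache = {}  # start concept -> frozenset of reachable concepts

    def closure(self, start):
        """Materialize the closure for `start` lazily, on first request."""
        if start not in self.cache:
            seen, queue = set(), deque([start])
            while queue:
                node = queue.popleft()
                for nxt in self.adj.get(node, ()):
                    if nxt not in seen:
                        seen.add(nxt)
                        queue.append(nxt)
            self.cache[start] = frozenset(seen)
        return self.cache[start]

    def add_edge(self, subject, obj):
        """Add a mapping and drop only the cached closures it can change.

        A cached closure can change only if it starts at `subject` or
        already reaches `subject`; every other entry stays valid.
        """
        self.adj.setdefault(subject, set()).add(obj)
        stale = [
            key for key, reached in self.cache.items()
            if key == subject or subject in reached
        ]
        for key in stale:
            del self.cache[key]
        return stale
```

Under this scheme an edit deep inside one ontology leaves cached closures for unrelated start concepts untouched. The cost is a scan over cached entries on each change, which would itself need measuring at scale.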
4. Map / partition strategies
Beyond just “scale to N concepts,” explore strategies that bound the working set:
- Per-ontology partitioning — query within one ontology at a time; cross-ontology federation only when necessary
- Per-recipe scope — recipes declare their concept-id-set; engine works only within scope
- On-demand subgraph extraction — query specifies a starting concept + depth; engine extracts subgraph; queries operate on subgraph
- Top-N pruning — for closure, keep only top-N most-relevant paths (relevance = confidence × inverse-depth)
- Lazy materialization with eviction — cache closures with LRU eviction at memory threshold
For each strategy: what’s the user UX impact? When does it pay off?
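As an illustration of the Top-N pruning strategy, the sketch below scores each path as confidence × inverse-depth and keeps the N best. Two things are assumptions here: path confidence is taken to be the product of edge confidences, and the `edges` dictionary shape and `top_n_paths` name are hypothetical. For brevity it enumerates all paths up to `max_depth` before pruning; a production version would prune during traversal.

```python
import heapq


def top_n_paths(edges, start, n, max_depth=4):
    """Return the n highest-relevance (score, path) pairs from `start`.

    edges: dict mapping subject -> list of (object, confidence) pairs.
    relevance = path confidence * (1 / depth), where path confidence is
    assumed to be the product of the edge confidences along the path.
    """
    scored = []
    stack = [(start, (start,), 1.0)]
    while stack:
        node, path, conf = stack.pop()
        depth = len(path) - 1
        if depth > 0:
            scored.append((conf / depth, path))
        if depth == max_depth:
            continue
        for obj, edge_conf in edges.get(node, ()):
            if obj not in path:  # skip cycles
                stack.append((obj, path + (obj,), conf * edge_conf))
    return heapq.nlargest(n, scored)
```

For example, with `edges = {"a": [("b", 0.9), ("c", 0.5)], "b": [("d", 0.8)]}`, the two direct mappings outrank the two-hop path to `d`, because the longer path is penalized by both the confidence product and the depth divisor.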
5. Cross-tier offload: when to move to Tier 3
Ch 16 / Ch 24 keep Apache Jena Fuseki + oxigraph-server as the Tier 3 server-side path. At ontology-web scale, Tier 3 may be the only viable path:
- At what scale does Tier 2 (in-browser sqlite-wasm) become infeasible?
- What’s the user experience of “this query is too big; offload to Tier 3”?
- Is the v0.1.6 architecture compatible with future Tier 3 offload, or does it need to be redesigned?
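One possible shape for the "too big; offload" decision is a capped probe: run the closure with a row limit one above the budget and route to Tier 3 only if the limit is hit. The sketch below assumes Python's stdlib `sqlite3` as a stand-in for sqlite-wasm. `route_closure`, `demo_connection`, the `edge` table, and `row_budget` are all hypothetical; the actual budget is exactly what this challenge is meant to determine.

```python
import sqlite3

# A LIMIT on the recursive compound select caps how many rows the CTE
# ever produces, so the probe cannot run away on a huge closure.
PROBE = """
WITH RECURSIVE closure(id) AS (
    SELECT child FROM edge WHERE parent = ?
    UNION
    SELECT e.child FROM edge e JOIN closure c ON e.parent = c.id
    LIMIT ?
)
SELECT id FROM closure
"""


def route_closure(conn, start, row_budget):
    """Return ("tier2", rows) if the closure fits the budget, else ("tier3", None)."""
    rows = conn.execute(PROBE, (start, row_budget + 1)).fetchall()
    if len(rows) > row_budget:
        return "tier3", None
    return "tier2", [row[0] for row in rows]


def demo_connection(chain_length=10):
    """Build a toy chain 0 -> 1 -> ... -> chain_length - 1 for experimentation."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE edge (child INTEGER, parent INTEGER)")
    conn.executemany(
        "INSERT INTO edge VALUES (?, ?)",
        [(i + 1, i) for i in range(chain_length - 1)],
    )
    return conn
```

A probe like this bounds the wasted work to `row_budget + 1` rows, which could let the user-facing message appear quickly instead of after a stalled query.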
6. Migration triggers (the load-bearing output)
Ch 24’s 5 migration triggers need updating per ontology-web framing:
- Vector extension packaging — does scale change urgency?
- WASM bundle size — n/a for this challenge
- Closure query latency — what’s the ontology-web ceiling?
- Mobile / low-end performance — does scale make mobile harder?
- Federation requirement — does scale force federation?
Add new triggers if needed. Output: revised trigger list ready for Ch 24 update.
Anti-patterns to reject upfront
The deliverable must NOT recommend:
- Migrating off sqlite-wasm pre-emptively — Ch 24 and v0.1.5 P3 confirmed; reversal needs hard evidence
- Loading BioPortal / UMLS in full into Obsidian — out of scope; user imports a domain subset
- Cross-vault federation — explicitly out of scope per Ch 27/28
- Speculative architecture overhauls — keep recommendations within v0.1.7+ scope
- Dropping in-browser query and requiring a server — Crosswalker’s value is vault-native; the server is an escape hatch
- Reintroducing libSQL / Turso / Limbo — Ch 24 rejected
- Ignoring Mobile — solutions must degrade gracefully on Obsidian Mobile
Success criteria for the deliverable
The deliverable must produce:
- Re-run scale model — OBO Foundry, OLIR, BioPortal-subset, UMLS-subset; latency/memory/feasibility per primitive
- Scale-breaking query catalog — which primitive at which scale, failure mode, mitigation
- Closure cache analysis — at scale, does the cache work? if not, alternative
- Map/partition strategy survey — 4+ strategies × tradeoffs
- Tier 3 offload threshold — when does Tier 3 become required (not just optional)?
- Updated migration triggers — revisions to Ch 24’s 5 triggers
- Recommended v0.1.7+ scope additions — what scale-handling work belongs in next milestones
- Verdict on original Ch 18 ceiling — REAFFIRMED / REVISED / DEFERRED-TO-LATER
Anchored references
Predecessor:
Project context:
- v0.1.5 Tier 2 sidecar shipped — what’s running today
- Ch 24 synthesis — substrate + migration triggers
- WASM-A pivot synthesis
- concepts/embedded-vs-server-substrates
Reference scales:
- BioPortal — 700+ ontologies hosted
- OBO Foundry — biomedical ontology library
- UMLS Metathesaurus — 3.5M concepts
- NIST OLIR — crosswalk catalog
Sister challenges:
- Ch 33 — Multi-modal landscape audit — substrate alternatives at scale
- Ch 34 — Streaming / chunked execution — chunked execution for queries that exceed RAM
- Ch 35 — Graph→tabular bridging rerun — pivot at scale
Hand-off
Write the deliverable to docs/.../zz-research/YYYY-MM-DD-challenge-37-deliverable-a-<slug>.md. After deliverable lands: flip synthesis log §9 status Ch 37 row from ⏳ to ✅; update Ch 18 archived brief with :::note callout pointing to this rerun; update Ch 24 migration triggers per findings; if verdict revises original Ch 18 ceiling, document explicitly in synthesis log; archive this brief.