Ch 37: Tier 2 scale model under ontology-web framing (rerun of Ch 18)
Challenge 37 — Tier 2 Scale Model Under Ontology-Web Framing (RERUN of Ch 18)
Section titled “Challenge 37 — Tier 2 Scale Model Under Ontology-Web Framing (RERUN of Ch 18)”- The original Ch 18 ~100K-mapping ceiling for GRC vaults is REAFFIRMED, not revised — but the meaning of “ceiling” must be reframed: it is a per-query working-set ceiling, not a database-size ceiling. Plain
sqlite-wasm+ recursive CTEs comfortably handles single-ontology imports up to roughly GO/HPO scale (~50K nodes, ~250K edges, ~500K materialized closure rows) on desktop, and this is the realistic everyday Crosswalker workload. - The use case shift to ontology-web framing does NOT justify migrating off sqlite-wasm or adding a vector layer. It does, however, justify shipping (a) a per-ontology partition model, (b) a bounded materialized closure cache with LRU eviction, and (c) an explicit “this query is too big for Tier 2” tripwire that points users at Tier 3 (Fuseki/oxigraph-server) — instead of silently OOM-ing the Electron renderer at the V8 4 GB pointer-compression cage.
- Hard infeasibility starts above ~500K concepts / ~5M edges per active query scope (full multi-ontology OBO bundles, full SNOMED CT closure, any UMLS subset >~10% of Metathesaurus, any “all-pairs” pivot above ~1K × ~1K). These are Tier 3 workloads, full stop. Mobile (Obsidian iOS/Android, ~1.5–2 GB renderer ceiling) is hard-limited at roughly one-fifth of desktop scale and should be treated as a degraded read-only mode for any non-GRC ontology.
Key Findings
Section titled “Key Findings”1. Reference dataset sizes (concrete, current as of 2024–2026 releases).
| Source | Concepts/terms | Edges/relations | Notes |
|---|---|---|---|
| Gene Ontology (basic ed.) | ~43K terms | ~88K relations (basic) / ~134K (full) | + 7.6M annotations not relevant to crosswalk Tier 2 |
| Human Phenotype Ontology (HPO) | ~18K–20K terms (+2,239 added since prior report) | ~30K–40K relations | Small/medium |
| ChEBI | ~195K entries (full); ~14K curated metabolites | ~500K+ relations incl. role/structure | Mid-size |
| SNOMED CT (Intl.) | ~370K active concepts (+957K descriptions) | ~1.37M relationships | Materialized closure ~6.5M rows (per SNOMED Intl. Perl script) |
| UMLS Metathesaurus 2025AB | 3.49M concepts, 17.4M atoms | tens of millions of MRREL relationships | 190 source vocabs; full release ~35 GB on disk |
| BioPortal (May 2025 figures) | 1,549 ontologies (1,182 public), 15.3M terms | >100M cross-ontology mappings | Median individual ontology is small; the long tail is what matters |
| OBO Foundry | ~250 active ontologies passing dashboard | varies | Most are <50K classes; a handful (CL, Uberon, NCIT, SNOMED-via-Bioportal) push 100K+ |
| NIST OLIR catalog | dozens of OLIRs across ~10–15 focal documents | each typically 100–1,500 element-pair mappings | Small enough to be a non-issue; Crosswalker-native today |
2. sqlite-wasm performance characteristics, current state (Nov 2025 baselines).
- Storage backend matters more than the engine:
wa-sqliteOPFSCoopSyncVFS (and the official sqlite-wasm OPFS SAHPool) sustain good performance past 1 GB of database; the older IDBBatchAtomicVFS degrades sharply above ~100 MB. - Bundle size: official
@sqlite.org/sqlite-wasmis ~470 KB JS + ~900 KB WASM;wa-sqliteis ~230 KB + ~570 KB. - Memory ceiling: 32-bit WASM is hard-capped at 4 GB per instance. Obsidian (Electron + V8 pointer compression) caps the renderer at ~4 GB per process, shared with the rest of the Obsidian app. Realistic budget for the Crosswalker sqlite arena: ~500 MB–1 GB on desktop, ~150–300 MB on mobile.
- Recursive CTE behavior: SQLite docs and SNOMED/i2b2 community evidence confirm recursive CTE is functional for closure but is the slow path. SQLite Intl. itself ships a Perl script (not a CTE) to build the SNOMED closure table because all-pairs CTE on 370K nodes is impractical.
- Indexed lookup vs. closure: a single
WHERE parent=?lookup with an index on(parent, child)is sub-millisecond at any feasible Crosswalker scale. The “scale-breaking” primitive is always the recursive closure or the unbounded pivot, never the indexed point query.
3. Materialized closure cache analysis.
- Empirically,
|closure| ≈ k · |edges|where k is roughly the average path length in the DAG. For “is-a” hierarchies, k is typically 5–15. For GO: ~88K direct edges → ~500K–1M closure rows (~16–32 MB at 32 B/row). For SNOMED CT: 1.37M direct → ~6.5M closure rows (~200–260 MB stored). - Cold materialization time: a JS/CTE implementation in a worker will be minutes for SNOMED-class graphs and seconds for GO-class graphs. This is acceptable as a one-time cost on import, not on every query.
- Invalidation is the real problem. A single new
is-aedge in a high-fan-in node can invalidate tens of thousands of closure rows. The viable design is bounded-depth + LRU + lazy recompute.
4. Map/partition strategies — what the comparable tools actually do.
- BioPortal/OLS4 keep ontologies in separate indices and use server-side Neo4j only for graph queries; no tool tries to load the full corpus into a single client.
- Protégé Desktop loads ~90K classes “in <15 seconds” but search becomes laggy at SNOMED scale (370K) even on 8 GB JVMs.
- WebProtégé is server-backed (Tomcat + MongoDB).
- OxO / OxO2 uses Datalog (Nemo) on the server precisely because mapping walks across BioPortal-scale data exceed browser budgets.
- Oxigraph-WASM is in-memory only and explicitly notes “the WASM intrinsic memory limitations hinder the scalability of this approach” at ~725K quads.
The consistent industry pattern: partition by ontology, query a subgraph, never materialize the world.
5. Tier 2 → Tier 3 thresholds. A practical, evidence-grounded threshold matrix (desktop, OPFS-backed sqlite-wasm, single active query):
| Tier | Concepts | Edges | Closure rows | Working-set RAM | Verdict |
|---|---|---|---|---|---|
| Green (Tier 2 routine) | ≤50K | ≤250K | ≤500K | <100MB | GO, HPO, NIST OLIR, ChEBI subsets, all GRC vaults |
| Yellow (Tier 2 with cache+partition) | 50K–250K | 250K–2M | 500K–3M | 100–500 MB | Federated 5–10 OBO ontologies; ChEBI full; UMLS thin domain subsets |
| Orange (Tier 2 read-only / Tier 3 recommended) | 250K–500K | 2M–5M | 3M–8M | 500 MB–1 GB | SNOMED CT; large multi-OBO bundles |
| Red (Tier 3 only) | >500K | >5M | >8M | >1 GB | Full UMLS, BioPortal-scale federation, all-pairs pivots over large axes |
| Mobile | divide all the above by ~5 | <300MB | Obsidian Mobile is a Yellow-only tier |
6. Migration triggers — updated from Ch 24.
| # | Trigger | Original status | Ch 37 update |
|---|---|---|---|
| 1 | Vector extension packaging | Deferred per WASM-A | Unchanged. No ontology-web pressure to add vectors. |
| 2 | WASM bundle size | <2 MB acceptable | Unchanged. wa-sqlite ~800 KB total is fine. |
| 3 | Closure query latency | p95 <500 ms target | Tightened. Add a hard “soft budget” of 250 ms wall-clock with cancellation. |
| 4 | Mobile / low-end | ”monitor” | Elevated. Mobile is now a first-class constraint with its own threshold tier. |
| 5 | Federation requirement | Future | Unchanged but reframed. Federation = Tier 3 by definition. |
| 6 (NEW) | Closure cache memory budget | — | Trigger fires if materialized closure for any single imported ontology exceeds 25% of estimated renderer memory. |
| 7 (NEW) | Per-query working-set ceiling | — | Trigger fires if any single query primitive materializes >1M rows in result-set memory. Force pagination or Tier 3 referral. |
7. Scale-breaking query catalog — which primitive breaks at which scale.
| Primitive | GO scale (50K/250K) | Multi-OBO (500K/5M) | SNOMED (370K/1.37M) | UMLS subset (1M/10M) |
|---|---|---|---|---|
Indexed point lookup (get by id) | <1 ms | <1 ms | <1 ms | <1 ms |
| 1-hop neighbors (filter) | <5 ms | <10 ms | <10 ms | 10–50 ms |
| Bounded-depth closure (depth ≤3) | 5–50 ms | 50–500 ms | 100 ms–1 s | fail without cache |
| Full transitive closure (recursive CTE) | 100 ms–1 s | 5–60 s; OOM-risk | 30 s+; very slow | infeasible |
| Materialize closure table | 1–10 s, ~32 MB | 30 s–5 min, ~300 MB | minutes, ~250 MB | infeasible |
| Anti-join (orphans/missing) | <100 ms | seconds | seconds | minutes |
| Pivot N×M (cross-product) | OK to ~1K×1K | breaks >2K×2K | breaks >2K×2K | always break |
| All-pairs shortest path | infeasible already | infeasible | infeasible | infeasible |
Bottom line: closure and pivot are the universal scale-breakers; everything else scales linearly with proper indexing.
8. Closure cache — viable strategies (4+ with tradeoffs).
| Strategy | Mechanism | Memory | Hit rate | Invalidation cost | Best for |
|---|---|---|---|---|---|
| Per-start-concept LRU | Cache (concept_id, descendants[]) with LRU eviction | 20–50 MB | High for hot concepts | Cheap | Recommended default |
| Bounded-depth materialized table | Pre-compute closure to depth 3 or 4 only | 5–10× edges count | Excellent for typical UI traversal | Recompute affected subtrees | Static or rarely edited ontologies |
| Full materialized closure | Standard SQL closure table | k × edges (5–15×) | 100% | Catastrophic on edit | Read-only imports |
| Lazy frontier expansion | Compute closure on demand, memoize one BFS frontier at a time | <10 MB | Low; recomputes a lot | None | Mobile / low-end |
| Per-ontology partition + lazy | Each imported ontology gets its own attached sqlite db | bounded per ontology | High | Localized | Recommended for multi-OBO use case |
The partition + per-start-concept LRU + bounded depth combination is what every production ontology browser converges on, and it is what Crosswalker should adopt.
Details
Section titled “Details”Re-run of the Ch 18 model under ontology-web framing
Section titled “Re-run of the Ch 18 model under ontology-web framing”The original Ch 18 calculation (5K controls × 8 frameworks ≈ 40K cells, ≈ 100K mappings as the “comfortable ceiling”) was a cell-count model for GRC. Under ontology-web framing the numerics change but the conclusion holds:
- The dominant Crosswalker workload is mappings between framework elements, not the structure of the source ontologies themselves. A user importing GO to map their domain notes to GO terms cares about ~50 mappings they create, not the 88K internal relations of GO.
- The internal ontology structure becomes a read-only dependency queried via 1-hop expansion, label search, and bounded-depth closure — all of which scale to GO and HPO on plain sqlite-wasm without trouble.
- The 100K-mapping ceiling therefore re-applies as the ceiling on user-authored crosswalk edges, not the count of imported ontology axioms.
Where ontology-web framing does add new pressure
Section titled “Where ontology-web framing does add new pressure”- Imported source-of-truth size. A GRC vault of 5K controls is ~5K rows in
concepts. A single GO import is 50K. A speculative SNOMED import is 370K. - Closure depth distribution. GRC frameworks are shallow (2–4 levels). OBO ontologies are routinely 10–15 levels deep, and SNOMED CT can reach 20+. Bounded-depth caches need a configurable depth, not a hard-coded 4.
- Cycle handling. ChEBI explicitly contains cyclic relationships, GO contains DAG multi-parenting. The recursive CTE must use
UNION(notUNION ALL) or carry a path to terminate.
Browser memory limits — concrete numbers
Section titled “Browser memory limits — concrete numbers”- Chrome wasm32: theoretical 4 GB per WASM instance; practical fragmentation ceiling ~2–3 GB.
- Firefox/Safari: similar 4 GB ceiling.
- Electron 14+: ~8 GB hard cap per process.
- Obsidian: V8 pointer compression / “memory cage” caps the entire renderer at ~4 GB, shared across the editor, all plugins, the file cache, the graph view, etc.
- OPFS quotas: per origin; Chrome typically grants gigabytes but Windows storage-cleanup heuristics have been observed to silently delete OPFS data when free disk is low.
Why we are not adding sqlite-vec, libSQL, Turso, or Limbo
Section titled “Why we are not adding sqlite-vec, libSQL, Turso, or Limbo”- sqlite-vec: ontology-web crosswalks are graph-and-label workloads, not embedding workloads.
- libSQL/Turso/Limbo: rejected in Ch 24 for ecosystem-maturity and Obsidian-distribution reasons.
Tier 3 offload UX (for when users hit the wall)
Section titled “Tier 3 offload UX (for when users hit the wall)”When triggers #6 or #7 fire, the user experience must be:
- A clear “this query exceeds your local budget” message naming the offending primitive.
- A choice to either (a) bound the query, (b) switch the affected ontology to a server-side endpoint, or (c) leave it in browser and accept a long-running cancellable job in a worker.
- No silent crash. Tripwire detection (estimated row counts before execution; abort with friendly error) is mandatory.
Verdict on the original Ch 18 ~100K ceiling: REAFFIRMED
Section titled “Verdict on the original Ch 18 ~100K ceiling: REAFFIRMED”- For user-authored crosswalk mappings: ~100K is the correct desktop comfort ceiling and ~20–30K is the correct mobile ceiling.
- For imported ontology source data: the ceiling is better expressed as ~250K concepts / ~2M edges / ~3M closure rows in a single active query scope (the “Yellow” tier above).
- The ceiling is therefore reaffirmed in spirit, re-expressed in scope-appropriate units, and not revised downward or deferred.
Recommendations
Section titled “Recommendations”Immediate (v0.1.7)
Section titled “Immediate (v0.1.7)”- Ship per-ontology partitioning. Each imported ontology becomes its own logical namespace.
- Add tripwire estimation before any closure or pivot query. Cheap
COUNT(*)against the relevant edge subset; if exceeds threshold (default 250K), abort with friendly error. - Default recursive CTE depth limit to 6 with a user-overridable per-ontology setting.
- Add a per-ontology “closure cache” table (materialized, bounded depth) populated on import for ontologies under the Yellow threshold.
Near-term (v0.1.8–v0.2)
Section titled “Near-term (v0.1.8–v0.2)”- Add an LRU bounded closure cache keyed on
(start_concept_id, depth)with a configurable memory budget. - Implement an explicit “Tier 3 referral” mechanism.
- Mobile mode: detect Obsidian Mobile, halve all memory budgets.
Trigger thresholds (when to revisit this decision)
Section titled “Trigger thresholds (when to revisit this decision)”- Add Tier 3 referral now if any user reports OOM with an OBO-bundle import larger than 250K concepts.
- Re-evaluate vector layer only if SSSOM-style label-similarity mapping suggestions become a top-3 user request and sqlite-vec ships a stable WASM build.
- Re-evaluate Memory64 path when Obsidian’s Electron embeds it by default.
- Force partition-default if any user reports query-cancellation frustration on multi-OBO bundles.
Anti-recommendations (to lock in)
Section titled “Anti-recommendations (to lock in)”- Do not preemptively migrate off sqlite-wasm.
- Do not load full BioPortal or full UMLS.
- Do not add a vector layer in v0.1.x — it is a different product.
- Do not assume desktop budgets on mobile.
Caveats
Section titled “Caveats”- Closure latency benchmarks above are estimates extrapolated from native-SQLite benchmarks and the SNOMED Intl. community reports (which are based on Postgres/MySQL/Oracle, not sqlite-wasm). Browser-side numbers will be 2–5× slower than native sqlite.
- The “Yellow” and “Orange” thresholds are budgets, not measurements.
- Mobile numbers (the ÷5 rule of thumb) are the softest figures here.
- NIST OLIR scale figures were not available as a precise count from NIST itself.
- The SNOMED CT closure size of 6.5M rows scales with SNOMED’s growth and would need updating per release.
- OPFS durability: there is an open Chromium issue (chromium #443206423) about Windows storage-cleanup deleting OPFS data.
- Recursive CTE in a Web Worker is essential but introduces postMessage serialization cost on result return.
- The Obsidian “memory cage” 4 GB renderer cap is shared with everything else in Obsidian. Real Crosswalker budget on a vault with 15+ plugins is closer to 1–1.5 GB, not 4 GB.