
Ch 33: Multi-modal query engine landscape audit (v0.1.7+)


Crosswalker Challenge 33: Multi-Modal Query Engine Landscape Audit (v0.1.7+)

  • Stay on @sqlite.org/sqlite-wasm for v0.1.7. No surveyed engine simultaneously satisfies (a) the Obsidian Mobile / Capacitor constraint (no SharedArrayBuffer, no OPFS sync handles, async-only file I/O), (b) a small WASM bundle, and (c) a permissive license. The ontology-web framing changes scale aspirations but does not change the mobile-execution physics that drove the original Tier 2 commitment.
  • The decoupled vector layer (sqlite-vec) survives the re-audit, but with a known packaging cost. sqlite-vec must be statically compiled into a custom SQLite WASM build (it cannot be dynamically loaded), which means the moment Crosswalker turns on vectors, the substrate stops being stock @sqlite.org/sqlite-wasm. This is the single most concrete migration trigger and it shifts earlier, not later, under ontology-web framing.
  • For ontology-web scale (UMLS = 3.49M concepts, 17.39M atoms in 2025AB; BioPortal = 1,549 ontologies / 15.3M classes / 100M+ mappings; OBO Foundry = hundreds of ontologies) the realistic answer is server-side / out-of-band, not in-vault. Crosswalker’s in-process query engine should target the “imported slice” (≤1M concepts, ≤500K mappings) — a vault scale at which sqlite-wasm with closure tables (not naive recursive CTEs) is competitive. Full-corpus crosswalking belongs behind an HTTP boundary (BioPortal API, Oxigraph-CLI sidecar, or Fuseki) and is out of scope for v0.1.x.

  1. Mobile is the binding constraint, not scale. Capacitor’s WebKit/WebView on iOS and Android does not expose SharedArrayBuffer (Meltdown/Spectre mitigations require COOP/COEP headers that Capacitor’s capacitor://localhost scheme cannot set). This is documented by Capacitor maintainers and reproduced by multiple Obsidian plugin authors (e.g., obsidian-typst, obsidian-note-linker). Stock @sqlite.org/sqlite-wasm requires SAB+OPFS for its high-performance VFS; on Obsidian Mobile it must fall back to in-memory or a non-OPFS VFS. Every WASM-compiled candidate engine inherits the same constraint. Therefore the substrate question is not “which engine is fastest” but “which engine degrades acceptably without SAB.”

  2. Ontology-web framing reframes scale but not the deliverable. UMLS 2025AB is 3,488,973 concepts / 17,390,109 atoms / 190 sources. BioPortal (May 2025) hosts 1,549 ontologies (1,182 public) totalling 15,293,440 terms with 100M+ cross-ontology mappings. OBO Foundry hosts hundreds of ontologies, several with 10K+ classes. None of this fits in browser memory (WASM32 has a 4GB linear-memory ceiling; mobile WebViews realistically tolerate ≤256–512MB before pressure). Crosswalker’s vault must host a subset — the user’s imported ontologies — not the full ontology web.

  3. The closure-query bottleneck is real but solvable in SQLite. Recursive CTEs in SQLite (and DuckDB) materialize intermediate working tables and degrade super-linearly past ~500K nodes. The well-established fix is a closure table (every ancestor-descendant pair pre-computed) or a materialized-path column; both turn O(depth × fanout) recursive walks into single-index range scans. This is a recipe-schema decision, not an engine decision, so it preserves engine-neutrality.

  4. Cozo is the strongest theoretical alternative — and the strongest cautionary tale. It is the only surveyed engine that natively unifies Datalog + relational + graph + HNSW vectors in a WASM build that runs on phones, with HNSW vector search built into Datalog. But it requires nothread + wasm features (single-threaded), its WASM bundle is large (RocksDB optional, Sled discouraged), and the project tempo has slowed. Adopting Cozo would replace one vendor risk (sqlite-wasm, Mozilla-Builders-supported sqlite-vec) with another (a smaller-community Rust project).

  5. DuckDB-WASM + DuckPGQ is attractive for analytic workloads, prohibitive for v0.1 mobile. DuckDB-WASM ships ~3.2MB of compressed WASM for the shell+core extensions, is single-threaded by default (multi-threading needs SAB), and autoloads core extensions (Parquet, JSON, ICU) over the network. DuckPGQ remains explicitly described by its lead author as “a research project and still a work in progress.” Network-autoloaded extensions and a research-grade graph layer both fall short of the v0.1 mobile-readiness bar.

  6. Oxigraph-WASM is the cleanest “real SPARQL in browser” option but is in-memory only. Its WASM build disables RocksDB by default; the in-memory Store class supports SPARQL 1.1 query/update. Reported memory pressure is severe (a GitHub issue documents ~60GB RAM to import a ~1GB Turtle dataset on the server build), and benchmarks suggest the WASM build is competitive with Comunica-class JS engines but well below Java triple stores on bulk loads.

  7. Decoupled vector layer (sqlite-vec) decision still holds, but the static-link requirement reframes “decoupled.” sqlite-vec author Alex Garcia explicitly states: “It’s not possible to dynamically load a SQLite extension into a WASM build of SQLite. So sqlite-vec must be statically compiled into custom WASM builds.” This means turning on vectors in Crosswalker requires forking off @sqlite.org/sqlite-wasm to a custom build (the sqlite-vec-wasm-demo NPM package, or an in-house equivalent). The decoupling is logical (separate query primitive, separate index) but not packaging-decoupled.

  8. No commercial-OSS engine survives the anti-pattern filter. Stardog (proprietary; Free is a 1-year renewable license, not OSS), Datomic (free since 2023 but Cognitect/Nubank-controlled, server-only, JVM-only), GraphDB (Free has a 2-concurrent-query limit and is “commercial & free to use,” not OSS), Neo4j (GPLv3 Community + commercial Enterprise, with active enforcement against AGPL-removal), HelixDB (AGPLv3, viral for Crosswalker plugin distribution), SurrealDB (BSL with a 4-year delay to Apache 2.0). All are excluded under Crosswalker’s anti-pattern 4 (vendor concentration), for license incompatibility with Obsidian plugin distribution, or both.
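
Finding 3 contrasts recursive walks with a pre-computed closure table. The sketch below illustrates that idea in memory, under stated assumptions: the `Edge` and `ClosureRow` shapes and all function names are hypothetical, not Crosswalker's schema. In SQLite the same rows would sit in a table indexed on the ancestor column, and `descendantsOf` would become a single indexed `SELECT`.

```typescript
// Hypothetical shapes for illustration; not Crosswalker's actual schema.
type Edge = { parent: string; child: string };
type ClosureRow = { ancestor: string; descendant: string; depth: number };

// Pre-compute every (ancestor, descendant, depth) pair of a hierarchy.
// When a descendant is reachable along several paths, BFS keeps the shortest depth.
function buildClosure(edges: Edge[]): ClosureRow[] {
  const children = new Map<string, string[]>();
  const nodes = new Set<string>();
  for (const { parent, child } of edges) {
    nodes.add(parent);
    nodes.add(child);
    const list = children.get(parent) ?? [];
    list.push(child);
    children.set(parent, list);
  }

  const rows: ClosureRow[] = [];
  nodes.forEach((root) => {
    const depthOf = new Map<string, number>([[root, 0]]); // self-pair at depth 0
    const queue: string[] = [root];
    // for...of over an array also visits items pushed during iteration.
    for (const current of queue) {
      const depth = depthOf.get(current) ?? 0;
      for (const next of children.get(current) ?? []) {
        if (!depthOf.has(next)) {
          depthOf.set(next, depth + 1);
          queue.push(next);
        }
      }
    }
    depthOf.forEach((depth, descendant) => {
      rows.push({ ancestor: root, descendant, depth });
    });
  });
  return rows;
}

// Stand-in for a SQL index on the ancestor column.
function indexByAncestor(rows: ClosureRow[]): Map<string, ClosureRow[]> {
  const index = new Map<string, ClosureRow[]>();
  for (const row of rows) {
    const bucket = index.get(row.ancestor) ?? [];
    bucket.push(row);
    index.set(row.ancestor, bucket);
  }
  return index;
}

// Query time: one keyed lookup plus a depth filter, no recursion.
function descendantsOf(
  index: Map<string, ClosureRow[]>,
  conceptId: string,
  maxDepth: number = Number.POSITIVE_INFINITY,
): string[] {
  return (index.get(conceptId) ?? [])
    .filter((row) => row.depth > 0 && row.depth <= maxDepth)
    .map((row) => row.descendant)
    .sort();
}

const closure = buildClosure([
  { parent: "a", child: "b" },
  { parent: "b", child: "c" },
  { parent: "a", child: "d" },
]);
const byAncestor = indexByAncestor(closure);
console.log(closure.length); // 8 rows: 4 for a, 2 for b, 1 each for c and d
console.log(descendantsOf(byAncestor, "a")); // [ 'b', 'c', 'd' ]
console.log(descendantsOf(byAncestor, "a", 1)); // [ 'b', 'd' ]
```

The trade-off is the usual one: the walk is paid once at import time and stored, so reads stay cheap on a single-threaded mobile WebView at the cost of a larger vault database.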


SECTION 1 — Engine Survey Matrix (16 engines × 8 dimensions)

| # | Engine | Scale ceiling | Query model | Multi-paradigm | Embedded/WASM | Streaming/OOC | License | Mobile (Capacitor) | Vendor risk |
|---|---|---|---|---|---|---|---|---|---|
| 1 | DuckDB + DuckPGQ | Billions of rows on disk; ~10–100M tractable in WASM with OPFS | SQL + SQL/PGQ (SQL:2023), recursive CTEs, USING KEY | Relational + columnar + property-graph extension; vector via community ext | WASM build (~3.2MB compressed shell), single-threaded by default; multi-thread needs SAB | Out-of-core spill native; vectorized push-based pipeline | MIT | Poor: multi-threading needs SAB; OPFS works on desktop but not mobile WebView | Low (DuckDB Foundation, broad adoption) |
| 2 | Polars | 100M+ rows in memory; sink_parquet for OOC | Polars expression DSL, lazy plan; SQL via context | Columnar/Arrow; not graph; no native vector | Rust+Python+JS; no production browser WASM build for full engine | New streaming engine (1.31+) handles >RAM via batch; sink_* writes incrementally | MIT | Not viable: no first-class browser/mobile target | Low (Polars team, NL VC-backed) |
| 3 | Cozo | OLTP ~1.6M rows at 100K QPS on Mac Mini; OLAP ~1s on similar; mobile-embedded supported | Datalog (CozoScript) | Relational + graph + HNSW vectors + time-travel | WASM build with wasm + nothread features; phone-embedded supported; SQLite/RocksDB/in-mem backends | RocksDB backend OOC; in-mem bounded | MPL-2.0 | Possible: runs in browser WASM; mobile via React-Native bindings; but COOP/COEP still needed for full perf | Medium: small core team, slowed tempo |
| 4 | Oxigraph-WASM | In-memory only in WASM (RocksDB disabled); empirically <1M quads comfortable, several M strained | SPARQL 1.1 Query/Update; Federated Query | Pure RDF triple/quad store | Rust→WASM via wasm-bindgen (Apache-2.0 / MIT); NPM package; SPARQL 1.2 + RDF 1.2 in 0.5+ | None in WASM (in-memory) | Apache-2.0 / MIT | Plausible: pure WASM, no SAB required for in-memory mode | Low (independent maintainer Tpt, used by Wikidata-adjacent projects) |
| 5 | Stardog | 50B triples single-node | SPARQL + path queries + GraphQL + virtual graphs | RDF + virtualization + ML | JVM only (server) | Disk-backed | Proprietary (Free = 1-yr renewable, not OSS) | Not viable | High: single vendor, paywall |
| 6 | RDF4J | Tens of billions of triples (LMDB/Native store); enterprise scale via GraphDB | SPARQL 1.1, SeRQL legacy, SHACL | RDF + SHACL + GeoSPARQL | JVM only; no WASM | Native store paged from disk | EDL 1.0 (Eclipse) | Not viable | Low (Eclipse Foundation) |
| 7 | Apache Jena + Fuseki | TDB2 to 10B+ triples; Fuseki HTTP | SPARQL 1.1 | RDF + OWL inference + Lucene text | JVM only | TDB2 disk-paged | Apache-2.0 | Not viable in-process; viable as sidecar | Low (Apache Foundation) |
| 8 | Materialize | Billions of rows; cluster-scale | PostgreSQL-dialect SQL with IVM | Streaming SQL + IVM (differential dataflow) | Server only: Rust binary, K8s | Differential dataflow native | BSL → Apache 2.0 | Not viable in-process | Medium (Materialize Inc., commercial-OSS) |
| 9 | Datomic | Multi-billion datoms (Nubank: 2.5B txns/day) | Datalog + Pull | Immutable log + indexes; entity-attribute-value | JVM peer/client; Datomic Local is Apache-2.0 lib (still JVM) | Disk-backed | Apache-2.0 since 2023, but JVM-only and Nubank/Cognitect-controlled | Not viable | Medium-high: single steward; no browser path |
| 10 | ClickHouse | Multi-TB+, embeddings supported via vector_similarity index (BFloat16 quantization) | SQL (ClickHouse dialect) | Columnar OLAP + vectors + text | Server-first; embedded mode emerging (clickhouse-local) but not browser WASM | Disk-backed; multi-TB embeddings | Apache-2.0 | Not viable in-process | Low (broad adoption, Apache 2.0) |
| 11 | LanceDB | 200M+ vectors in production; billion-scale targeted | Lance SQL subset + vector ops | Lance columnar format + vectors + FTS | Embedded Rust + Python/TS; no first-class browser WASM target documented | Disk-based indexes; cloud-native | Apache-2.0 | Not viable today; potential future via WASM compilation | Low (Apache 2.0; commercial cloud) |
| 12 | TerminusDB | “In-memory graph DBMS” on succinct datastructures + delta encoding (git-for-data); reasonable up to tens of millions of triples | WOQL Datalog + GraphQL + REST | RDF + JSON-LD documents + git-style versioning | Server only (Prolog + Rust); no browser WASM | In-memory layered store | Apache-2.0 (since v10); maintained by DFRNT since 2025 | Not viable in-process; possible as sidecar | Medium: small steward (DFRNT), narrow community |
| 13 | Neo4j | Hundreds of billions of nodes (enterprise) | Cypher / GQL (ISO) | Native property graph | JVM server; no WASM | Disk-backed | GPLv3 (Community) + commercial Enterprise, with PureThink lawsuit precedent | Not viable | High: license risk for plugin redistribution |
| 14 | GraphDB | Tens of billions of triples (Standard/Enterprise); 50–250M (DBaaS); Free has 2-concurrent-query cap | SPARQL + SHACL + RDF inference | RDF + OWL2-RL | JVM server | Disk-backed | Commercial-free (Free is “commercial & free”, not OSS); SE/EE paid | Not viable | High: single vendor (Ontotext) |
| 15 | HelixDB | Pre-1.0; OLTP graph+vector via LMDB; uses HelixQL | Compiled HelixQL (Gremlin-influenced) | Graph + vector + KV + relational | Rust + Python/TS clients; no browser WASM target | LMDB-backed | AGPLv3 (viral for Crosswalker plugin) | Not viable | High: pre-1.0, viral license, single startup |
| 16 | SurrealDB | Distributed scale; v3.0 includes vector + graph + document + time-series in one ACID transaction | SurrealQL | Document + graph + vector + time-series + KV | Embedded Rust + WASM possible; full feature requires server | TiKV / SurrealKV / RocksDB backends | BSL 1.1 → Apache 2.0 after 4 years (per release) | Possible technically; license is non-OSS during BSL window | Medium-high: already rejected per Ch 24 |

| Engine | Small (10K / 5K) | Medium (100K / 50K) | Large (1M / 500K, OLIR-scale) | Ontology-web (10M+, UMLS-scale) |
|---|---|---|---|---|
| sqlite-wasm (current) | <50ms/closure with index; trivial | 200–500ms naive recursive CTE; <50ms with closure table | 5–30s naive; 100–500ms with closure table + materialized path; vault DB ~200–500MB | Breaks: closure-table size grows quadratically; OPFS quota/perf becomes binding; JS heap pressure; single-threaded |
| DuckDB-WASM + PGQ | Trivial; arguably overkill | 50–200ms (vectorized) | 500ms–2s with USING KEY recursive CTE; OOC spill works | Plausible at 1–10M with disk; >10M unreliable in WASM (4GB linear memory cap) |
| Polars (Node) | Trivial | Sub-second | Streaming engine handles 1M+ relational; no native graph closure, must implement iteratively | Possible for tabular crosswalk export; closure must be hand-rolled |
| Cozo (WASM) | Trivial; Datalog magic-set rewrites | 100–500ms | ~1–5s for transitive closure (Datalog semi-naive eval) | Plausible at 1–10M with RocksDB backend on desktop; WASM in-mem bounded |
| Oxigraph-WASM | Trivial | Acceptable | Strained at 1M quads in WASM in-mem; SPARQL property paths work | Not viable in WASM; native build viable |
| LanceDB | Trivial vectors | 100K vectors, no index needed (<100ms brute force per docs) | 1M+ with disk-based index | 200M+ vectors documented in production |
| ClickHouse | Overkill | Overkill | Multi-TB embeddings supported with vector_similarity index | Native fit; not browser-runnable |
| Apache Jena/Fuseki (sidecar) | Overkill | Easy | Easy on JVM | Easy at 100M+ triples on commodity server |
| GraphDB Free | Easy | Easy | Easy (Free has no triple cap, only concurrent-query cap of 2) | Tens of billions claimed; license non-OSS |
| Materialize / Datomic / Stardog / Neo4j / TerminusDB / HelixDB / SurrealDB | Server-only: N/A in-vault | Server-only | Server-only | Server-only |

Tier breakpoints:

  • Small (≤10K concepts): Every engine works. Decision driven by mobile / packaging, not perf.
  • Medium (≤100K): sqlite-wasm + closure table is sufficient. Vector becomes the differentiator.
  • Large (≤1M, ~OLIR-scale): sqlite-wasm + closure table + materialized path is still sufficient; DuckDB or Cozo would be 2–5× faster but require migration.
  • Ontology-web (≥10M): All in-process WASM engines break. This tier requires a server-side companion (Fuseki/Oxigraph-CLI/GraphDB) or cloud API (BioPortal). Not v0.1 territory.
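
As a rough check on why closure tables stop fitting at the ontology-web tier, the row count can be estimated from the hierarchy's shape. This is a back-of-envelope sketch, not a measurement: a closure table holds one row per (ancestor, descendant) pair, so with $N$ concepts and $a_i$ ancestors for concept $i$ (counting the concept itself), the size $|C|$ is

```latex
\[
  |C| \;=\; \sum_{i=1}^{N} a_i \;=\; N \cdot \bar{a},
  \qquad
  N \;\le\; |C| \;\le\; \frac{N(N+1)}{2}
\]
```

The quadratic upper bound is reached only by a single chain; for shallow, bushy hierarchies the size is driven by the mean ancestor count $\bar{a}$. Taking $\bar{a} = 10$ purely as an illustrative assumption, $N = 10^6$ gives about $10^7$ rows and $N = 10^7$ gives about $10^8$ rows, before any index overhead.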

SECTION 3 — Vector Layer + Multi-Modal Composition


Verdict: keep the decoupled vector layer; accept the static-link cost as a known v0.1.x packaging liability.

  • sqlite-vec: Pure C, no deps, 32-bit/8-bit/binary vector types, vec0 virtual table; runs in WASM but must be statically compiled in.
  • LanceDB: No first-class browser WASM.
  • DuckDB + vector extensions: Compelling but inherits DuckDB’s bundle and SAB requirement.
  • ClickHouse + vector_similarity: Server-side only.
  • Chroma/Qdrant/Faiss-WASM: Not viable in-process.

SECTION 4 — Migration Trigger Updates (revised)

| # | Original trigger (Ch 24) | Ontology-web reframe | Urgency |
|---|---|---|---|
| 1 | Vector extension packaging | sqlite-vec must be statically linked into WASM build | Increased. When vectors land, fork the WASM build. |
| 2 | WASM bundle size | Stock sqlite-wasm is fine; DuckDB-WASM is too big for v0.1.x | Unchanged for sqlite-wasm; blocker for DuckDB/Cozo path. |
| 3 | Closure query latency | At ≤1M concepts with closure-table schema, sqlite-wasm is sub-second | Reduced for vault-scale; decisive for ontology-web (forces sidecar). |
| 4 | Mobile / low-end performance | Mobile constraint hardens | Increased. Need explicit mobile-mode query throttle. |
| 5 | Federation requirement | Cross-ontology query at ontology-web scale forces federation | Increased. Plan an HTTP/SPARQL federation lane. |

New trigger surfaced by this audit:

  • Trigger 6 — sqlite-vec-wasm packaging stability. Track upstream sqlite-vec-wasm-demo package status; if it stagnates, build in-house.

SECTION 5 — Substrate-Neutral Architecture Verification


The three named primitives (getConceptsByOntology, crosswalkBetween, closureFromConcept) are conceptually substrate-neutral; their implementations are not.
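
One way to picture the split between neutral primitives and engine-specific implementations is an interface with swappable backends. In the sketch below only the three primitive names come from this audit; the `Concept` and `Mapping` types, the method signatures, and the choice to read "closure" as transitive descendants are all assumptions made for illustration.

```typescript
// Only the three primitive names come from the audit; all types and signatures are assumptions.
type Concept = { id: string; ontology: string };
type Mapping = { fromConcept: string; toConcept: string };

interface QuerySubstrate {
  getConceptsByOntology(ontology: string): Promise<Concept[]>;
  crosswalkBetween(fromOntology: string, toOntology: string): Promise<Mapping[]>;
  closureFromConcept(conceptId: string): Promise<string[]>;
}

// Toy in-memory backend; a SQLite-backed class would implement the same interface.
class InMemorySubstrate implements QuerySubstrate {
  private readonly concepts: Concept[];
  private readonly mappings: Mapping[];
  private readonly children: Map<string, string[]>; // parent id -> child ids

  constructor(concepts: Concept[], mappings: Mapping[], children: Map<string, string[]>) {
    this.concepts = concepts;
    this.mappings = mappings;
    this.children = children;
  }

  async getConceptsByOntology(ontology: string): Promise<Concept[]> {
    return this.concepts.filter((c) => c.ontology === ontology);
  }

  async crosswalkBetween(fromOntology: string, toOntology: string): Promise<Mapping[]> {
    const ontologyOf = new Map<string, string>();
    for (const c of this.concepts) ontologyOf.set(c.id, c.ontology);
    return this.mappings.filter(
      (m) =>
        ontologyOf.get(m.fromConcept) === fromOntology &&
        ontologyOf.get(m.toConcept) === toOntology,
    );
  }

  // Assumption: "closure" here means all transitive descendants, excluding the start concept.
  async closureFromConcept(conceptId: string): Promise<string[]> {
    const seen = new Set<string>();
    const stack = [...(this.children.get(conceptId) ?? [])];
    while (stack.length > 0) {
      const current = stack.pop();
      if (current === undefined || seen.has(current)) continue;
      seen.add(current);
      stack.push(...(this.children.get(current) ?? []));
    }
    seen.delete(conceptId); // guards against cyclic input data
    return Array.from(seen).sort();
  }
}

const substrate: QuerySubstrate = new InMemorySubstrate(
  [
    { id: "A:1", ontology: "A" },
    { id: "A:2", ontology: "A" },
    { id: "B:9", ontology: "B" },
  ],
  [{ fromConcept: "A:2", toConcept: "B:9" }],
  new Map([["A:1", ["A:2"]]]),
);
substrate.closureFromConcept("A:1").then((ids) => console.log(ids)); // [ 'A:2' ]
```

Every method returns a `Promise` so that an async-only mobile backend and a synchronous desktop backend can sit behind the same signature.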

Conclusion: The primitives are abstractable. The risk is recipe authors writing engine-specific SQL inside recipes. Recommendation: lock recipes to a small declarative subset (named primitive calls + JSON-shaped predicates).
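
The recommended lock-down could be enforced mechanically. The sketch below is a hypothetical lint, assuming a recipe is a JSON object with a `steps` array in which each step carries only a `primitive` name and a JSON `args` object. That recipe shape is an assumption made for illustration, not the project's actual format.

```typescript
// Recipe shape is an assumption; only the three primitive names come from the audit.
const SANCTIONED_PRIMITIVES: ReadonlySet<string> = new Set([
  "getConceptsByOntology",
  "crosswalkBetween",
  "closureFromConcept",
]);
const ALLOWED_STEP_KEYS: ReadonlySet<string> = new Set(["primitive", "args"]);

function isPlainObject(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Returns human-readable problems; an empty list means the recipe passes.
function lintRecipe(recipe: unknown): string[] {
  if (!isPlainObject(recipe)) return ["recipe must be an object"];
  const steps = recipe["steps"];
  if (!Array.isArray(steps)) return ["recipe.steps must be an array"];

  const problems: string[] = [];
  steps.forEach((step: unknown, index: number) => {
    if (!isPlainObject(step)) {
      problems.push(`step ${index}: must be an object`);
      return;
    }
    const primitive = step["primitive"];
    if (typeof primitive !== "string" || !SANCTIONED_PRIMITIVES.has(primitive)) {
      problems.push(`step ${index}: unknown primitive ${JSON.stringify(primitive)}`);
    }
    if (!isPlainObject(step["args"])) {
      problems.push(`step ${index}: args must be a JSON object`);
    }
    // Any extra field (for example a raw "sql" string) is rejected outright.
    for (const key of Object.keys(step)) {
      if (!ALLOWED_STEP_KEYS.has(key)) {
        problems.push(`step ${index}: field "${key}" is not allowed`);
      }
    }
  });
  return problems;
}

const okRecipe = { steps: [{ primitive: "closureFromConcept", args: { conceptId: "A:1" } }] };
const badRecipe = {
  steps: [
    { primitive: "closureFromConcept", args: {}, sql: "SELECT * FROM concept_closure" },
    { primitive: "runSql", args: {} },
  ],
};
console.log(lintRecipe(okRecipe)); // []
console.log(lintRecipe(badRecipe).length); // 2
```

An allow-list of fields is used instead of scanning strings for SQL keywords, because keyword matching tends to produce both false positives and easy bypasses.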

SECTION 6 — Re-audit of Prior Commitments

| Chapter | Original verdict | Ontology-web re-audit | Status |
|---|---|---|---|
| Ch 10 (graph→tabular bridging) | Junction-table model with closure table | At ontology-web scale, native RDF retains advantages; at vault scale (≤1M), tabular wins on mobile | REAFFIRMED for vault scale; DEFERRED for ontology-web sidecar |
| Ch 11 (Tier 2/3 engine survey) | sqlite-wasm Tier 2; Fuseki Tier 3 | Oxigraph-CLI is now stronger as Tier-3 default than Fuseki | REVISED |
| Ch 12 (Datalog vs SQL) | SQL chosen for recipe surface | SQL+CTE good enough at vault scale | REAFFIRMED |
| Ch 14 (missed engines) | (Various adds) | DuckPGQ research; HelixDB AGPL/pre-1.0; Materialize server-only; LanceDB no browser WASM | REAFFIRMED |
| Ch 16 (Tier 3 reconsideration) | Fuseki default | Oxigraph-CLI sidecar is now a peer option | REVISED |
| Ch 18 (Tier 2-lite scope) | Bounded primitives | Strengthens the case for bounded primitives | REAFFIRMED |
| Ch 24 (Turso/libSQL evaluation) | Rejected | Anti-pattern 3 holds | REAFFIRMED |

Stage 1 — v0.1.7 (commit now):

  1. Stay on @sqlite.org/sqlite-wasm. No substrate change.
  2. Adopt closure-table + materialized-path schema for the concept_closure relation.
  3. Codify the three query primitives as the only sanctioned recipe entry points.
  4. Defer sqlite-vec to v0.1.8 or v0.1.9.
  5. Document the mobile constraint explicitly.

Stage 2 — v0.1.8 (when triggered):

  6. When vectors turn on, fork the WASM build.
  7. Add an Oxigraph-CLI sidecar profile as an alternative Tier-3 alongside Fuseki.

Stage 3 — v0.2 (when triggered):

  8. Add a federation lane for cross-ontology queries that exceed vault scale.
  9. Re-evaluate Cozo annually.

Concrete next actions for v0.1.7 milestone:

  • Land closure-table migration in concept_closure schema; update all three primitives.
  • Add a MobileMode: { maxClosureDepth, maxRowsPerQuery, asyncOnly: true } config and enforce in plugin.queryClosure().
  • Add an architecture-decision-record (ADR) titled “ADR-33: Substrate stays sqlite-wasm; vectors deferred; mobile is the binding constraint.”
  • Add a recipe-lint rule: reject recipes containing raw SQL outside primitive parameters.
  • Add CI step: micro-benchmark closureFromConcept on a 100K-concept synthetic vault.
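
The MobileMode action above could be enforced with a small clamp in front of the closure primitive. In this sketch the three field names come from the action item; the default values, the `ClosureRequest` shape, and the `queryClosure` signature are assumptions, since the audit does not specify them.

```typescript
// Field names come from the audit; default values and clamping behaviour are assumptions.
type MobileMode = { maxClosureDepth: number; maxRowsPerQuery: number; asyncOnly: true };
type ClosureRequest = { conceptId: string; depth: number; limit: number };
type ClosureRunner = (request: ClosureRequest) => Promise<string[]>;

const EXAMPLE_MOBILE_MODE: MobileMode = {
  maxClosureDepth: 6,
  maxRowsPerQuery: 2000,
  asyncOnly: true,
};

// Clamp a request to the mobile budget; pass null on desktop to leave it untouched.
function applyMobileMode(request: ClosureRequest, mode: MobileMode | null): ClosureRequest {
  if (mode === null) return request;
  return {
    conceptId: request.conceptId,
    depth: Math.min(request.depth, mode.maxClosureDepth),
    limit: Math.min(request.limit, mode.maxRowsPerQuery),
  };
}

// Hypothetical wrapper in the spirit of plugin.queryClosure(): always async,
// and truncates whatever the substrate returns to the clamped row limit.
async function queryClosure(
  request: ClosureRequest,
  mode: MobileMode | null,
  run: ClosureRunner,
): Promise<string[]> {
  const bounded = applyMobileMode(request, mode);
  const rows = await run(bounded);
  return rows.slice(0, bounded.limit);
}

const clamped = applyMobileMode(
  { conceptId: "A:1", depth: 50, limit: 100000 },
  EXAMPLE_MOBILE_MODE,
);
console.log(clamped); // { conceptId: 'A:1', depth: 6, limit: 2000 }
```

Clamping the request before it reaches the substrate keeps the limit engine-neutral: the same guard applies whether the runner is sqlite-wasm today or a different backend later.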

  • Mobile constraint sources: Multiple Obsidian plugin authors and Capacitor maintainers report SharedArrayBuffer unavailable on Capacitor.
  • DuckPGQ status: Research project / community extension; “the SQL/PGQ syntax requires a - at the start of the query when building from source, otherwise you will experience a segmentation fault.”
  • Oxigraph WASM memory profile: The 60GB-RAM-for-1GB-Turtle bug report is for the server build with RocksDB.
  • BioPortal/UMLS/OBO scale numbers are 2025 figures.
  • License risk for plugin redistribution: Linking AGPL code (HelixDB, some Neo4j components) into a redistributed plugin triggers AGPL obligations.
  • DuckDB recursive CTE performance evidence is mixed.
  • Datomic Local under Apache 2.0 is interesting but JVM-only.
  • Anti-pattern compliance was strictly applied.