Challenge 34: Streaming / chunked query execution at ontology-web scale

Crosswalker’s v0.1.4.5 streaming refactor handled IMPORT-side streaming: large CSVs are processed chunk-by-chunk into Tier 1 files. The QUERY side does NOT stream — plugin.queryClosure() runs a recursive CTE that materializes the entire result set in memory. At small-vault scale (current GRC bound) this is fine. At ontology-web scale (BioPortal: 700+ ontologies; UMLS: 3.5M concepts; OBO Foundry: hundreds × thousands), it isn’t.

The user’s specific concern: “How does merging scale chunk-by-chunk to not blow memory?” This is a streaming / iterative-merge question that prior challenges never addressed.

| Asset | What it gives us |
| --- | --- |
| v0.1.4.5 streaming refactor | IMPORT-side streaming pattern |
| Tier 2 sqlite-wasm sidecar | In-memory database; OPFS-persisted; recursive CTE for closure |
| plugin.queryClosure() | Currently materializes full closure in memory |
| Closure cache | Lazy materialization; cached results don’t re-execute, but the cached result IS in memory |

For each of the following, document the streaming model:

  • DuckDB out-of-core / OOC — spill-to-disk for result sets exceeding RAM
  • Polars streaming (https://docs.pola.rs/user-guide/concepts/streaming/) — lazy evaluation + chunked execution; pipeline operators
  • Apache DataFusion — chunked execution over Arrow record batches
  • Apache Arrow Flight — streaming protocol for analytical query results
  • Materialize — incremental view maintenance (results recompute only what changed)
  • Datomic — immutable indexes; queries iterate rather than materialize
  • ClickHouse — streaming aggregates over massive tables
  • CRDT-based DBs (Yjs, Automerge) — eventually-consistent merging
  • Iterator / generator patterns — JS / TS streaming idioms

For each: how does it compose with relational + graph + vector primitives? At what scale does it pay off?
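As a baseline for the last survey item, here is a minimal sketch of the generator idiom in TypeScript. The `Row` shape and the `rows` source are hypothetical stand-ins, not Crosswalker's schema; the point is only that a consumer can pull fixed-size chunks from a lazy source without the full result ever existing in memory.

```typescript
// Hypothetical row shape; not Crosswalker's actual schema.
interface Row {
  id: number;
  label: string;
}

// Yield fixed-size chunks from any iterable without materializing it.
function* chunked<T>(source: Iterable<T>, size: number): Generator<T[]> {
  if (size <= 0) throw new RangeError("chunk size must be positive");
  let buf: T[] = [];
  for (const item of source) {
    buf.push(item);
    if (buf.length === size) {
      yield buf;
      buf = [];
    }
  }
  if (buf.length > 0) yield buf; // trailing partial chunk
}

// A lazy source: rows are produced on demand, never all at once.
function* rows(n: number): Generator<Row> {
  for (let i = 0; i < n; i++) yield { id: i, label: `concept-${i}` };
}

const sizes: number[] = [];
for (const chunk of chunked(rows(10), 4)) sizes.push(chunk.length);
// sizes is now [4, 4, 2]
```

Peak memory here is one chunk plus whatever the consumer retains, which is the property the survey should check each engine against.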

2. Ontology-web specific streaming patterns

Some queries don’t have a natural chunked execution. Specifically:

  • Closure (transitive reachability) — depth-first vs breadth-first; can it stream incrementally as the frontier expands?
  • Pivot — group-by-two-axes; does it require materializing the full input or can it stream by pre-sorting?
  • Anti-join — “X without Y” — needs both sides; can it stream by hashing?
  • Multi-ontology join — joining concepts across 5+ ontologies; can it stream by ontology?
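For the closure question, one possible shape is a breadth-first traversal that emits each frontier as it is discovered. The sketch below assumes a hypothetical in-memory adjacency map with placeholder IDs; it is not how `plugin.queryClosure()` works today, which the brief describes as a recursive CTE.

```typescript
// Hypothetical adjacency map: concept ID -> directly mapped concept IDs.
type Graph = Map<string, string[]>;

// Breadth-first closure that yields one frontier (hop level) at a time.
// Memory held: the visited set plus the current frontier, not a result table.
function* closureByFrontier(
  graph: Graph,
  start: string,
  maxHops: number,
): Generator<{ hop: number; ids: string[] }> {
  const visited = new Set<string>([start]);
  let frontier: string[] = [start];
  for (let hop = 1; hop <= maxHops && frontier.length > 0; hop++) {
    const next: string[] = [];
    for (const id of frontier) {
      for (const neighbor of graph.get(id) ?? []) {
        if (!visited.has(neighbor)) {
          visited.add(neighbor);
          next.push(neighbor);
        }
      }
    }
    if (next.length > 0) yield { hop, ids: next };
    frontier = next;
  }
}

// Placeholder graph with a cycle (C points back to A).
const demoGraph: Graph = new Map([
  ["A", ["B", "C"]],
  ["B", ["C", "D"]],
  ["C", ["A"]],
  ["D", ["E"]],
]);

const levels: string[] = [];
for (const level of closureByFrontier(demoGraph, "A", 3)) {
  levels.push(`${level.hop}:${level.ids.join(",")}`);
}
// levels is now ["1:B,C", "2:D", "3:E"]
```

The output streams, but the visited set still grows with the size of the closure, so under this approach closure lands in the requires-buffer category, not the fully streamable one.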

For each query primitive, characterize: streamable? requires-buffer? requires-materialization?
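To illustrate what a requires-buffer classification looks like, here is a hedged sketch of a hash-based anti-join. It assumes the match condition is equality on a single string key; the row shapes and IDs are placeholders.

```typescript
// Anti-join: "left rows with no match on the right", by equality on a key.
// The right side is buffered as a key set (requires-buffer);
// the left side streams through one row at a time.
function* antiJoin<L, R>(
  left: Iterable<L>,
  right: Iterable<R>,
  leftKey: (row: L) => string,
  rightKey: (row: R) => string,
): Generator<L> {
  const seen = new Set<string>();
  for (const r of right) seen.add(rightKey(r)); // build phase
  for (const l of left) {
    if (!seen.has(leftKey(l))) yield l; // probe phase
  }
}

// Placeholder data: subjects, and equivalence rows that reference some of them.
const subjects = [{ id: "s1" }, { id: "s2" }, { id: "s3" }];
const equivalents = [{ subjectId: "s2" }];

const unmatched: string[] = [];
for (const row of antiJoin(subjects, equivalents, (s) => s.id, (e) => e.subjectId)) {
  unmatched.push(row.id);
}
// unmatched is now ["s1", "s3"]
```

Memory scales with the number of distinct keys on the buffered side, so the smaller side should be the one that is built. A predicate that is not a key equality cannot use this trick, which is one way to characterize where streaming stops being possible.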

3. Crosswalker-specific streaming patterns

Apply the survey to Crosswalker’s actual queries. For each of these typical ontology-web queries, design a streaming execution path:

  • Closure of MITRE ATT&CK techniques reachable from a NIST 800-53 control via 3+ hops
  • Coverage matrix for NIST CSF × NIST 800-53 × ISO 27001 × CIS Controls (4-way pivot)
  • Anti-join: SKOS subjects with no LCSH equivalent
  • OBO Foundry: gene-ontology terms with no MONDO disease mapping (anti-join across 100K+ concepts)

For each: chunked execution plan, memory profile, worst-case latency at ontology-web scale.
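One candidate plan for the coverage-matrix query is a streaming pivot over input that arrives already sorted by the row axis. The sketch below assumes that ordering is provided upstream (for example by an ORDER BY in the query) and uses a hypothetical `Cell` shape with placeholder keys.

```typescript
// Hypothetical mapping row: one cell of the coverage matrix.
interface Cell {
  rowKey: string;
  colKey: string;
}

// Streaming pivot over input that is ALREADY sorted by rowKey.
// Only the current row group is held in memory.
function* pivotSorted(
  cells: Iterable<Cell>,
): Generator<{ rowKey: string; counts: Record<string, number> }> {
  let current: string | undefined;
  let counts: Record<string, number> = {};
  for (const cell of cells) {
    if (current !== undefined && cell.rowKey !== current) {
      yield { rowKey: current, counts }; // row group finished
      counts = {};
    }
    current = cell.rowKey;
    counts[cell.colKey] = (counts[cell.colKey] ?? 0) + 1;
  }
  if (current !== undefined) yield { rowKey: current, counts };
}

// Placeholder input, sorted by rowKey.
const cells: Cell[] = [
  { rowKey: "r1", colKey: "x" },
  { rowKey: "r1", colKey: "y" },
  { rowKey: "r1", colKey: "x" },
  { rowKey: "r2", colKey: "y" },
];

const pivoted: string[] = [];
for (const row of pivotSorted(cells)) pivoted.push(JSON.stringify(row));
// pivoted[0] is {"rowKey":"r1","counts":{"x":2,"y":1}}
// pivoted[1] is {"rowKey":"r2","counts":{"y":1}}
```

The memory profile is one row group at a time, but the cost moves into the sort: if the engine cannot produce sorted rows cheaply, the pivot is back to needing a buffer.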

sqlite-wasm runs in browser/Electron. Streaming faces specific WASM constraints:

  • No threads (without SharedArrayBuffer); can’t parallelize chunks across cores
  • Memory-bounded (a single linear memory; 32-bit WASM can address at most 4 GiB, and engines may allow less)
  • I/O happens through async OPFS or main-thread bridges

What streaming patterns work in WASM specifically? What’s blocked by WASM constraints? When do we have to drop to native (non-WASM, server-side) execution?
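Without threads, one pattern worth evaluating is cooperative time-slicing: process one chunk, hand control back to the host, then continue. The sketch below is a generic illustration under that assumption; `drainCooperatively` is a hypothetical helper, and `setTimeout(…, 0)` stands in for whatever yielding mechanism the host actually offers.

```typescript
// Drain a synchronous chunk source while yielding to the event loop
// between chunks, so a single-threaded host can keep handling UI events.
async function drainCooperatively<T>(
  chunks: Iterable<T[]>,
  onChunk: (chunk: T[]) => void,
): Promise<number> {
  let total = 0;
  for (const chunk of chunks) {
    onChunk(chunk);
    total += chunk.length;
    // Hand control back to the host before pulling the next chunk.
    await new Promise<void>((resolve) => setTimeout(resolve, 0));
  }
  return total;
}

// Usage: each chunk is delivered separately; the promise resolves to the row count.
const received: number[][] = [];
const draining = drainCooperatively([[1, 2], [3], [4, 5]], (chunk) => {
  received.push(chunk);
});
```

This only bounds how long each slice blocks; it does not add parallelism, and a single slow chunk still stalls the thread for its full duration.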

On Obsidian Mobile, RAM is even more constrained. Streaming becomes both more important and more constrained.

What’s the mobile streaming story? Is it “queries above N rows fail on mobile, with graceful error message”?
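One way to make the "above N rows" option concrete is a row budget that stops pulling from a lazy source and returns a partial result with a message, with no thrown error. This is a sketch only: the budget value, the message wording, and the `BudgetedResult` shape are all assumptions.

```typescript
// Result shape for a budgeted query: partial rows plus a user-facing message.
interface BudgetedResult<T> {
  rows: T[];
  truncated: boolean;
  message: string | null;
}

// Collect rows until the budget is reached, then stop pulling from the source.
function collectWithBudget<T>(source: Iterable<T>, budget: number): BudgetedResult<T> {
  const rows: T[] = [];
  for (const row of source) {
    if (rows.length >= budget) {
      // Returning here closes a generator source, so no further rows are produced.
      return {
        rows,
        truncated: true,
        message: `Showing the first ${budget} rows; the full result is too large for this device.`,
      };
    }
    rows.push(row);
  }
  return { rows, truncated: false, message: null };
}

// A lazy source that records how many rows it was actually asked to produce.
let produced = 0;
function* countUp(n: number): Generator<number> {
  for (let i = 0; i < n; i++) {
    produced++;
    yield i;
  }
}

const outcome = collectWithBudget(countUp(1000), 3);
// outcome.rows is [0, 1, 2], outcome.truncated is true, and produced is 4:
// one extra row is pulled to detect that the budget was exceeded.
```

Because the source is lazy, the cost of an over-budget query is bounded by the budget instead of the full result size, which is what makes this degradation cheap as well as graceful.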

6. Incremental view maintenance vs full re-query

Materialize’s pattern (incremental view maintenance) is appealing: queries don’t re-run from scratch when the underlying data changes; only the delta is processed. Could Crosswalker adopt this pattern?

  • For Bases-rendered queries: Bases re-runs on every render; not naturally incremental
  • For materialized snapshots (v0.1.8): incremental updates would replace full regeneration
  • For closure cache: invalidation on any mappings change → full recompute; could be incremental if we tracked which mappings changed

Argue: where’s the v0.2+ payoff for incremental?
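To ground the closure-cache case, the sketch below shows the insert-only half of incremental maintenance: when a mapping `u -> v` is added, the only new reachability pairs run from `u` and everything that already reaches `u` to `v` and everything `v` already reaches. The `Closure` representation and the helper names are hypothetical, and deletions are deliberately not handled; they generally need support counting or recomputation.

```typescript
// Closure cache as: node -> set of nodes reachable from it (path length >= 1).
type Closure = Map<string, Set<string>>;

// Get (or create) the reachable set for a node.
function descendants(closure: Closure, id: string): Set<string> {
  let set = closure.get(id);
  if (!set) {
    set = new Set<string>();
    closure.set(id, set);
  }
  return set;
}

// Apply the delta for one inserted edge u -> v; returns the number of new pairs.
function addEdgeIncremental(closure: Closure, u: string, v: string): number {
  // Snapshot both sides before mutating anything.
  const sources: string[] = [u];
  closure.forEach((reachable, x) => {
    if (x !== u && reachable.has(u)) sources.push(x);
  });
  const targets: string[] = [v];
  descendants(closure, v).forEach((y) => {
    if (y !== v) targets.push(y);
  });

  let added = 0;
  for (const x of sources) {
    const reach = descendants(closure, x);
    for (const y of targets) {
      if (!reach.has(y)) {
        reach.add(y);
        added++;
      }
    }
  }
  return added;
}

// Existing cache for edges a -> b and c -> d, then insert b -> c.
const demoClosure: Closure = new Map([
  ["a", new Set(["b"])],
  ["c", new Set(["d"])],
]);
const addedPairs = addEdgeIncremental(demoClosure, "b", "c");
// addedPairs is 4: b->c, b->d, a->c, a->d
```

The work done is proportional to the number of affected pairs, not the size of the graph, which is the payoff to weigh against the bookkeeping of tracking which mappings changed.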

Crosswalker has 3 tiers (T1 Markdown, T2 sqlite-wasm sidecar, T3 server-side). Streaming spans tiers:

  • T1 → T2 projection: already streamed (v0.1.4.5)
  • T2 query → user surface: NOT streamed currently
  • T2 → T3 federation (when T3 is enabled): streaming required

What’s the cross-tier streaming protocol? Is it Arrow Flight? GraphQL subscriptions? Custom?
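For the "Custom" option, a minimal sketch of what a batch framing could look like: one JSON line per batch, a sequence number to detect gaps, and an explicit end-of-stream flag. The `Batch` envelope and both function names are hypothetical; this is neither Arrow Flight nor an existing Crosswalker format.

```typescript
// Hypothetical batch envelope for a custom cross-tier protocol.
interface Batch<T> {
  seq: number; // position in the stream, starting at 0
  rows: T[];
  last: boolean; // true only on the final batch
}

// Encode chunks as one JSON line per batch, holding back one chunk
// so the final batch can be flagged.
function* toFrames<T>(chunks: Iterable<T[]>): Generator<string> {
  let seq = 0;
  let pending: T[] | undefined;
  for (const chunk of chunks) {
    if (pending !== undefined) {
      yield JSON.stringify({ seq: seq++, rows: pending, last: false });
    }
    pending = chunk;
  }
  yield JSON.stringify({ seq, rows: pending ?? [], last: true });
}

// Decode lines back into rows, rejecting gaps and truncated streams.
function* fromFrames<T>(lines: Iterable<string>): Generator<T> {
  let expected = 0;
  for (const line of lines) {
    const batch = JSON.parse(line) as Batch<T>;
    if (batch.seq !== expected) throw new Error(`missing batch ${expected}`);
    expected++;
    for (const row of batch.rows) yield row;
    if (batch.last) return;
  }
  throw new Error("stream ended before the final batch");
}

const encoded: string[] = [];
for (const line of toFrames([[1, 2], [3]])) encoded.push(line);
const decoded: number[] = [];
for (const row of fromFrames<number>(encoded)) decoded.push(row);
// encoded has 2 lines; decoded is [1, 2, 3]
```

The explicit `last` flag lets the receiver distinguish a completed result from a dropped connection, which matters if partial results are ever cached.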

The deliverable must NOT recommend:

  1. Premature streaming optimization — at small/medium scale, materialization is fine. Don’t add streaming complexity unless ontology-web scale forces it
  2. Streaming for primitives that fundamentally need materialization (full anti-join with arbitrary predicates) — argue when streaming is impossible
  3. Forking sqlite-wasm to add streaming — out of scope
  4. Migrating off sqlite-wasm to enable streaming without a strong scale-driven justification (this is a Ch 33 question)
  5. Streaming patterns that don’t work on Obsidian Mobile — must work or degrade gracefully
  6. Speculative incremental-view-maintenance without a use case — IVM is heavy; needs concrete payoff
  7. Async patterns that block the main thread — Obsidian’s UI must stay responsive

The deliverable must produce:

  1. Streaming-pattern survey — 7+ engines × dimensions (model / chunked / incremental / out-of-core / WASM-feasible)
  2. Primitive-by-primitive streamability — for each of the 7 query primitives: streamable / partial / requires-materialization
  3. Crosswalker query streaming plans — concrete chunked execution paths for 4 representative queries
  4. WASM streaming feasibility analysis — what works / what’s blocked / native-fallback path
  5. Mobile streaming story — graceful degradation / error UX
  6. Incremental view maintenance verdict — where it pays off in Crosswalker; v0.2+ vs deferred
  7. Cross-tier streaming protocol — T1↔T2↔T3 streaming spec (or argument for sync-only)
  8. Recommended v0.1.7+ implementation — minimum streaming surface; deferred items

Project context:

Streaming-execution references:

Adjacent Crosswalker challenges:

Write the deliverable to docs/.../zz-research/YYYY-MM-DD-challenge-34-deliverable-a-<slug>.md. After deliverable lands: update synthesis log §9 status Ch 34 row from ⏳ to ✅; update v0.1.7 milestone scope with streaming patterns; archive this brief.