Challenge 34: Streaming / chunked query execution at ontology-web scale
Why this exists
Crosswalker’s v0.1.4.5 streaming refactor handled IMPORT-side streaming: large CSVs are processed chunk-by-chunk into Tier 1 files. The QUERY side does NOT stream: plugin.queryClosure() runs a recursive CTE that materializes the entire result set in memory. At small-vault scale (the current GRC bound) this is fine. At ontology-web scale (BioPortal: 700+ ontologies; UMLS: 3.5M concepts; OBO Foundry: hundreds × thousands), it isn’t.
The user’s specific concern: “How does merging scale chunk-by-chunk to not blow memory?” This is a streaming / iterative-merge question that prior challenges never asked.
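To make the merge question concrete, here is a minimal sketch of a streaming merge over two inputs that are already sorted. It keeps one pending item per input in memory instead of either full list. The name `mergeSorted` and its parameters are hypothetical, chosen for illustration only; nothing here reflects Crosswalker’s actual API, and the sketch assumes both inputs are sorted by the same comparator.

```typescript
// Hypothetical helper: lazily merges two sorted inputs into one sorted stream.
// Memory held at any moment: one pending item from each input.
function* mergeSorted<T>(
  a: Iterable<T>,
  b: Iterable<T>,
  compare: (x: T, y: T) => number,
): Generator<T> {
  const ia = a[Symbol.iterator]();
  const ib = b[Symbol.iterator]();
  let ra = ia.next();
  let rb = ib.next();

  // While both inputs still have items, emit the smaller head and advance it.
  while (!ra.done && !rb.done) {
    if (compare(ra.value, rb.value) <= 0) {
      yield ra.value;
      ra = ia.next();
    } else {
      yield rb.value;
      rb = ib.next();
    }
  }
  // Drain whichever input is left.
  while (!ra.done) {
    yield ra.value;
    ra = ia.next();
  }
  while (!rb.done) {
    yield rb.value;
    rb = ib.next();
  }
}
```

Because the function is a generator, a consumer that stops early never pulls the remaining items from either input.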
What we already have
| Asset | What it gives us |
|---|---|
| v0.1.4.5 streaming refactor | IMPORT-side streaming pattern |
| Tier 2 sqlite-wasm sidecar | In-memory database; OPFS-persisted; recursive CTE for closure |
| plugin.queryClosure() | Currently materializes full closure in memory |
| Closure cache | Lazy materialization; cached results don’t re-execute, but the cached result IS in memory |
What to investigate
1. Survey streaming-execution patterns
For each of the following, document the streaming model:
- DuckDB out-of-core / OOC — spill-to-disk for result sets exceeding RAM
- Polars streaming (https://docs.pola.rs/user-guide/concepts/streaming/) — lazy evaluation + chunked execution; pipeline operators
- Apache DataFusion — chunked execution over Arrow record batches
- Apache Arrow Flight — streaming protocol for analytical query results
- Materialize — incremental view maintenance (results recompute only what changed)
- Datomic — immutable indexes; queries iterate rather than materialize
- ClickHouse — streaming aggregates over massive tables
- CRDT-based DBs (Yjs, Automerge) — eventually-consistent merging
- Iterator / generator patterns — JS / TS streaming idioms
For each: how does it compose with relational + graph + vector primitives? At what scale does it pay off?
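As a baseline for the last item in the list, the plain JS / TS idiom is a generator that regroups any iterable into fixed-size chunks. The sketch below is illustrative only: `inChunks` is a hypothetical name, and an async source would need the `async function*` / `for await` variant of the same shape.

```typescript
// Hypothetical helper: regroups any iterable into arrays of at most `size` items.
// Only one chunk is buffered at a time.
function* inChunks<T>(source: Iterable<T>, size: number): Generator<T[]> {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  let buffer: T[] = [];
  for (const item of source) {
    buffer.push(item);
    if (buffer.length === size) {
      yield buffer;
      buffer = []; // start a fresh array so the consumer owns the yielded one
    }
  }
  // Emit the final partial chunk, if any.
  if (buffer.length > 0) yield buffer;
}
```

Downstream operators can then be written against `Iterable<T[]>`, which keeps peak memory proportional to the chunk size rather than the input size.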
2. Ontology-web specific streaming patterns
Some queries don’t have a natural chunked execution. Specifically:
- Closure (transitive reachability) — depth-first vs breadth-first; can it stream incrementally as the frontier expands?
- Pivot — group-by-two-axes; does it require materializing the full input or can it stream by pre-sorting?
- Anti-join — “X without Y” — needs both sides; can it stream by hashing?
- Multi-ontology join — joining concepts across 5+ ontologies; can it stream by ontology?
For each query primitive, characterize: streamable? requires-buffer? requires-materialization?
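For the closure primitive, one possible shape is breadth-first expansion that emits each new frontier as it is discovered. This is a sketch under assumptions, not a description of how plugin.queryClosure() works: `closureByFrontier` and `NeighborLookup` are hypothetical names, and the neighbor lookup stands in for whatever per-node edge query the substrate offers.

```typescript
// Hypothetical lookup: returns the direct successors of one node.
type NeighborLookup = (id: string) => Iterable<string>;

// Streams the transitive closure of `start` one BFS level at a time.
// The `visited` set is the unavoidable buffer: it grows to the number of
// reachable node IDs, but full result rows are never accumulated here.
function* closureByFrontier(
  start: string,
  neighbors: NeighborLookup,
  maxDepth: number = Infinity,
): Generator<{ depth: number; ids: string[] }> {
  const visited = new Set<string>([start]);
  let frontier: string[] = [start];
  for (let depth = 1; frontier.length > 0 && depth <= maxDepth; depth++) {
    const next: string[] = [];
    for (const id of frontier) {
      for (const n of neighbors(id)) {
        if (!visited.has(n)) {
          visited.add(n); // cycle guard: each node is emitted at most once
          next.push(n);
        }
      }
    }
    if (next.length > 0) yield { depth, ids: next };
    frontier = next;
  }
}
```

Under this framing closure lands in the "requires-buffer" category: results stream level by level, while the visited-ID set must stay resident for cycle detection.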
3. Crosswalker-specific streaming patterns
Apply the survey to Crosswalker’s actual queries. For each of these typical ontology-web queries, design a streaming execution path:
- Closure of MITRE ATT&CK techniques reachable from a NIST 800-53 control via 3+ hops
- Coverage matrix for NIST CSF × NIST 800-53 × ISO 27001 × CIS Controls (4-way pivot)
- Anti-join: SKOS subjects with no LCSH equivalent
- OBO Foundry: gene-ontology terms with no MONDO disease mapping (anti-join across 100K+ concepts)
For each: chunked execution plan, memory profile, worst-case latency at ontology-web scale.
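For the two anti-join queries, a hash-based plan can be sketched as follows: build a key set from the side that has mappings, then stream the other side in chunks and keep only the rows whose key is absent. `antiJoin` and `keyOf` are hypothetical names, and the sketch assumes an equality predicate and a build side whose keys fit in memory. Arbitrary predicates would not work this way.

```typescript
// Hypothetical helper: streams "left rows with no matching right key".
// Memory: the set of right-side keys plus one left chunk at a time.
function* antiJoin<L>(
  leftChunks: Iterable<L[]>,
  rightKeys: Iterable<string>,
  keyOf: (row: L) => string,
): Generator<L[]> {
  // Build phase: consume the right side once into a hash set of keys.
  const seen = new Set<string>(rightKeys);
  // Probe phase: each left chunk is filtered independently and then released.
  for (const chunk of leftChunks) {
    const unmatched = chunk.filter((row) => !seen.has(keyOf(row)));
    if (unmatched.length > 0) yield unmatched;
  }
}
```

The memory profile is then dominated by the build-side key set, so choosing the smaller side as the build side matters for the worst case.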
4. WASM-side streaming feasibility
sqlite-wasm runs in browser/Electron. Streaming faces specific WASM constraints:
- No shared-memory threads without SharedArrayBuffer; chunks can’t be parallelized across cores inside one WASM instance
- Memory-bounded (linear memory; 32-bit WASM can address at most 4GB, and practical limits are often lower)
- I/O happens through async OPFS or main-thread bridges
What streaming patterns work in WASM specifically? What’s blocked by WASM constraints? When do we have to drop to native (non-WASM, server-side) execution?
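One pattern that needs no threads is cooperative time-slicing: process chunks until a time budget is spent, then yield to the event loop before continuing. The sketch below is an assumption-laden illustration. `processInSlices`, `yieldToEventLoop`, and the 8 ms default budget are hypothetical, and the injectable `now` clock exists only to make the behavior testable.

```typescript
// Resolves on a later macrotask so rendering and input handling can run in between.
const yieldToEventLoop = (): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, 0));

// Hypothetical helper: handles chunks in time-boxed slices.
// Returns how many times it yielded control back to the event loop.
async function processInSlices<T>(
  chunks: Iterable<T[]>,
  handle: (chunk: T[]) => void,
  budgetMs: number = 8,
  now: () => number = () => Date.now(),
): Promise<number> {
  let sliceStart = now();
  let yields = 0;
  for (const chunk of chunks) {
    handle(chunk);
    // Once the current slice has used up its budget, pause before the next chunk.
    if (now() - sliceStart >= budgetMs) {
      await yieldToEventLoop();
      yields++;
      sliceStart = now();
    }
  }
  return yields;
}
```

This keeps the UI responsive only if a single chunk is cheap enough to handle within the budget; a chunk that takes longer than the budget still blocks for its full duration.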
5. Mobile streaming
On Obsidian Mobile, RAM is even more constrained. Streaming becomes more important AND more constrained:
- iOS WebView memory caps
- Capacitor lacks SharedArrayBuffer (per Ch 24 / WASM-A pivot)
What’s the mobile streaming story? Is it “queries above N rows fail on mobile, with graceful error message”?
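If the answer turns out to be a row ceiling, one way to express it is a wrapper that counts rows as chunks pass through and fails with a typed error the UI can turn into a readable message. This is a sketch only: `withRowBudget`, `RowBudgetExceededError`, and the idea of a per-device budget are hypothetical, and the value of N is left open as in the question above.

```typescript
// Hypothetical error type so callers can distinguish "too big for this device"
// from genuine query failures.
class RowBudgetExceededError extends Error {
  readonly budget: number;
  constructor(budget: number) {
    super(`Query exceeded the ${budget}-row budget for this device`);
    this.name = "RowBudgetExceededError";
    this.budget = budget;
  }
}

// Passes chunks through unchanged until the cumulative row count exceeds `budget`.
// Chunks yielded before the limit is hit have already reached the consumer.
function* withRowBudget<T>(chunks: Iterable<T[]>, budget: number): Generator<T[]> {
  let total = 0;
  for (const chunk of chunks) {
    total += chunk.length;
    if (total > budget) throw new RowBudgetExceededError(budget);
    yield chunk;
  }
}
```

A caller can check `error.name === "RowBudgetExceededError"` and show partial results alongside the message, since everything yielded before the throw is still valid.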
6. Incremental view maintenance vs full re-query
Materialize’s pattern (incremental view maintenance) is appealing: queries don’t re-run from scratch when the underlying data changes; only the delta is processed. Could Crosswalker adopt this pattern?
- For Bases-rendered queries: Bases re-runs on every render; not naturally incremental
- For materialized snapshots (v0.1.8): incremental updates would replace full regeneration
- For closure cache: invalidation on any mappings change → full recompute; could be incremental if we tracked which mappings changed
Argue: where’s the v0.2+ payoff for incremental?
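To ground the closure-cache bullet, here is a minimal sketch of insert-only incremental maintenance: when a mapping from `from` to `to` is added, the only new reachability pairs run from `from` and everything that already reaches it, to `to` and everything `to` already reaches. The `Closure` type and `insertMapping` are hypothetical, and this is not how the existing closure cache is implemented. Deletions are deliberately left out, since removing an edge generally requires re-deriving the affected pairs.

```typescript
// Hypothetical cache shape: node -> set of all nodes reachable from it.
type Closure = Map<string, Set<string>>;

// Applies one new edge to the cached closure and returns only the pairs it added.
function insertMapping(closure: Closure, from: string, to: string): Array<[string, string]> {
  // Sources: `from` itself plus every node that could already reach `from`.
  const sources: string[] = [from];
  for (const [node, reach] of closure) {
    if (node !== from && reach.has(from)) sources.push(node);
  }
  // Targets: `to` itself plus everything already reachable from `to`.
  // Snapshot into an array so later mutation of the sets does not affect it.
  const targets: string[] = [to, ...(closure.get(to) ?? [])];

  const added: Array<[string, string]> = [];
  for (const s of sources) {
    let reach = closure.get(s);
    if (!reach) {
      reach = new Set<string>();
      closure.set(s, reach);
    }
    for (const t of targets) {
      if (!reach.has(t)) {
        reach.add(t);
        added.push([s, t]);
      }
    }
  }
  return added;
}
```

The returned delta is what a dependent view would need to consume. Finding the sources here scans every cache entry, so a reverse index would be needed before this pays off at scale.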
7. Cross-tier streaming composition
Crosswalker has 3 tiers (T1 Markdown, T2 sqlite-wasm sidecar, T3 server-side). Streaming spans tiers:
- T1 → T2 projection: already streamed (v0.1.4.5)
- T2 query → user surface: NOT streamed currently
- T2 → T3 federation (when T3 is enabled): streaming required
What’s the cross-tier streaming protocol? Is it Arrow Flight? GraphQL subscriptions? Custom?
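As a reference point for the "Custom" option, the simplest pull-based shape is keyset pagination: the consumer asks for the next page after the last key it saw, so the producer never holds more than one page. The sketch below is not a proposed protocol. `Row`, `FetchPage`, and `pullPages` are hypothetical, the fetcher is synchronous only to keep the example short, and it assumes rows have a unique, ordered `key`.

```typescript
// Hypothetical row shape: any payload plus a unique, ordered key.
interface Row {
  key: string;
  payload: unknown;
}

// Hypothetical fetcher: returns up to `limit` rows with key > afterKey, in key order.
// A real implementation might run a query of the form
//   ... WHERE key > ? ORDER BY key LIMIT ?
type FetchPage = (afterKey: string | null, limit: number) => Row[];

// Pulls pages on demand; the next fetch happens only when the consumer asks for it.
function* pullPages(fetchPage: FetchPage, limit: number): Generator<Row[]> {
  let cursor: string | null = null;
  while (true) {
    const page: Row[] = fetchPage(cursor, limit);
    if (page.length === 0) return;
    yield page;
    // A short page means the source is exhausted, so skip the extra round trip.
    if (page.length < limit) return;
    cursor = page[page.length - 1].key;
  }
}
```

Because the consumer drives the loop, backpressure comes for free: a slow renderer simply requests pages more slowly.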
Anti-patterns to reject upfront
The deliverable must NOT recommend:
- Premature streaming optimization — at small/medium scale, materialization is fine. Don’t add streaming complexity unless ontology-web scale forces it
- Streaming for primitives that fundamentally need materialization (full anti-join with arbitrary predicates) — argue when streaming is impossible
- Forking sqlite-wasm to add streaming — out of scope
- Migrating off sqlite-wasm to enable streaming without a strong scale-driven justification (this is a Ch 33 question)
- Streaming patterns that don’t work on Obsidian Mobile — must work or degrade gracefully
- Speculative incremental-view-maintenance without a use case — IVM is heavy; needs concrete payoff
- Async patterns that block the main thread — Obsidian’s UI must stay responsive
Success criteria for the deliverable
The deliverable must produce:
- Streaming-pattern survey — 7+ engines × dimensions (model / chunked / incremental / out-of-core / WASM-feasible)
- Primitive-by-primitive streamability — for each of the 7 query primitives: streamable / partial / requires-materialization
- Crosswalker query streaming plans — concrete chunked execution paths for 4 representative queries
- WASM streaming feasibility analysis — what works / what’s blocked / native-fallback path
- Mobile streaming story — graceful degradation / error UX
- Incremental view maintenance verdict — where it pays off in Crosswalker; v0.2+ vs deferred
- Cross-tier streaming protocol — T1↔T2↔T3 streaming spec (or argument for sync-only)
- Recommended v0.1.7+ implementation — minimum streaming surface; deferred items
Anchored references
Project context:
- v0.1.4.5 streaming refactor (CHANGELOG)
- v0.1.5 Tier 2 sidecar shipped
- concepts/embedded-vs-server-substrates
Streaming-execution references:
- DuckDB out-of-core query execution
- Polars streaming engine
- Apache DataFusion architecture
- Apache Arrow Flight
- Materialize incremental view maintenance
- Datomic indexes
Adjacent Crosswalker challenges:
- Ch 33 — Multi-modal landscape audit (sister; substrate alternatives)
- Ch 37 — Tier 2-Lite scale rerun (sister; scale ceiling under ontology-web framing)
Hand-off
Write the deliverable to docs/.../zz-research/YYYY-MM-DD-challenge-34-deliverable-a-<slug>.md. After the deliverable lands: update the synthesis log §9 status row for Ch 34 from ⏳ to ✅; update the v0.1.7 milestone scope with streaming patterns; archive this brief.