
Challenge 34: Streaming / chunked query execution at ontology-web scale

Created May 8, 2026 Updated Jun 1, 2026

Why this exists

Crosswalker’s v0.1.4.5 streaming refactor handled IMPORT-side streaming: large CSVs are processed chunk-by-chunk into Tier 1 files. The QUERY side does NOT stream: plugin.queryClosure() runs a recursive CTE that materializes the entire result set in memory. At small-vault scale (the current GRC bound) this is fine. At ontology-web scale (BioPortal: 700+ ontologies; UMLS: 3.5M concepts; OBO Foundry: hundreds of ontologies, each with thousands of terms), it is not.

The user’s specific concern: “How does merging scale chunk-by-chunk to not blow memory?” This is a streaming / iterative-merge question that prior challenges never addressed.

What we already have

Asset	What it gives us
v0.1.4.5 streaming refactor	IMPORT-side streaming pattern
Tier 2 sqlite-wasm sidecar	In-memory database; OPFS-persisted; recursive CTE for closure
`plugin.queryClosure()`	Currently materializes full closure in memory
Closure cache	Lazy materialization; cached results don’t re-execute, but the cached result IS in memory

What to investigate

1. Survey streaming-execution patterns

For each of the following, document the streaming model:

DuckDB out-of-core (OOC) execution — spill-to-disk for intermediate state and result sets exceeding RAM
Polars streaming (https://docs.pola.rs/user-guide/concepts/streaming/) — lazy evaluation + chunked execution; pipeline operators
Apache DataFusion — chunked execution over Arrow record batches
Apache Arrow Flight — streaming protocol for analytical query results
Materialize — incremental view maintenance (results recompute only what changed)
Datomic — immutable indexes; queries iterate rather than materialize
ClickHouse — streaming aggregates over massive tables
CRDT libraries (Yjs, Automerge) — eventually-consistent merging
Iterator / generator patterns — JS / TS streaming idioms

For each: how does it compose with relational + graph + vector primitives? At what scale does it pay off?
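
As a baseline for the "iterator / generator patterns" entry, a minimal sketch of lazy, chunked execution in plain TypeScript follows. The names `inChunks`, `mapChunks`, and `conceptIds` are hypothetical and not part of Crosswalker; the row source is a stand-in for a large result set.

```typescript
// Split any iterable into fixed-size chunks without materializing the whole input.
function* inChunks<T>(source: Iterable<T>, chunkSize: number): Generator<T[]> {
  if (chunkSize < 1) throw new RangeError("chunkSize must be >= 1");
  let buffer: T[] = [];
  for (const item of source) {
    buffer.push(item);
    if (buffer.length === chunkSize) {
      yield buffer;
      buffer = [];
    }
  }
  if (buffer.length > 0) yield buffer; // flush the final partial chunk
}

// Lazily transform each chunk; nothing runs until a consumer pulls.
function* mapChunks<T, U>(chunks: Iterable<T[]>, fn: (row: T) => U): Generator<U[]> {
  for (const rows of chunks) yield rows.map(fn);
}

// Hypothetical row source standing in for a large result set.
function* conceptIds(count: number): Generator<number> {
  for (let i = 0; i < count; i++) yield i;
}

const labelledChunks = mapChunks(inChunks(conceptIds(10), 4), (id) => `concept-${id}`);
const chunkSizes: number[] = [];
for (const rows of labelledChunks) chunkSizes.push(rows.length);
// chunkSizes is [4, 4, 2]; at most one 4-row chunk is alive at any time.
```

The point of the sketch is the pull model: each stage holds one chunk, so peak memory depends on the chunk size rather than on the total row count. The engines surveyed above apply the same idea with far richer operators.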

2. Ontology-web specific streaming patterns

Some queries don’t have a natural chunked execution. Specifically:

Closure (transitive reachability) — depth-first vs breadth-first; can it stream incrementally as the frontier expands?
Pivot — group-by-two-axes; does it require materializing the full input or can it stream by pre-sorting?
Anti-join — “X without Y” — needs both sides; can it stream by hashing?
Multi-ontology join — joining concepts across 5+ ontologies; can it stream by ontology?

For each query primitive, characterize: streamable? requires-buffer? requires-materialization?
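
For the closure primitive, one way to stream as the frontier expands is a breadth-first traversal that yields each level when it is discovered. The sketch below is an illustration under assumptions, not Crosswalker's implementation: `closureByLevel` is a hypothetical name, and edges are held in an in-memory array for brevity. The visited set still grows with the closure, so this falls in the "requires-buffer" class: node IDs are retained, but full result rows are not.

```typescript
type Edge = [string, string]; // [from, to]

// Yield reachable nodes one BFS level at a time. Only the visited set and the
// current frontier are retained; each level can be handed off once it is found.
function* closureByLevel(
  edges: readonly Edge[],
  start: string,
  maxDepth = Infinity,
): Generator<{ depth: number; nodes: string[] }> {
  const adjacency = new Map<string, string[]>();
  for (const [from, to] of edges) {
    const targets = adjacency.get(from);
    if (targets) targets.push(to);
    else adjacency.set(from, [to]);
  }
  const visited = new Set<string>([start]);
  let frontier = [start];
  for (let depth = 1; frontier.length > 0 && depth <= maxDepth; depth++) {
    const next: string[] = [];
    for (const node of frontier) {
      for (const neighbour of adjacency.get(node) ?? []) {
        if (!visited.has(neighbour)) {
          visited.add(neighbour); // dedupe; also guards against cycles
          next.push(neighbour);
        }
      }
    }
    if (next.length > 0) yield { depth, nodes: next };
    frontier = next;
  }
}

const demoEdges: Edge[] = [["A", "B"], ["A", "C"], ["B", "D"], ["C", "D"], ["D", "A"]];
const levels = Array.from(closureByLevel(demoEdges, "A"));
// levels: depth 1 has ["B", "C"], depth 2 has ["D"]; the D -> A cycle adds nothing new.
```

Breadth-first order is used here because it emits results in hop order, which suits depth-bounded queries; a depth-first variant would need the same visited set.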

3. Crosswalker-specific streaming patterns

Apply the survey to Crosswalker’s actual queries. For each of these typical ontology-web queries, design a streaming execution path:

Closure of MITRE ATT&CK techniques reachable from a NIST 800-53 control via 3+ hops
Coverage matrix for NIST CSF × NIST 800-53 × ISO 27001 × CIS Controls (4-way pivot)
Anti-join: SKOS subjects with no LCSH equivalent
OBO Foundry: Gene Ontology (GO) terms with no MONDO disease mapping (anti-join across 100K+ concepts)

For each: chunked execution plan, memory profile, worst-case latency at ontology-web scale.
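
The anti-join queries above can be sketched as a hash anti-join: build a set of keys from the side that fits in memory, then stream the other side past it. This is a generic illustration; `antiJoin`, the row shapes, and the IDs are hypothetical toy data, and an equality predicate on a single key is assumed.

```typescript
// Build a hash set from the "right" keys, then stream the left side and emit
// rows whose key is absent. Memory is O(|right keys|), independent of |left|.
function* antiJoin<L, K>(
  left: Iterable<L>,
  rightKeys: Iterable<K>,
  keyOf: (row: L) => K,
): Generator<L> {
  const seen = new Set<K>(rightKeys);
  for (const row of left) {
    if (!seen.has(keyOf(row))) yield row;
  }
}

// Toy data: subjects, and the subject IDs that already have an equivalent mapping.
const subjects = [{ id: "s1" }, { id: "s2" }, { id: "s3" }];
const mappedSubjectIds = ["s1", "s3"];

const unmapped = Array.from(antiJoin(subjects, mappedSubjectIds, (s) => s.id));
// unmapped contains only the subject with id "s2".
```

The memory profile depends on which side is hashed, so a plan would normally hash the smaller key set. With arbitrary (non-equality) predicates this shortcut does not apply.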

4. WASM-side streaming feasibility

sqlite-wasm runs in browser/Electron. Streaming faces specific WASM constraints:

No threads (without SharedArrayBuffer); can’t parallelize chunks across cores
Memory-bounded (single linear memory; 4 GiB is the maximum address space on 32-bit WASM, and engines often allow less in practice)
I/O happens through async OPFS or main-thread bridges

What streaming patterns work in WASM specifically? What’s blocked by WASM constraints? When do we have to drop to native (non-WASM, server-side) execution?
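
One pattern that needs neither threads nor extra memory is pulling rows from a step-based cursor in fixed-size chunks. The sketch below assumes a hypothetical `RowCursor` interface, loosely modelled on step-style prepared statements; it should be checked against the actual sqlite-wasm build in use before relying on it. `arrayCursor` is a test double, not a real database handle.

```typescript
// Hypothetical cursor shape (an assumption, not a confirmed sqlite-wasm API).
interface RowCursor<R> {
  step(): boolean; // advance to the next row; false when exhausted
  get(): R; // read the current row
  finalize(): void; // release the underlying statement
}

// Pull rows chunk by chunk; the cursor is finalized even if the consumer stops early.
function* pullChunks<R>(cursor: RowCursor<R>, chunkSize: number): Generator<R[]> {
  try {
    let rows: R[] = [];
    while (cursor.step()) {
      rows.push(cursor.get());
      if (rows.length === chunkSize) {
        yield rows;
        rows = [];
      }
    }
    if (rows.length > 0) yield rows;
  } finally {
    cursor.finalize();
  }
}

// Test double backed by an array, standing in for a database statement.
function arrayCursor<R>(rows: readonly R[]) {
  let index = -1;
  const state = { finalized: false };
  const cursor: RowCursor<R> = {
    step: () => ++index < rows.length,
    get: () => rows[index],
    finalize: () => {
      state.finalized = true;
    },
  };
  return { cursor, state };
}

const demo = arrayCursor([1, 2, 3, 4, 5]);
const firstChunks: number[][] = [];
for (const rows of pullChunks(demo.cursor, 2)) {
  firstChunks.push(rows);
  if (firstChunks.length === 2) break; // early exit still runs the finally block
}
// firstChunks is [[1, 2], [3, 4]] and demo.state.finalized is true.
```

This bounds the JavaScript-side buffer to one chunk. It does not bound whatever the query engine materializes internally, which is a separate question for recursive CTEs.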

5. Mobile streaming

On Obsidian Mobile, RAM is even more constrained. Streaming becomes more important AND more constrained:

iOS WebView memory caps
Capacitor lacks SharedArrayBuffer (per Ch 24 / WASM-A pivot)

What’s the mobile streaming story? Is it “queries above N rows fail on mobile, with graceful error message”?
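
If the answer is a row budget with a graceful failure, the guard could look roughly like the sketch below. `collectWithinBudget`, the result shape, and the message wording are all hypothetical; the actual value of N and the error UX are what the deliverable has to decide.

```typescript
type BudgetedResult<T> =
  | { ok: true; rows: T[] }
  | { ok: false; rows: T[]; message: string };

// Collect rows until the budget is exceeded, then stop pulling from the source
// and return a structured failure instead of exhausting memory.
function collectWithinBudget<T>(source: Iterable<T>, maxRows: number): BudgetedResult<T> {
  const rows: T[] = [];
  for (const row of source) {
    if (rows.length === maxRows) {
      return {
        ok: false,
        rows, // partial result, usable for a preview
        message: `This query returns more than ${maxRows} rows, which exceeds the limit on this device.`,
      };
    }
    rows.push(row);
  }
  return { ok: true, rows };
}

// Hypothetical large source; only maxRows + 1 rows are ever pulled from it.
function* manyRows(total: number): Generator<number> {
  for (let i = 1; i <= total; i++) yield i;
}

const budgeted = collectWithinBudget(manyRows(1000000), 3);
// budgeted.ok is false and budgeted.rows is [1, 2, 3].
```

Returning a discriminated result rather than throwing keeps the failure path explicit at the call site, which makes it easier to render a message and a partial preview.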

6. Incremental view maintenance vs full re-query

Materialize’s pattern (incremental view maintenance) is appealing: queries don’t re-run from scratch when the underlying data changes; only the delta is processed. Could Crosswalker adopt this pattern?

For Bases-rendered queries: Bases re-runs on every render; not naturally incremental
For materialized snapshots (v0.1.8): incremental updates would replace full regeneration
For closure cache: invalidation on any mappings change → full recompute; could be incremental if we tracked which mappings changed

Argue: where’s the v0.2+ payoff for incremental?
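
To make the closure-cache case concrete, here is a minimal sketch of incremental maintenance for edge insertions only. When an edge u → v is added, the only new reachability pairs are (s, t) with s in {u} plus the ancestors of u, and t in {v} plus the descendants of v. `IncrementalClosure` is a hypothetical class, not existing Crosswalker code, and deletions are deliberately out of scope because they need a different (and heavier) treatment.

```typescript
class IncrementalClosure {
  private readonly descendants = new Map<string, Set<string>>();
  private readonly ancestors = new Map<string, Set<string>>();

  private setFor(index: Map<string, Set<string>>, node: string): Set<string> {
    let entry = index.get(node);
    if (!entry) {
      entry = new Set<string>();
      index.set(node, entry);
    }
    return entry;
  }

  // Insert edge from -> to and return only the reachability pairs that are new.
  addEdge(from: string, to: string): Array<[string, string]> {
    // Snapshot both sides before mutating so the loops see pre-insert state.
    const sources = [from, ...Array.from(this.setFor(this.ancestors, from))];
    const targets = [to, ...Array.from(this.setFor(this.descendants, to))];
    const delta: Array<[string, string]> = [];
    for (const s of sources) {
      const reach = this.setFor(this.descendants, s);
      for (const t of targets) {
        if (!reach.has(t)) {
          reach.add(t);
          this.setFor(this.ancestors, t).add(s);
          delta.push([s, t]);
        }
      }
    }
    return delta;
  }

  reachable(from: string): string[] {
    return Array.from(this.descendants.get(from) ?? []).sort();
  }
}

const closureIndex = new IncrementalClosure();
closureIndex.addEdge("A", "B");
closureIndex.addEdge("C", "D");
const bridgeDelta = closureIndex.addEdge("B", "C");
// bridgeDelta is [["B","C"], ["B","D"], ["A","C"], ["A","D"]];
// closureIndex.reachable("A") is ["B", "C", "D"].
```

The trade-off is visible even in the sketch: the index stores the full closure in both directions, so it exchanges memory for update speed. That cost is part of what the payoff argument has to weigh.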

7. Cross-tier streaming composition

Crosswalker has 3 tiers (T1 Markdown, T2 sqlite-wasm sidecar, T3 server-side). Streaming spans tiers:

T1 → T2 projection: already streamed (v0.1.4.5)
T2 query → user surface: NOT streamed currently
T2 → T3 federation (when T3 is enabled): streaming required

What’s the cross-tier streaming protocol? Is it Arrow Flight? GraphQL subscriptions? Custom?
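
If the answer turns out to be "custom", the minimum such a protocol needs is a chunk envelope with sequence numbers and an explicit end marker, so a receiver can detect gaps and truncation. The sketch below is purely illustrative: `StreamMessage`, `encodeStream`, and `consumeStream` are hypothetical names, transport is left out entirely, and nothing here is taken from Arrow Flight or GraphQL.

```typescript
type StreamMessage<R> =
  | { kind: "chunk"; queryId: string; seq: number; rows: R[] }
  | { kind: "end"; queryId: string; totalChunks: number }
  | { kind: "error"; queryId: string; message: string };

// Producer side: wrap chunks in numbered envelopes and finish with an end marker.
function* encodeStream<R>(queryId: string, chunks: Iterable<R[]>): Generator<StreamMessage<R>> {
  let seq = 0;
  for (const rows of chunks) {
    yield { kind: "chunk", queryId, seq, rows };
    seq++;
  }
  yield { kind: "end", queryId, totalChunks: seq };
}

// Consumer side: hand each chunk to a callback, verifying order and completeness.
// Returns the number of chunks delivered.
function consumeStream<R>(
  messages: Iterable<StreamMessage<R>>,
  onChunk: (rows: R[]) => void,
): number {
  let expectedSeq = 0;
  for (const message of messages) {
    if (message.kind === "error") {
      throw new Error(`query ${message.queryId} failed: ${message.message}`);
    }
    if (message.kind === "end") {
      if (message.totalChunks !== expectedSeq) {
        throw new Error(`expected ${message.totalChunks} chunks, received ${expectedSeq}`);
      }
      return expectedSeq;
    }
    if (message.seq !== expectedSeq) {
      throw new Error(`chunk ${message.seq} arrived, expected ${expectedSeq}`);
    }
    expectedSeq++;
    onChunk(message.rows);
  }
  throw new Error("stream closed without an end message");
}

let deliveredRows = 0;
const deliveredChunks = consumeStream(encodeStream("q1", [[1, 2], [3]]), (rows) => {
  deliveredRows += rows.length;
});
// deliveredChunks is 2 and deliveredRows is 3.
```

An established protocol would supply this framing plus backpressure and schema negotiation, which is the main argument for evaluating Arrow Flight before designing a custom envelope.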

Anti-patterns to reject upfront

The deliverable must NOT recommend:

Premature streaming optimization — at small/medium scale, materialization is fine. Don’t add streaming complexity unless ontology-web scale forces it
Streaming for primitives that fundamentally need materialization (full anti-join with arbitrary predicates) — argue when streaming is impossible
Forking sqlite-wasm to add streaming — out of scope
Migrating off sqlite-wasm to enable streaming without a strong scale-driven justification (this is a Ch 33 question)
Streaming patterns that don’t work on Obsidian Mobile — must work or degrade gracefully
Speculative incremental-view-maintenance without a use case — IVM is heavy; needs concrete payoff
Nominally async patterns that still block the main thread (long synchronous work between awaits) — Obsidian’s UI must stay responsive

Success criteria for the deliverable

The deliverable must produce:

Streaming-pattern survey — 7+ engines × dimensions (model / chunked / incremental / out-of-core / WASM-feasible)
Primitive-by-primitive streamability — for each of the 7 query primitives: streamable / partial / requires-materialization
Crosswalker query streaming plans — concrete chunked execution paths for 4 representative queries
WASM streaming feasibility analysis — what works / what’s blocked / native-fallback path
Mobile streaming story — graceful degradation / error UX
Incremental view maintenance verdict — where it pays off in Crosswalker; v0.2+ vs deferred
Cross-tier streaming protocol — T1↔T2↔T3 streaming spec (or argument for sync-only)
Recommended v0.1.7+ implementation — minimum streaming surface; deferred items

Anchored references

Project context:

Streaming-execution references:

Adjacent Crosswalker challenges:

Ch 33 — Multi-modal landscape audit (sister; substrate alternatives)
Ch 37 — Tier 2-Lite scale rerun (sister; scale ceiling under ontology-web framing)

Hand-off

Write the deliverable to docs/.../zz-research/YYYY-MM-DD-challenge-34-deliverable-a-<slug>.md. After the deliverable lands: update the synthesis log §9 status row for Ch 34 from ⏳ to ✅; update the v0.1.7 milestone scope with streaming patterns; archive this brief.