Streaming primitive refactor — Layer A goes iterable-first
Where this log fits
Section titled “Where this log fits”| Layer | This refactor’s effect |
|---|---|
| A — Primitives (filter / traverse / bind / project / aggregate / anti-join / set-op / diff) | API shape changes from Array → Array to Iterable → Iterable for the streamable subset |
| B — View shapes | Consumers of primitive output (pivot, future table/list/hierarchy) materialize when they need a final shape; can short-circuit on cell-count guard before materializing |
| C — Recipes | Recipe-runtime composer (not yet built) chains primitives via for await / for; result is consumed lazily, no full intermediate result-set in memory unless required |
| Mechanism | In-memory Iterable<Row> for ≤budget; Tier 2 sqlite-wasm cursor for above-budget. Recipe declares scale hint; runtime dispatches |
User pushed back (2026-05-19): “make sure the join logic on the backend is optimized from the beginning… streaming approach… certain operations aren’t optimized yet across tooling.”
The Phase 5+6 Layer A primitives shipped as Array → Array. That was fine for unit tests + integration tests (≤30 rows), but it’s the wrong shape for the recipe runtime we’re about to wire — once primitives have array call sites, every consumer assumes the full result is in memory.
Decision: refactor every primitive to iterable-first shape NOW, before the recipe runtime wiring. Keep array overloads for compatibility with existing tests + small-data ergonomics. Document the streamability matrix per primitive.
Streamability matrix
Section titled “Streamability matrix”| Primitive | Streamable single-pass? | Implementation |
|---|---|---|
filter | ✅ Trivially — generator | function* filterStream(rows: Iterable<Row>, pred: (r: Row) => boolean): Iterable<Row> |
bind | ✅ Trivially — generator | Pure row-by-row transform |
project | ✅ Trivially — generator | Pick-subset; row-by-row |
aggregate (count/sum/min/max) | ✅ Accumulator-based | Single-pass, constant memory |
aggregate (median/percentile) | ⚠️ Needs full data OR approximation | t-digest or sorted-sample for large inputs; defer to Tier 2 SQL for v0.1.6+ |
inner / outer / anti join | ⚠️ Hybrid — hash smaller side, stream larger | Memory bounded by smaller side. Standard hash-join (DuckDB/Polars). Caller can choose which side to materialize. |
set-op (union/intersection/difference) | ⚠️ Hybrid — hash one side, stream the other | Same shape as joins. Union is streamable from one side + dedup-set. |
diff | ❌ Both sides indexed | Inherently set-comparison; stays array-based. Document the contract. |
traverse(depth=*) / closure | ❌ Iterative-fixpoint; non-streaming by nature | Stays Tier 2 SQL (recursive CTE). |
ChunkyCSV provenance
Section titled “ChunkyCSV provenance”User authored ChunkyCSV (chunked-iterable CSV processing) — same mental model: chunks flow through transforms, hash-build is the only operation that buffers, and the smaller side gets buffered while the larger streams. Crosswalker’s IMPORT side already aligns with this (PapaParse → AsyncIterable<Row> → generation engine, shipped v0.1.4.5). The QUERY-side primitives now adopt the same shape.
What we borrow from ChunkyCSV: the pattern. What we have to write ourselves: (a) integration with Tier 2 sqlite-wasm for spill-to-disk when in-memory hash exceeds budget (novel — pending Ch 34 deliverable for design validation), (b) the recipe-runtime composer that chains primitives lazily (application-specific). Hash-build join, accumulator aggregate, generator filter/bind/project are standard CS — no novel algorithms.
Why now (not deferred to v0.1.7+)
Section titled “Why now (not deferred to v0.1.7+)”The recipe runtime is the next piece (gap #1 + #5 from the 2026-05-19 shift-left audit). Once the runtime is wired with array-based primitives, every consumer (pivot, future hierarchy, future table) assumes the full row-set is materialized. Refactoring after that costs much more. This is the cheap moment.
Out-of-scope (defer to Ch 34 deliverable + v0.1.7+)
Section titled “Out-of-scope (defer to Ch 34 deliverable + v0.1.7+)”| Item | Why deferred |
|---|---|
| Spill-to-disk for joins/set-ops/diffs exceeding RAM budget | Needs Ch 34 design validation (DuckDB out-of-core, Polars streaming, DataFusion chunked execution survey). Filing Ch 34 deliverable in parallel with this refactor. |
| Chunked-merge for distributed result sets | v0.2+ federation territory |
| Approximate aggregates (t-digest median, HyperLogLog distinct count) | Niche; recipe-author can fall back to Tier 2 SQL for now |
| Recipe-runtime composer (“primitive pipeline executor”) | Phase 7 or v0.1.7 Phase 1; depends on this refactor landing first |
Migration / backward-compat
Section titled “Migration / backward-compat”| Surface | Plan |
|---|---|
| Existing array-based call sites in tests | Keep the array overloads — bind(rows: Row[], ...) → Row[] calls bindStream(rows, ...) internally and materializes |
| Existing integration tests (Phase 6.1) | Continue passing; array overloads stay valid |
| New streaming call sites in recipe runtime | Use *Stream variants directly; consume via for / for await |
| Public API surface (when v0.1.7 recipe-runtime composer ships) | Generic Iterable<Row> parameter accepts both arrays AND generators |
Implementation plan (this commit)
Section titled “Implementation plan (this commit)”src/views/filter-primitive.ts(new) — extract filter from “implicit Bases-native” to an explicit module with bothfilter(rows[], pred)array form andfilterStream(rows, pred)generator formsrc/views/bind-primitive.ts— addbindStream()+bindManyStream()generatorssrc/views/join-primitives.ts— add*Streamvariants: hash-build on right side, stream leftsrc/views/set-op-primitive.ts— add streaming union/intersection/difference (right side hashed)src/views/diff-primitive.ts— documented as array-only (inherently set-comparison); no streaming varianttests/integration/streaming-primitives.test.ts(new) — exercise streaming variants over realistic fixtures wrapped as iterables; assert same output as array versions- Keep existing array overloads; existing tests stay green
Ch 34 deliverable
Section titled “Ch 34 deliverable”Filed 2026-05-08; never run. Queueing now for fresh-agent research session. The deliverable should answer:
- What spill-to-disk pattern does DuckDB / Polars / DataFusion use when a join’s hash exceeds RAM?
- How does sqlite-wasm handle out-of-core? Does it have a comparable pattern (temp tables in OPFS)?
- What’s the right “scale hint” recipe authors declare (
small / medium / large)? - Should Crosswalker route above-threshold queries to a worker thread for cooperative yielding?
Once the deliverable lands, the spill-to-disk + chunked-merge integration code goes into Phase 7 / v0.1.7 Phase 1.
Empirical validation (2026-05-19, desktop Electron, in-vault benchmark)
Section titled “Empirical validation (2026-05-19, desktop Electron, in-vault benchmark)”Ran Crosswalker: Run primitives benchmark (perf) at scales 1k / 100k / 1M on real desktop hardware. Key results at 1M rows:
| Op | array | stream | winner |
|---|---|---|---|
| filter | 57.6ms | 15.7ms | stream 3.6× |
| set-op-intersection | 331ms | 135ms | stream 2.4× |
| inner-join | 690ms | 434ms | stream 1.6× |
| left-outer-join | 490ms | 514ms | ~tied |
| anti-join | 105ms | 107ms | ~tied |
| diff (array-only) | 643ms | — | slowest; can’t stream |
Confirmed:
- Streaming wins decisively where it matters — joins + set-ops + filter at scale. The iterable-first refactor paid off as designed.
left-outer-join/anti-jointie — they preserve all left rows so output is input-sized either way; no materialization advantage.diffis the bottleneck (643ms, non-streamable) — both sides indexed. The candidate for Tier 2 SQL spill at ontology-web scale (Ch 34 work).inner-joinat 1M+ is the second candidate.
Methodology caveats (honest):
- The
heapDeltaBytesnumbers are GC-noisy and directional only — negative deltas (-107MB,-615MB) mean GC fired mid-op, not that memory shrank.performance.memoryaround a single sync call catches collection events. Real retained-memory measurement needs forced GC between ops (--expose-gc, unavailable in Obsidian) or steady-state sampling. Treat heap data as “this op allocates a large working set” (joins/diff/bind ~80–110MB at 1M) vs “this op stays flat” (scan-shaped ops), not as precise figures. - The benchmark harness under-sells streaming’s memory win: it calls
consume()to materialize every streaming result into an array for apples-to-apples timing. In a real pipeline (filterStream → joinStream → render first N) the full intermediate is never materialized — that win is invisible to this harness, which only proves the speed win.
Scale verdict: ≤100k rows all sub-55ms (no concern). 1M heavy ops 0.4–0.7s each, briefly froze the UI (synchronous; expected). Desktop-fine; mobile would need chunking/worker-thread (Ch 34).
Related
Section titled “Related”- v0.1.4.5 streaming refactor — IMPORT-side
AsyncIterable<Row>shipped - Phase 5 scope log — the join substrate this refactor reshapes
- Phase 6.1 integration tests — the realistic-data test foundation this builds on
- Ch 34 — Streaming / chunked query execution at ontology-web scale — research brief; deliverable to be filed in parallel with this work
- Query primitives concept — locked 8-primitive set; this refactor updates the implementation, not the verbs