
Streaming primitive refactor — Layer A goes iterable-first

Created May 19, 2026 Updated Jun 1, 2026

Where this log fits

Layer	This refactor’s effect
A — Primitives (filter / traverse / bind / project / aggregate / anti-join / set-op / diff)	API shape changes from `Array → Array` to `Iterable → Iterable` for the streamable subset
B — View shapes	Consumers of primitive output (pivot, future table/list/hierarchy) materialize when they need a final shape; can short-circuit on cell-count guard before materializing
C — Recipes	Recipe-runtime composer (not yet built) chains primitives via `for await` / `for`; result is consumed lazily, no full intermediate result-set in memory unless required
Mechanism	In-memory `Iterable<Row>` for ≤budget; Tier 2 sqlite-wasm cursor for above-budget. Recipe declares scale hint; runtime dispatches
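To make the Layer B short-circuit in the table above concrete, here is a minimal sketch of a guard that materializes an iterable only until a cell budget is exceeded. `materializeWithGuard`, `maxCells`, and the instrumented `source` generator are hypothetical names for illustration, not Crosswalker's API, and counting one cell per key is an assumption about what the guard measures.

```typescript
type Row = Record<string, unknown>;

interface GuardResult {
  rows: Row[];
  exceeded: boolean;
}

// Hypothetical guard: materialize rows until the cell budget would be exceeded.
// Returning from inside for...of closes the upstream generator, so rows past
// the budget are never produced.
function materializeWithGuard(rows: Iterable<Row>, maxCells: number): GuardResult {
  const out: Row[] = [];
  let cells = 0;
  for (const row of rows) {
    cells += Object.keys(row).length; // assumption: one cell per key
    if (cells > maxCells) {
      return { rows: out, exceeded: true };
    }
    out.push(row);
  }
  return { rows: out, exceeded: false };
}

// Instrumented source so the example can show how many rows were pulled.
let pulled = 0;
function* source(n: number): Generator<Row> {
  for (let i = 0; i < n; i++) {
    pulled++;
    yield { id: i, label: `row-${i}` };
  }
}

// Budget of 6 cells = 3 two-field rows; the 4th row trips the guard.
const guarded = materializeWithGuard(source(1000000), 6);
console.log(guarded.rows.length, guarded.exceeded, pulled); // 3 true 4
```

With an array input the guard would save nothing, because the array already exists; the benefit only appears when the upstream is a generator chain.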

TL;DR

User pushed back (2026-05-19): “make sure the join logic on the backend is optimized from the beginning… streaming approach… certain operations aren’t optimized yet across tooling.”

The Phase 5+6 Layer A primitives shipped as Array → Array. That was fine for unit tests + integration tests (≤30 rows), but it’s the wrong shape for the recipe runtime we’re about to wire — once primitives have array call sites, every consumer assumes the full result is in memory.

Decision: refactor every streamable primitive to an iterable-first shape NOW, before the recipe runtime wiring. Keep array overloads for compatibility with existing tests + small-data ergonomics. Document the streamability matrix per primitive, including the ones that stay array-based or Tier 2 SQL.

Streamability matrix

Primitive	Streamable single-pass?	Implementation
`filter`	✅ Trivially — generator	`function* filterStream(rows: Iterable<Row>, pred: (r: Row) => boolean): Iterable<Row>`
`bind`	✅ Trivially — generator	Pure row-by-row transform
`project`	✅ Trivially — generator	Pick-subset; row-by-row
`aggregate` (count/sum/min/max)	✅ Accumulator-based	Single-pass, constant memory
`aggregate` (median/percentile)	⚠️ Needs full data OR approximation	t-digest or sorted-sample for large inputs; defer to Tier 2 SQL for v0.1.6+
`inner / outer / anti join`	⚠️ Hybrid — hash smaller side, stream larger	Memory bounded by smaller side. Standard hash-join (DuckDB/Polars). Caller can choose which side to materialize.
`set-op` (union/intersection/difference)	⚠️ Hybrid — hash one side, stream the other	Same shape as joins. Union is streamable from one side + dedup-set.
`diff`	❌ Both sides indexed	Inherently set-comparison; stays array-based. Document the contract.
`traverse(depth=*)` / closure	❌ Iterative-fixpoint; non-streaming by nature	Stays Tier 2 SQL (recursive CTE).
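The "trivially streamable" and "accumulator-based" rows of the matrix can be sketched as follows. `filterStream` uses the signature from the matrix; `projectStream`, `aggregateStream`, and the `Stats` shape are hypothetical stand-ins, and skipping non-numeric values is an assumption made for this sketch. `bind` would follow the same row-by-row generator shape as `projectStream`.

```typescript
type Row = Record<string, unknown>;

// Generator form of filter: yields matching rows one at a time.
function* filterStream(rows: Iterable<Row>, pred: (r: Row) => boolean): Iterable<Row> {
  for (const row of rows) {
    if (pred(row)) yield row;
  }
}

// Hypothetical projectStream: keep only the named fields, row by row.
function* projectStream(rows: Iterable<Row>, fields: string[]): Iterable<Row> {
  for (const row of rows) {
    const picked: Row = {};
    for (const f of fields) {
      if (f in row) picked[f] = row[f];
    }
    yield picked;
  }
}

interface Stats {
  count: number;
  sum: number;
  min: number;
  max: number;
}

// Hypothetical accumulator aggregate: one pass, constant memory.
// Non-numeric values are skipped (an assumption made for this sketch).
function aggregateStream(rows: Iterable<Row>, field: string): Stats {
  let count = 0;
  let sum = 0;
  let min = Infinity;
  let max = -Infinity;
  for (const row of rows) {
    const v = row[field];
    if (typeof v !== "number") continue;
    count++;
    sum += v;
    if (v < min) min = v;
    if (v > max) max = v;
  }
  return { count, sum, min, max };
}

const sample: Row[] = [
  { id: 1, score: 10, tag: "a" },
  { id: 2, score: 25, tag: "b" },
  { id: 3, score: 40, tag: "a" },
  { id: 4, score: 5, tag: "b" },
];

const atLeastTen = (r: Row): boolean => {
  const s = r["score"];
  return typeof s === "number" && s >= 10;
};

// Nothing runs until aggregateStream pulls rows through the chain.
const passing = filterStream(sample, atLeastTen);
const slim = projectStream(passing, ["id", "score"]);
const stats = aggregateStream(slim, "score");
console.log(stats); // { count: 3, sum: 75, min: 10, max: 40 }
```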

ChunkyCSV provenance

User authored ChunkyCSV (chunked-iterable CSV processing) — same mental model: chunks flow through transforms, hash-build is the only operation that buffers, and the smaller side gets buffered while the larger streams. Crosswalker’s IMPORT side already aligns with this (PapaParse → AsyncIterable<Row> → generation engine, shipped v0.1.4.5). The QUERY-side primitives now adopt the same shape.

What we borrow from ChunkyCSV: the pattern. What we have to write ourselves: (a) integration with Tier 2 sqlite-wasm for spill-to-disk when in-memory hash exceeds budget (novel — pending Ch 34 deliverable for design validation), (b) the recipe-runtime composer that chains primitives lazily (application-specific). Hash-build join, accumulator aggregate, generator filter/bind/project are standard CS — no novel algorithms.
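The hash-build pattern described above (buffer one side, stream the other) might look roughly like this. The function names, the single-field key parameters, and the merge rule (left fields win on a name clash) are assumptions for illustration; the real signatures in `src/views/join-primitives.ts` may differ.

```typescript
type Row = Record<string, unknown>;

// Build phase: the only step that buffers. Index the (ideally smaller) side by key.
// Map compares keys with SameValueZero, so composite keys would need serializing first.
function buildIndex(rows: Iterable<Row>, key: string): Map<unknown, Row[]> {
  const index = new Map<unknown, Row[]>();
  for (const row of rows) {
    const k = row[key];
    const bucket = index.get(k);
    if (bucket) bucket.push(row);
    else index.set(k, [row]);
  }
  return index;
}

// Probe phase for an inner join: stream the left side against the index.
function* innerJoinStream(
  left: Iterable<Row>,
  right: Iterable<Row>,
  leftKey: string,
  rightKey: string,
): Generator<Row> {
  const index = buildIndex(right, rightKey);
  for (const l of left) {
    const matches = index.get(l[leftKey]);
    if (!matches) continue;
    for (const r of matches) yield { ...r, ...l }; // left fields win on name clashes
  }
}

// Anti-join: emit left rows that have no match on the right.
function* antiJoinStream(
  left: Iterable<Row>,
  right: Iterable<Row>,
  leftKey: string,
  rightKey: string,
): Generator<Row> {
  const index = buildIndex(right, rightKey);
  for (const l of left) {
    if (!index.has(l[leftKey])) yield l;
  }
}

const leftRows: Row[] = [
  { id: "a", name: "alpha" },
  { id: "b", name: "beta" },
  { id: "c", name: "gamma" },
];
const rightRows: Row[] = [
  { ref: "a", target: "x1" },
  { ref: "a", target: "x2" },
  { ref: "c", target: "x3" },
];

const joined = Array.from(innerJoinStream(leftRows, rightRows, "id", "ref"));
const unmatched = Array.from(antiJoinStream(leftRows, rightRows, "id", "ref"));
console.log(joined.length, unmatched.length); // 3 1
```

Memory here is bounded by the indexed side plus one output row at a time, which is why the choice of which side to pass as `right` matters.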

Why now (not deferred to v0.1.7+)

The recipe runtime is the next piece (gap #1 + #5 from the 2026-05-19 shift-left audit). Once the runtime is wired with array-based primitives, every consumer (pivot, future hierarchy, future table) assumes the full row-set is materialized. Refactoring after that costs much more. This is the cheap moment.

Out-of-scope (defer to Ch 34 deliverable + v0.1.7+)

Item	Why deferred
Spill-to-disk for joins/set-ops/diffs exceeding RAM budget	Needs Ch 34 design validation (DuckDB out-of-core, Polars streaming, DataFusion chunked execution survey). Filing Ch 34 deliverable in parallel with this refactor.
Chunked-merge for distributed result sets	v0.2+ federation territory
Approximate aggregates (t-digest median, HyperLogLog distinct count)	Niche; recipe-author can fall back to Tier 2 SQL for now
Recipe-runtime composer (“primitive pipeline executor”)	Phase 7 or v0.1.7 Phase 1; depends on this refactor landing first

Migration / backward-compat

Surface	Plan
Existing array-based call sites in tests	Keep the array overloads — `bind(rows: Row[], ...) → Row[]` calls `bindStream(rows, ...)` internally and materializes
Existing integration tests (Phase 6.1)	Continue passing; array overloads stay valid
New streaming call sites in recipe runtime	Use `*Stream` variants directly; consume via `for` / `for await`
Public API surface (when v0.1.7 recipe-runtime composer ships)	Generic `Iterable<Row>` parameter accepts both arrays AND generators
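A minimal sketch of the compatibility plan in the table above: the array overload delegates to the generator and materializes, while the `Iterable<Row>` parameter accepts arrays, Sets, and generators alike. The `bindStream` signature shown here (a field name plus a compute function) is an assumption for illustration; the source does not spell out the real one.

```typescript
type Row = Record<string, unknown>;

// Hypothetical bindStream signature: add one computed field per row.
function* bindStream(
  rows: Iterable<Row>,
  name: string,
  compute: (r: Row) => unknown,
): Generator<Row> {
  for (const row of rows) {
    yield { ...row, [name]: compute(row) };
  }
}

// Array overload kept for existing call sites: delegates, then materializes.
function bind(rows: Row[], name: string, compute: (r: Row) => unknown): Row[] {
  return Array.from(bindStream(rows, name, compute));
}

const input: Row[] = [{ n: 1 }, { n: 2 }, { n: 3 }];
const square = (r: Row): unknown => {
  const n = r["n"];
  return typeof n === "number" ? n * n : null;
};

function* asGenerator(): Generator<Row> {
  yield* input;
}

// Same logic, three kinds of input: an array, a Set, and a generator.
const fromArray = bind(input, "sq", square);
const fromSet = Array.from(bindStream(new Set(input), "sq", square));
const fromGenerator = Array.from(bindStream(asGenerator(), "sq", square));
```

Because the array form is a thin wrapper, a streaming-vs-array equivalence test reduces to comparing `fromArray` with the materialized streaming output.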

Implementation plan (this commit)

src/views/filter-primitive.ts (new) — extract filter from “implicit Bases-native” to an explicit module with both filter(rows[], pred) array form and filterStream(rows, pred) generator form
src/views/bind-primitive.ts — add bindStream() + bindManyStream() generators
src/views/join-primitives.ts — add *Stream variants: hash-build on right side, stream left
src/views/set-op-primitive.ts — add streaming union/intersection/difference (right side hashed)
src/views/diff-primitive.ts — documented as array-only (inherently set-comparison); no streaming variant
tests/integration/streaming-primitives.test.ts (new) — exercise streaming variants over realistic fixtures wrapped as iterables; assert same output as array versions
Keep existing array overloads; existing tests stay green
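For the set-op item in the plan above, a streaming shape with the right side hashed could be sketched like this. The names, the key-function parameter, and the choice to de-duplicate output (set semantics rather than bag semantics) are assumptions; the actual module may make different choices.

```typescript
type Row = Record<string, unknown>;
type KeyFn = (r: Row) => string;

// Hash the right side's keys once; this is the only buffering step.
function keySet(rows: Iterable<Row>, key: KeyFn): Set<string> {
  const keys = new Set<string>();
  for (const row of rows) keys.add(key(row));
  return keys;
}

// Intersection: stream left, keep rows whose key is on the right (first occurrence only).
function* intersectionStream(left: Iterable<Row>, right: Iterable<Row>, key: KeyFn): Generator<Row> {
  const rightKeys = keySet(right, key);
  const emitted = new Set<string>();
  for (const row of left) {
    const k = key(row);
    if (rightKeys.has(k) && !emitted.has(k)) {
      emitted.add(k);
      yield row;
    }
  }
}

// Difference: stream left, keep rows whose key is absent on the right.
function* differenceStream(left: Iterable<Row>, right: Iterable<Row>, key: KeyFn): Generator<Row> {
  const rightKeys = keySet(right, key);
  const emitted = new Set<string>();
  for (const row of left) {
    const k = key(row);
    if (!rightKeys.has(k) && !emitted.has(k)) {
      emitted.add(k);
      yield row;
    }
  }
}

// Union: stream both sides through a single dedup set; no up-front hashing needed.
function* unionStream(left: Iterable<Row>, right: Iterable<Row>, key: KeyFn): Generator<Row> {
  const seen = new Set<string>();
  for (const side of [left, right]) {
    for (const row of side) {
      const k = key(row);
      if (!seen.has(k)) {
        seen.add(k);
        yield row;
      }
    }
  }
}

const byId: KeyFn = (r) => String(r["id"]);
const setLeft: Row[] = [{ id: 1 }, { id: 2 }, { id: 2 }, { id: 3 }];
const setRight: Row[] = [{ id: 2 }, { id: 3 }, { id: 4 }];

const ids = (rows: Iterable<Row>): string => Array.from(rows, byId).join(",");
const intersectionIds = ids(intersectionStream(setLeft, setRight, byId)); // "2,3"
const differenceIds = ids(differenceStream(setLeft, setRight, byId)); // "1"
const unionIds = ids(unionStream(setLeft, setRight, byId)); // "1,2,3,4"
```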

Ch 34 deliverable

The Ch 34 research brief was filed 2026-05-08, but the research itself has never been run. Queueing it now for a fresh-agent research session. The deliverable should answer:

What spill-to-disk pattern does DuckDB / Polars / DataFusion use when a join’s hash exceeds RAM?
How does sqlite-wasm handle out-of-core? Does it have a comparable pattern (temp tables in OPFS)?
What’s the right “scale hint” recipe authors declare (small / medium / large)?
Should Crosswalker route above-threshold queries to a worker thread for cooperative yielding?

Once the deliverable lands, the spill-to-disk + chunked-merge integration code goes into Phase 7 / v0.1.7 Phase 1.

Empirical validation (2026-05-19, desktop Electron, in-vault benchmark)

Ran the `Crosswalker: Run primitives benchmark (perf)` command at scales of 1k / 100k / 1M rows on real desktop hardware. Key results at 1M rows:

Op	array	stream	winner
filter	57.6ms	15.7ms	stream 3.7×
set-op-intersection	331ms	135ms	stream 2.4×
inner-join	690ms	434ms	stream 1.6×
left-outer-join	490ms	514ms	~tied
anti-join	105ms	107ms	~tied
diff (array-only)	643ms	—	slowest best-available path; can’t stream

Confirmed:

Streaming wins where it matters at scale: filter, set-op intersection, and inner join. The iterable-first refactor paid off as designed.
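left-outer-join / anti-join tie. Left-outer-join preserves every left row, so its output is input-sized either way; anti-join emits only the unmatched left rows, but both forms still probe every left row against the same hashed side, so there is no materialization advantage to gain.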
diff is the bottleneck once the streaming variants are in use (643ms, non-streamable, both sides indexed). It is the first candidate for Tier 2 SQL spill at ontology-web scale (Ch 34 work); inner-join at 1M+ is the second.

Methodology caveats (honest):

The heapDeltaBytes numbers are GC-noisy and directional only — negative deltas (-107MB, -615MB) mean GC fired mid-op, not that memory shrank. performance.memory around a single sync call catches collection events. Real retained-memory measurement needs forced GC between ops (--expose-gc, unavailable in Obsidian) or steady-state sampling. Treat heap data as “this op allocates a large working set” (joins/diff/bind ~80–110MB at 1M) vs “this op stays flat” (scan-shaped ops), not as precise figures.
The benchmark harness under-sells streaming’s memory win: it calls consume() to materialize every streaming result into an array for apples-to-apples timing. In a real pipeline (filterStream → joinStream → render first N) the full intermediate is never materialized — that win is invisible to this harness, which only proves the speed win.
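The second caveat can be illustrated with a small sketch that counts how many source rows are produced when a pipeline is consumed lazily versus materialized harness-style. `take`, `bigSource`, and the `produced` counter are hypothetical helpers for this example, and the join stage of the real pipeline is left out for brevity.

```typescript
type Row = Record<string, unknown>;

function* filterStream(rows: Iterable<Row>, pred: (r: Row) => boolean): Iterable<Row> {
  for (const row of rows) {
    if (pred(row)) yield row;
  }
}

// Hypothetical helper: stop after the first n rows. Returning from the loop
// closes the upstream generators, so no further rows are produced.
function* take<T>(rows: Iterable<T>, n: number): Generator<T> {
  if (n <= 0) return;
  let taken = 0;
  for (const row of rows) {
    yield row;
    taken++;
    if (taken >= n) return;
  }
}

// Instrumented source: counts every row it produces.
let produced = 0;
function* bigSource(n: number): Generator<Row> {
  for (let i = 0; i < n; i++) {
    produced++;
    yield { id: i };
  }
}

const isEven = (r: Row): boolean => {
  const id = r["id"];
  return typeof id === "number" && id % 2 === 0;
};

// Lazy: only the source rows needed for 5 results are ever produced.
const firstFive = Array.from(take(filterStream(bigSource(1000000), isEven), 5));
const producedLazily = produced; // 9 (ids 0 through 8)

// Harness-style: materialize the full filtered result, then slice.
produced = 0;
const sliced = Array.from(filterStream(bigSource(1000000), isEven)).slice(0, 5);
const producedEagerly = produced; // 1000000
```

Both paths return the same five rows; the difference is the intermediate that the harness-style path builds and then mostly throws away.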

Scale verdict: ≤100k rows all sub-55ms (no concern). 1M heavy ops 0.4–0.7s each, briefly froze the UI (synchronous; expected). Desktop-fine; mobile would need chunking/worker-thread (Ch 34).

Related

v0.1.4.5 streaming refactor — IMPORT-side AsyncIterable<Row> shipped
Phase 5 scope log — the join substrate this refactor reshapes
Phase 6.1 integration tests — the realistic-data test foundation this builds on
Ch 34 — Streaming / chunked query execution at ontology-web scale — research brief; deliverable to be filed in parallel with this work
Query primitives concept — locked 8-primitive set; this refactor updates the implementation, not the verbs