Skip to content
🚧 Early alpha — building the foundation. See the roadmap →

Streaming primitive refactor — Layer A goes iterable-first

Created Updated
LayerThis refactor’s effect
A — Primitives (filter / traverse / bind / project / aggregate / anti-join / set-op / diff)API shape changes from Array → Array to Iterable → Iterable for the streamable subset
B — View shapesConsumers of primitive output (pivot, future table/list/hierarchy) materialize when they need a final shape; can short-circuit on cell-count guard before materializing
C — RecipesRecipe-runtime composer (not yet built) chains primitives via for await / for; result is consumed lazily, no full intermediate result-set in memory unless required
MechanismIn-memory Iterable<Row> for ≤budget; Tier 2 sqlite-wasm cursor for above-budget. Recipe declares scale hint; runtime dispatches

User pushed back (2026-05-19): “make sure the join logic on the backend is optimized from the beginning… streaming approach… certain operations aren’t optimized yet across tooling.”

The Phase 5+6 Layer A primitives shipped as Array → Array. That was fine for unit tests + integration tests (≤30 rows), but it’s the wrong shape for the recipe runtime we’re about to wire — once primitives have array call sites, every consumer assumes the full result is in memory.

Decision: refactor every primitive to iterable-first shape NOW, before the recipe runtime wiring. Keep array overloads for compatibility with existing tests + small-data ergonomics. Document the streamability matrix per primitive.

PrimitiveStreamable single-pass?Implementation
filter✅ Trivially — generatorfunction* filterStream(rows: Iterable<Row>, pred: (r: Row) => boolean): Iterable<Row>
bind✅ Trivially — generatorPure row-by-row transform
project✅ Trivially — generatorPick-subset; row-by-row
aggregate (count/sum/min/max)✅ Accumulator-basedSingle-pass, constant memory
aggregate (median/percentile)⚠️ Needs full data OR approximationt-digest or sorted-sample for large inputs; defer to Tier 2 SQL for v0.1.6+
inner / outer / anti join⚠️ Hybrid — hash smaller side, stream largerMemory bounded by smaller side. Standard hash-join (DuckDB/Polars). Caller can choose which side to materialize.
set-op (union/intersection/difference)⚠️ Hybrid — hash one side, stream the otherSame shape as joins. Union is streamable from one side + dedup-set.
diff❌ Both sides indexedInherently set-comparison; stays array-based. Document the contract.
traverse(depth=*) / closure❌ Iterative-fixpoint; non-streaming by natureStays Tier 2 SQL (recursive CTE).

User authored ChunkyCSV (chunked-iterable CSV processing) — same mental model: chunks flow through transforms, hash-build is the only operation that buffers, and the smaller side gets buffered while the larger streams. Crosswalker’s IMPORT side already aligns with this (PapaParse → AsyncIterable<Row> → generation engine, shipped v0.1.4.5). The QUERY-side primitives now adopt the same shape.

What we borrow from ChunkyCSV: the pattern. What we have to write ourselves: (a) integration with Tier 2 sqlite-wasm for spill-to-disk when in-memory hash exceeds budget (novel — pending Ch 34 deliverable for design validation), (b) the recipe-runtime composer that chains primitives lazily (application-specific). Hash-build join, accumulator aggregate, generator filter/bind/project are standard CS — no novel algorithms.

The recipe runtime is the next piece (gap #1 + #5 from the 2026-05-19 shift-left audit). Once the runtime is wired with array-based primitives, every consumer (pivot, future hierarchy, future table) assumes the full row-set is materialized. Refactoring after that costs much more. This is the cheap moment.

Out-of-scope (defer to Ch 34 deliverable + v0.1.7+)

Section titled “Out-of-scope (defer to Ch 34 deliverable + v0.1.7+)”
ItemWhy deferred
Spill-to-disk for joins/set-ops/diffs exceeding RAM budgetNeeds Ch 34 design validation (DuckDB out-of-core, Polars streaming, DataFusion chunked execution survey). Filing Ch 34 deliverable in parallel with this refactor.
Chunked-merge for distributed result setsv0.2+ federation territory
Approximate aggregates (t-digest median, HyperLogLog distinct count)Niche; recipe-author can fall back to Tier 2 SQL for now
Recipe-runtime composer (“primitive pipeline executor”)Phase 7 or v0.1.7 Phase 1; depends on this refactor landing first
SurfacePlan
Existing array-based call sites in testsKeep the array overloads — bind(rows: Row[], ...) → Row[] calls bindStream(rows, ...) internally and materializes
Existing integration tests (Phase 6.1)Continue passing; array overloads stay valid
New streaming call sites in recipe runtimeUse *Stream variants directly; consume via for / for await
Public API surface (when v0.1.7 recipe-runtime composer ships)Generic Iterable<Row> parameter accepts both arrays AND generators
  1. src/views/filter-primitive.ts (new) — extract filter from “implicit Bases-native” to an explicit module with both filter(rows[], pred) array form and filterStream(rows, pred) generator form
  2. src/views/bind-primitive.ts — add bindStream() + bindManyStream() generators
  3. src/views/join-primitives.ts — add *Stream variants: hash-build on right side, stream left
  4. src/views/set-op-primitive.ts — add streaming union/intersection/difference (right side hashed)
  5. src/views/diff-primitive.ts — documented as array-only (inherently set-comparison); no streaming variant
  6. tests/integration/streaming-primitives.test.ts (new) — exercise streaming variants over realistic fixtures wrapped as iterables; assert same output as array versions
  7. Keep existing array overloads; existing tests stay green

Filed 2026-05-08; never run. Queueing now for fresh-agent research session. The deliverable should answer:

  • What spill-to-disk pattern does DuckDB / Polars / DataFusion use when a join’s hash exceeds RAM?
  • How does sqlite-wasm handle out-of-core? Does it have a comparable pattern (temp tables in OPFS)?
  • What’s the right “scale hint” recipe authors declare (small / medium / large)?
  • Should Crosswalker route above-threshold queries to a worker thread for cooperative yielding?

Once the deliverable lands, the spill-to-disk + chunked-merge integration code goes into Phase 7 / v0.1.7 Phase 1.

Empirical validation (2026-05-19, desktop Electron, in-vault benchmark)

Section titled “Empirical validation (2026-05-19, desktop Electron, in-vault benchmark)”

Ran Crosswalker: Run primitives benchmark (perf) at scales 1k / 100k / 1M on real desktop hardware. Key results at 1M rows:

Oparraystreamwinner
filter57.6ms15.7msstream 3.6×
set-op-intersection331ms135msstream 2.4×
inner-join690ms434msstream 1.6×
left-outer-join490ms514ms~tied
anti-join105ms107ms~tied
diff (array-only)643msslowest; can’t stream

Confirmed:

  1. Streaming wins decisively where it matters — joins + set-ops + filter at scale. The iterable-first refactor paid off as designed.
  2. left-outer-join / anti-join tie — they preserve all left rows so output is input-sized either way; no materialization advantage.
  3. diff is the bottleneck (643ms, non-streamable) — both sides indexed. The candidate for Tier 2 SQL spill at ontology-web scale (Ch 34 work). inner-join at 1M+ is the second candidate.

Methodology caveats (honest):

  • The heapDeltaBytes numbers are GC-noisy and directional only — negative deltas (-107MB, -615MB) mean GC fired mid-op, not that memory shrank. performance.memory around a single sync call catches collection events. Real retained-memory measurement needs forced GC between ops (--expose-gc, unavailable in Obsidian) or steady-state sampling. Treat heap data as “this op allocates a large working set” (joins/diff/bind ~80–110MB at 1M) vs “this op stays flat” (scan-shaped ops), not as precise figures.
  • The benchmark harness under-sells streaming’s memory win: it calls consume() to materialize every streaming result into an array for apples-to-apples timing. In a real pipeline (filterStream → joinStream → render first N) the full intermediate is never materialized — that win is invisible to this harness, which only proves the speed win.

Scale verdict: ≤100k rows all sub-55ms (no concern). 1M heavy ops 0.4–0.7s each, briefly froze the UI (synchronous; expected). Desktop-fine; mobile would need chunking/worker-thread (Ch 34).