v0.1.4.5 shipped — Streaming refactor (engine accepts row iterator)
What shipped
Milestone v0.1.4.5: Streaming refactor. Status flipped to ✅ in the milestone hub.
The bundled engine is now streaming-by-design. ParsedData.rows accepts either an eager array (small/medium data — wizard preview, in-memory imports) or an AsyncIterable<Row> (true streaming — large CSV files, external pipes from ChunkyCSV/JSONaut, etc.). The generation engine iterates via for await ... of so both forms work transparently. The full source dataset never lives in RAM during streaming imports.
This is the implementation of Mode 1 (bundled projector) at scale per the 2026-05-05 two-mode architecture decision.
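The union and the shared loop described above can be sketched as follows. This is a minimal illustration, not the project's source: the `Row` shape, `collectIds`, and `streamRows` are hypothetical names, and only `isEagerRows` and the `Row[] | AsyncIterable<Row>` union come from this log.

```typescript
// Hypothetical row shape; the real Row type lives in src/types/config.ts.
type Row = Record<string, string>;
type Rows = Row[] | AsyncIterable<Row>;

// Type guard: true when rows is an eager array, so random access is safe.
function isEagerRows(rows: Rows): rows is Row[] {
  return Array.isArray(rows);
}

// One loop for both forms: `for await` accepts sync and async iterables.
async function collectIds(rows: Rows): Promise<string[]> {
  const ids: string[] = [];
  for await (const row of rows) {
    ids.push(row.id ?? "");
  }
  return ids;
}

// Hypothetical streaming source that produces rows one at a time.
async function* streamRows(total: number): AsyncGenerator<Row> {
  for (let i = 0; i < total; i++) {
    yield { id: String(i) };
  }
}
```

Callsites that need `.length` or indexing would check `isEagerRows(rows)` first; everything else can use the loop unchanged.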
| Surface | Delivered |
|---|---|
| `src/types/config.ts` `ParsedData.rows` | Now `Row[] \| AsyncIterable<Row>`. New `isEagerRows()` type guard for callsites that need random access. |
| `src/import/parsers/csv-parser.ts` `parseCSVFileStream()` | New: returns `ParsedData` where `rows` is an `AsyncIterable` that pulls per-row from PapaParse via the `step` callback. Backpressure: parser pauses at a 100-row buffer, resumes when the consumer drains below 10. Memory ceiling ~100 × avg-row-bytes. |
| `src/generation/generation-engine.ts` `generateNotes` + `generateFromRecipe` | Per-row loops refactored from `for (let i = 0; ...)` to `for await (const row of ...)`. Both forms (array + `AsyncIterable`) iterate identically. |
| `analyzeColumns()` (csv-parser) | Type-guarded against streaming sources: returns shape-only column info when `rows` is `AsyncIterable` (column type detection requires materialization, which defeats streaming). |
| `estimateOutput()` (generation-engine) | Type-guarded: only computes folder/link estimates for eager-array sources. |
| Wizard preview (`import-wizard.ts`) | Type-guarded: preview path only works with eager arrays (which is what `parseCSVFile()` returns). Streaming-mode imports use the wizard’s import-run path directly without going through preview. |
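The backpressure rule in the table above (pause at 100 buffered rows, resume below 10) can be sketched as a bounded buffer exposed as an `AsyncIterable`. This is an assumption-laden sketch, not the code in `csv-parser.ts`: `RowBuffer` and `FlowControl` are hypothetical names, and `FlowControl` merely stands in for the parser handle that PapaParse passes to `step`.

```typescript
type Row = Record<string, string>;

// Stand-in for the parser handle; only pause/resume are assumed.
interface FlowControl {
  pause(): void;
  resume(): void;
}

const HIGH_WATER = 100; // pause the producer at this buffer size
const LOW_WATER = 10; // resume once the buffer drains below this

class RowBuffer implements AsyncIterable<Row> {
  private queue: Row[] = [];
  private done = false;
  private paused = false;
  private wake: (() => void) | null = null;
  private flow: FlowControl | null = null;

  // Called by the producer for each parsed row.
  push(row: Row, flow: FlowControl): void {
    this.flow = flow;
    this.queue.push(row);
    if (!this.paused && this.queue.length >= HIGH_WATER) {
      this.paused = true;
      flow.pause();
    }
    this.notify();
  }

  // Called by the producer when the input is exhausted.
  end(): void {
    this.done = true;
    this.notify();
  }

  // Wakes a consumer that is waiting on an empty buffer.
  private notify(): void {
    const w = this.wake;
    this.wake = null;
    if (w) w();
  }

  async *[Symbol.asyncIterator](): AsyncGenerator<Row> {
    while (true) {
      while (this.queue.length > 0) {
        const row = this.queue.shift()!;
        if (this.paused && this.queue.length < LOW_WATER && this.flow) {
          this.paused = false;
          this.flow.resume();
        }
        yield row;
      }
      if (this.done) return;
      await new Promise<void>((resolve) => (this.wake = resolve));
    }
  }
}
```

With this shape the buffer never holds more than `HIGH_WATER` rows, which is where the "~100 × avg-row-bytes" ceiling would come from.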
| Suite | Count | Detail |
|---|---|---|
| Jest unit tests | 116 (was 108, +8) | New: tests/streaming-async-iter.test.ts — isEagerRows type guard + AsyncIterable consumption pattern + memory-pattern check |
| WebDriver E2E | 37 across 8 spec files (was 33 across 7) | New: tests/e2e/streaming.spec.ts — 4 tests: AsyncIterable source via runImportFromRecipe, correct file paths, frontmatter content, eager-array backwards-compat |
| Total | 153 passing | All green before commit |
The new `streaming.spec.ts` is the v0.1.4.5 success-criterion gate. It:
- Builds an `AsyncIterable<Row>` in the renderer-side `executeObsidian` callback (mimics what an external producer would hand to `plugin.runImportFromRecipe`)
- Yields 100 rows on demand via `Symbol.asyncIterator.next()`; never materializes an array
- Asserts the engine consumes them via `runImportFromRecipe(parsedData, recipe, {...})` with `parsedData.rowCount: -1` (unknown count, streaming)
- Asserts 100 Tier 1 files appear at expected paths
- Reads frontmatter for a sample file via `metadataCache`; verifies `_crosswalker` provenance + managed values are correct
- Re-runs with eager array (degenerate case) → verifies backwards-compat
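The on-demand source that the spec builds might look roughly like the sketch below. Everything here is illustrative: `ParsedDataLike`, `makeLazySource`, `consume`, the `control_id` column, and the `controls/` path are hypothetical, and `consume` only stands in for the engine rather than reproducing `runImportFromRecipe`.

```typescript
type Row = Record<string, string>;

// Hypothetical subset of ParsedData; the real type has more fields.
interface ParsedDataLike {
  rows: Row[] | AsyncIterable<Row>;
  rowCount: number; // -1 means "unknown count, streaming"
}

// Builds a source whose rows are created only when next() is called.
function makeLazySource(total: number): ParsedDataLike & { pulled: () => number } {
  let pulled = 0;
  const rows: AsyncIterable<Row> = {
    [Symbol.asyncIterator]() {
      return {
        async next(): Promise<IteratorResult<Row>> {
          if (pulled >= total) return { done: true, value: undefined };
          pulled++;
          return { done: false, value: { control_id: `C-${pulled}` } };
        },
      };
    },
  };
  return { rows, rowCount: -1, pulled: () => pulled };
}

// Stand-in for the engine: derives one hypothetical file path per row.
async function consume(data: ParsedDataLike): Promise<string[]> {
  const paths: string[] = [];
  for await (const row of data.rows) {
    paths.push(`controls/${row.control_id}.md`);
  }
  return paths;
}
```

Because rows exist only inside `next()`, no array of 100 rows is ever built on the producer side, which is the property the spec is checking.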
Notable design decisions made during implementation
- Type union, not breaking change. `ParsedData.rows: Row[] | AsyncIterable<Row>` is backwards-compatible: existing callers that pass arrays continue to work. Only callsites that need `.length` / `.slice` / `[index]` need a type guard. Five callsites in the codebase needed it; all were straightforward.
- `for await ... of` works on both forms. TypeScript narrows correctly inside the loop. No need for explicit conversion. The engine’s per-row code is identical across forms.
- Backpressure via PapaParse `parser.pause()` / `resume()`. PapaParse’s step callback receives a parser handle that supports flow control. We pause when the buffer hits 100 rows and resume when the consumer drains below 10, so the memory ceiling is bounded.
- `rowCount: -1` signals “unknown count, streaming.” Engine code that uses `rowCount` for progress reporting falls back to “row N processed” without a percentage when the count is unknown.
- Wizard preview is unchanged. The wizard still loads a small `ParsedData` via `parseCSVFile()` (eager array form) for the column-config + preview steps, because the preview UX requires random access. The only thing that changes for the wizard is when it triggers the streaming import: for files larger than 5MB, the wizard could use `parseCSVFileStream()` for the actual generation pass while keeping `parseCSVFile()` for the preview. v0.1.4.5 ships the foundation; wizard threshold-switching is a small follow-on UI change.
- No worker offload. Per Ch 23 §9.5, Web Workers are unreliable for this workload (ParsedData transfer cost; debugger UX; mobile portability). Main-thread cooperative-yield via `await new Promise(setTimeout)` between batches is the v0.1 path. The streaming refactor doesn’t change this: PapaParse still runs on the main thread.
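Two of the decisions above, the `rowCount: -1` progress fallback and the cooperative yield between batches, can be combined in one small sketch. The names `progressLabel`, `yieldToEventLoop`, `processInBatches`, and the `batchSize` parameter are assumptions for illustration; the log does not state the engine's actual batch size or function names.

```typescript
type Row = Record<string, string>;

// Percentage when the count is known; plain counter when rowCount is -1.
function progressLabel(processed: number, rowCount: number): string {
  if (rowCount > 0) {
    return `${Math.round((processed / rowCount) * 100)}%`;
  }
  return `row ${processed} processed`;
}

// Hands control back to the event loop so the UI can repaint.
function yieldToEventLoop(): Promise<void> {
  return new Promise<void>((resolve) => setTimeout(resolve, 0));
}

async function processInBatches(
  rows: Row[] | AsyncIterable<Row>,
  rowCount: number,
  batchSize: number, // hypothetical tuning knob
  handleRow: (row: Row) => void,
  onProgress: (label: string) => void,
): Promise<number> {
  let processed = 0;
  for await (const row of rows) {
    handleRow(row);
    processed++;
    // Report and yield once per batch, not once per row.
    if (processed % batchSize === 0) {
      onProgress(progressLabel(processed, rowCount));
      await yieldToEventLoop();
    }
  }
  return processed;
}
```

Yielding per batch rather than per row keeps the timer overhead small while still letting the main thread breathe during long imports.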
How v0.1.4.5 plugs into the system
Interfaces this milestone introduces / changes
| Interface | Status |
|---|---|
| `ParsedData.rows: Row[] \| AsyncIterable<Row>` | ✅ Type union; backwards-compatible |
| `isEagerRows()` type guard | ✅ Exported from `src/types/config.ts` |
| `parseCSVFileStream(file, options) → ParsedData` | ✅ NEW; returns `ParsedData` with `AsyncIterable` rows |
| Engine per-row loops use `for await` | ✅ Live in `generateNotes` + `generateFromRecipe` |
| Type guards added to `analyzeColumns`, `estimateOutput`, wizard preview | ✅ Live; gracefully handle streaming sources |
What did NOT change in this milestone
- Wizard UI surfaces: same column-config flow, same preview rendering. Only the underlying `ParsedData` rows-type became broader.
- Recipe schema: unchanged.
- Tier 1 schema: unchanged.
- `render()`: unchanged. Pure function; per-row input/output identical.
- Existing E2E specs: all 33 prior E2E tests pass without modification (the eager-array path is the degenerate case of the new union).
- Web Worker offload: still not used. Main-thread cooperative-yield per Ch 23 §9.5.
- Wizard auto-streaming threshold: the wizard still uses `parseCSVFile()` for everything. Threshold-based switching to `parseCSVFileStream()` for >5MB files is a small UI follow-on; not v0.1.4.5 scope.
Memory rules followed this session
- ✅ Always test thoroughly: the milestone gate is the `streaming.spec.ts` E2E + 8 new unit tests + verification that all 33 prior E2E tests still pass
- ✅ No personal data in commits/logs (pre-commit sweep clean)
- ✅ No AI co-author attribution in commits
What this unblocks for v0.1.5 (Tier 2 sqlite-wasm sidecar)
- The Tier 2 projector (next milestone) walks the vault `.md` files to populate the SQLite sidecar. With the streaming foundation in place, the projector can iterate vault files lazily: `vault.getMarkdownFiles()` returns the file list, but reading + parsing frontmatter happens per-file with cooperative yielding, never accumulating the full vault state in RAM.
- The `closure_cache` recursive CTE materialization can stream through the `mappings` table in batches without exhausting the WASM heap.
- External producers can now hand the engine a streaming source via `plugin.runImportFromRecipe()` directly. ChunkyCSV / JSONaut composition is now real: they emit `AsyncIterable` rows; the engine consumes them; the Tier 1 vault materializes file-by-file.
Related
Concept pages:
- ETL and import (two-mode architecture) — the architecture this milestone implements
- Terminology — ParsedData (now streaming-friendly), AsyncIterable, streaming projector
- What makes Crosswalker unique — composability with external ETL is a differentiator
Agent context:
Design decisions (synthesis logs):
- 2026-05-05 two-mode architecture decision — what triggered this milestone
- Ch 22 synthesis — render() purity makes streaming safe
- Ch 23 synthesis — main-thread cooperative-yield commitment
- v0.1.4 delivery log — preceding milestone
Research challenges:
External producer ecosystem (Mode 1 feeders this enables):
- ChunkyCSV (user’s tool) — natural feeder for streaming CSV cleanup
- JSONaut (user’s tool) — natural feeder for streaming JSON cleanup
Other milestones:
- v0.1.4 — Junction notes + crosswalk edges — dependency
- v0.1.5 — Tier 2 sidecar — next milestone; benefits from streaming foundation
Next milestone
v0.1.5 (Tier 2 sqlite-wasm sidecar projector) projects all three Tier 1 shapes into a `.crosswalker.sqlite` sidecar with sqlite-vec for vector queries. The projector iterates vault files lazily (per the streaming foundation just shipped). The per-milestone E2E spec is `tests/e2e/sidecar.spec.ts`.