v0.1.4.5 — Streaming refactor (engine accepts row iterator)
Make the bundled engine streaming-by-design so it can handle source files larger than RAM. Refactor `generateFromRecipe` and `generateNotes` to iterate via `for await` over an `AsyncIterable<Row>` instead of pulling rows from an in-memory `ParsedData.rows[]` array. PapaParse’s `step` callback feeds the async iterator directly, so rows never accumulate beyond a small bounded buffer.
This is the implementation of Mode 1 (bundled projector) at scale, per the 2026-05-05 two-mode architecture decision. Without it, ChunkyCSV / JSONaut composition with the bundled engine caps at RAM-sized inputs, defeating the purpose.
Status
✅ Done (2026-05-05). `ParsedData.rows: Row[] | AsyncIterable<Row>` shipped with full backwards compatibility. The new `parseCSVFileStream()` returns a `ParsedData` whose `rows` is an `AsyncIterable` that pulls rows one at a time from PapaParse via the `step` callback, with backpressure (the buffer caps at 100 rows). The engine's `for await` loops in `generateNotes` and `generateFromRecipe` consume both forms identically. 153 tests passing (116 unit + 37 E2E across 8 spec files, including the new `streaming.spec.ts` with 4 tests covering AsyncIterable consumption and eager-array backwards compatibility).
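The backpressure described above can be illustrated with a minimal sketch of a bounded channel between a push-style callback and a pull-style `for await` consumer. This is not the shipped code: `RowChannel`, `FlowControl`, and `fakeParse` are hypothetical names, and `FlowControl` only stands in for whatever pause/resume handle the real parser provides. The cap of 100 mirrors the figure quoted in the status note.

```typescript
type Row = Record<string, string>;

// Hypothetical stand-in for the pause/resume handle a push-style parser
// passes to its per-row callback.
interface FlowControl {
  pause(): void;
  resume(): void;
}

// Bridges a push-style row callback to a pull-style AsyncIterable,
// holding at most `cap` rows at a time.
class RowChannel implements AsyncIterable<Row> {
  peak = 0; // largest buffer size seen; exposed only for the demo
  private buffer: Row[] = [];
  private wakeConsumer: (() => void) | null = null;
  private paused = false;
  private done = false;
  private failure: Error | null = null;

  constructor(private control: FlowControl, private cap = 100) {}

  // Producer side: call once per parsed row.
  push(row: Row): void {
    this.buffer.push(row);
    this.peak = Math.max(this.peak, this.buffer.length);
    if (this.buffer.length >= this.cap && !this.paused) {
      this.paused = true;
      this.control.pause(); // backpressure: stop the producer
    }
    this.wake();
  }

  end(): void {
    this.done = true;
    this.wake();
  }

  fail(err: Error): void {
    this.failure = err;
    this.wake();
  }

  private wake(): void {
    const wakeUp = this.wakeConsumer;
    this.wakeConsumer = null;
    if (wakeUp) wakeUp();
  }

  // Consumer side: drained by `for await`.
  async *[Symbol.asyncIterator](): AsyncGenerator<Row> {
    for (;;) {
      while (this.buffer.length > 0) {
        const row = this.buffer.shift() as Row;
        if (this.paused && this.buffer.length === 0) {
          this.paused = false;
          this.control.resume(); // buffer drained: let the producer continue
        }
        yield row;
      }
      if (this.failure) throw this.failure; // buffered rows are delivered first
      if (this.done) return;
      // Park until the producer pushes, ends, or fails.
      await new Promise<void>((resolve) => { this.wakeConsumer = resolve; });
    }
  }
}

// Hypothetical synchronous producer used only to exercise the channel.
function fakeParse(total: number, cap: number): RowChannel {
  let emitted = 0;
  let paused = false;
  const pump = (): void => {
    while (!paused && emitted < total) channel.push({ id: String(emitted++) });
    if (emitted >= total) channel.end();
  };
  const channel: RowChannel = new RowChannel(
    { pause: () => { paused = true; }, resume: () => { paused = false; pump(); } },
    cap,
  );
  pump();
  return channel;
}
```

With a real parser, `push` would be called from the per-row callback, `end` from the completion callback, and `fail` from the error callback. Whether the shipped parser pauses at exactly the cap or allows some overshoot is not stated here.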
Dependencies
- v0.1.4: Junction notes + crosswalk edges (engine + Path A native recipe entry shipped)
- 2026-05-05 two-mode architecture decision (commits to streaming-by-design)
Blocks: v0.1.5 — Tier 2 sidecar (sidecar projector also iterates the vault streaming-style; cleanest if engine streaming foundation is in place first)
In:
- `ParsedData.rows: Row[] | AsyncIterable<Row>`: streaming-friendly union; backwards-compatible with existing array consumers
- `parseCSVStream(file)`: returns `AsyncIterable<Row>` directly via PapaParse `step` callback piped to async generator (no `results.data[]` accumulation)
- `generateFromRecipe` and `generateNotes` accept either form via `for await` iteration
- E2E test (`tests/e2e/streaming.spec.ts`): feeds a large synthetic CSV (~1M rows or memory-budget-equivalent), confirms RAM doesn’t grow with input size, confirms output Tier 1 vault is correct
- Wizard preview unchanged: still loads first ~50 rows into a small `ParsedData` for preview UX
- Wizard import-run uses streaming path automatically when source > 5MB (existing `shouldUseStreaming()` threshold)
Out:
- CLI pipe interface (`chunkycsv messy.xlsx | crosswalker-cli import`): defer to v0.5+
- XML / RDF streaming parsers: defer to v0.2+
- Web Worker offload of parse: Ch 23 §9.5 says workers are unreliable for this; main-thread cooperative-yield is the v0.1 path
- Programmatic AsyncIterable callers (advanced composition pattern): exposed via `plugin.runImportFromRecipe` API; doc page on it deferred to v0.2
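For readers unfamiliar with the main-thread cooperative-yield approach named above, here is a minimal sketch of the general pattern. It assumes a timer-based yield and a batch size of 500. Both are illustrative choices, and `processCooperatively` and `yieldToEventLoop` are hypothetical names, not the plugin's API.

```typescript
// Hands control back to the event loop so timers, input, and rendering can run.
const yieldToEventLoop = (): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, 0));

// Processes rows on the main thread, pausing briefly after every `batchSize` rows.
// `batchSize` is an assumed tuning knob, not a documented setting.
async function processCooperatively<T>(
  rows: AsyncIterable<T>,
  handle: (row: T) => void,
  batchSize = 500,
): Promise<number> {
  let processed = 0;
  for await (const row of rows) {
    handle(row);
    processed++;
    if (processed % batchSize === 0) await yieldToEventLoop();
  }
  return processed;
}
```

Without the periodic yield, a long run of already-buffered rows can be consumed entirely through microtasks, which starves timers and UI updates until the loop ends.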
Concrete tasks
- Update `ParsedData` type in `src/types/config.ts`: `rows: Row[] | AsyncIterable<Row>`
- New `src/import/parsers/csv-parser-stream.ts`: `parseCSVStream(file): AsyncIterable<Row>`; PapaParse `step` callback writes into a queue; async generator pulls from the queue
- Refactor `generateFromRecipe` (in `src/generation/generation-engine.ts`): change row loop from `for (let i = 0; i < parsedData.rows.length; i++)` to `for await (const row of asyncRowsOf(rowSource))`; helper `asyncRowsOf` normalizes either array or iterator into AsyncIterable
- Refactor `generateNotes` similarly (legacy column-role path)
- Update `runImport` and `runImportFromRecipe` plugin handles to accept the streaming-friendly form
- Wizard: detect `shouldUseStreaming(file)`; if yes, use `parseCSVStream` directly without intermediate `ParsedData.rows[]` accumulation
- Tests:
  - Unit: streaming parser yields rows in order
  - Unit: engine consumes AsyncIterable and produces same output as it does for an array
  - E2E (`tests/e2e/streaming.spec.ts`): synthetic 1M-row CSV via streaming path; confirm import succeeds; confirm RAM growth stays bounded; confirm output count matches input row count
  - Existing E2E (full-import-flow, crosswalks, etc.) all still pass, since the array-as-iterator path is the degenerate case
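The `asyncRowsOf` normalization named in the tasks above could look roughly like the sketch below. Only the helper name comes from this page. The `Row` shape, the `RowSource` alias, and the `generateTitles` stand-in for the engine loop are assumptions made to keep the example self-contained.

```typescript
type Row = Record<string, string>;
type RowSource = Row[] | AsyncIterable<Row>;

// Normalizes either shape into an AsyncIterable so the engine needs one loop.
async function* asyncRowsOf(source: RowSource): AsyncGenerator<Row> {
  if (Array.isArray(source)) {
    // Eager array: the degenerate case, replayed row by row.
    for (const row of source) yield row;
  } else {
    // Already streaming: delegate without buffering.
    yield* source;
  }
}

// Hypothetical stand-in for the engine's per-row work. A real engine would
// write each note as it goes; collecting here is only to compare outputs.
async function generateTitles(source: RowSource): Promise<string[]> {
  const titles: string[] = [];
  for await (const row of asyncRowsOf(source)) {
    titles.push(`${row.id}: ${row.name}`);
  }
  return titles;
}

// Wraps an array as a lazy stream, for exercising the streaming branch.
async function* streamOf(rows: Row[]): AsyncGenerator<Row> {
  for (const row of rows) yield row;
}
```

Because both branches funnel into the same `for await` loop, a unit test can feed the same rows as an array and as `streamOf(rows)` and compare the two results directly.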
Success criteria
- All 33 existing E2E tests pass unchanged
- New `streaming.spec.ts` passes; RAM ceiling confirmed bounded (heap snapshot before/after import on 1M-row CSV stays within ~50 MB)
- Wizard import-run on >100MB CSV completes without OOM (manual smoke test)
- No backwards-incompatible API changes: existing callers (passing `ParsedData` with array `rows`) keep working without modification
Files to touch
- `src/types/config.ts`: `ParsedData.rows` type union
- `src/import/parsers/csv-parser.ts`: keep existing path; add `parseCSVStream` export
- `src/import/parsers/csv-parser-stream.ts`: NEW (or extend csv-parser.ts)
- `src/generation/generation-engine.ts`: `for await` loop; `asyncRowsOf()` helper
- `src/main.ts`: `runImport` + `runImportFromRecipe` signature compatibility
- `src/import/import-wizard.ts`: wizard import-run uses streaming path for large files
- `tests/csv-parser.test.ts`: extend with streaming test cases
- `tests/e2e/streaming.spec.ts`: NEW
Open questions
- Should `AsyncIterable` be the only shape going forward, with arrays auto-promoted? Lean: keep both for zero-friction backwards compat
- Wizard progress UX when `rowCountHint` is unknown (true streaming case): show “N rows processed” without a percentage? Lean: yes
- When the parser fails mid-stream (malformed row 500K of 1M), do we abort the import or continue past the bad row? Lean: configurable; default abort with clear error pointing at row N
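To make the last two questions concrete, here is one possible shape for the leanings above: a wrapper that applies a configurable bad-row policy and reports a running row count instead of a percentage. Nothing here is decided or shipped. `ParseResult`, `OnBadRow`, and `withErrorPolicy` are hypothetical names.

```typescript
type Row = Record<string, string>;

// Hypothetical per-row parse outcome; the real parser's error shape may differ.
type ParseResult = { ok: true; row: Row } | { ok: false; message: string };
type OnBadRow = "abort" | "skip";

// Applies the bad-row policy and reports progress as a plain row count,
// which works even when the total number of rows is unknown.
async function* withErrorPolicy(
  results: AsyncIterable<ParseResult>,
  onBadRow: OnBadRow = "abort",
  onProgress?: (rowsProcessed: number) => void,
): AsyncGenerator<Row> {
  let rowNumber = 0;
  for await (const result of results) {
    rowNumber++;
    if (result.ok) {
      yield result.row;
    } else if (onBadRow === "abort") {
      // Default: stop the import and point at the offending row.
      throw new Error(`Row ${rowNumber}: ${result.message}`);
    }
    // With "skip", a bad row is dropped but still counted as processed.
    onProgress?.(rowNumber);
  }
}
```

A wizard could pass `onProgress` a callback that renders "N rows processed". In the abort case, the consumer's `for await` loop rejects with the row-numbered error after the rows before it have been delivered.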
Related
Concept pages:
- ETL and import — two-mode architecture; streaming is at the engine boundary
- Terminology — ParsedData (now streaming-friendly), AsyncIterable, streaming projector
Agent context:
Design decisions (synthesis logs):
- 2026-05-05 two-mode architecture decision — what triggers this milestone
- 2026-05-05 ETL pipeline clarification — earlier framing
- Ch 22 synthesis — render() purity makes streaming safe
- Ch 23 synthesis — §9.5 main-thread cooperative-yield
Research challenges:
- Ch 25 — Two-mode architecture and streaming (resolved) — option-space + decision
External producer ecosystem (Mode 1 feeders this enables):
- ChunkyCSV (user’s tool) — natural feeder for streaming CSV cleanup
- JSONaut (user’s tool) — natural feeder for streaming JSON cleanup
- dbt / Polars / DuckDB — Mode 1 feeders for SQL-shaped sources
Other milestones:
- v0.1.4 — Junction notes + crosswalk edges — dependency
- v0.1.5 — Tier 2 sidecar — what this unblocks
- Milestone hub