
Challenge 25: Two-mode import architecture and streaming

During v0.1.5 planning, the user surfaced two intertwined concerns about the import architecture as it had been framed in the schema-as-primitive commitment:

  1. The architectural question: external ETL tools shouldn’t have to learn Markdown emission to be Crosswalker producers. ChunkyCSV / JSONaut were built specifically to clean messy sources into structured form (CSV / JSON). Forcing them to also emit Tier 1 Markdown turns them from “natural composable feeders” into “bespoke Crosswalker producers.” That’s a regression of composability.

  2. The streaming question: the current bundled engine uses PapaParse with streaming-mode parsing for files >5MB, but the rows still accumulate into ParsedData.rows[] before the engine starts iterating. This caps the engine at RAM-sized inputs. ChunkyCSV / JSONaut were built specifically to stream beyond RAM; the bundled engine should be streaming-by-design too.

The earlier framing — “external producers emit Tier 1 directly” (one path) — was insufficient. It made external ETL feel like a fallback rather than a first-class composition partner. This challenge re-opens the option-space and resolves it.

Option A — External producers emit Tier 1 directly (the earlier framing)

External tool → Tier 1 Markdown
Bundled engine → Tier 1 Markdown
(both produce Tier 1; no shared structured boundary)

Pros:

  • One contract (Tier 1 schema)
  • Pure schema-as-primitive
  • Validator surface is simple

Cons:

  • External tools have to learn Markdown emission
  • ChunkyCSV / JSONaut / dbt / Polars don’t fit naturally — they’re tabular tools, not Markdown tools
  • Producer ecosystem grows slower because the cost-of-entry is higher
  • Doesn’t address streaming (each producer handles it independently; bundled engine still has the in-memory bug)

Option B — Introduce a “Tier 0.5” intermediate schema

External tool → Tier 0.5 (some normalized JSON/CSV format) → Bundled engine → Tier 1

Pros:

  • External tools just emit one canonical structured shape
  • Bundled engine has a narrow, well-typed input
  • Streaming naturally happens at the ETL boundary

Cons:

  • Two contracts (Tier 0.5 + Tier 1)
  • Schema-as-primitive becomes “two-schemas-as-primitives” — weakens the central commitment
  • Yet another schema to design, document, validate, evolve
  • AI agents (Mode 2) shouldn’t have to materialize a Tier 0.5 representation when they already produce concepts

Option C — Two-mode architecture (the resolution)

Mode 1: External tool → CSV/JSON/XML/XLSX → Bundled engine → Tier 1
Mode 2: External tool → Tier 1 Markdown directly
(both modes first-class; no new intermediate schema)

Pros:

  • One canonical contract (Tier 1)
  • Bundled engine accepts existing structured formats (no new schema needed)
  • ChunkyCSV / JSONaut / dbt / Polars compose naturally as Mode 1 feeders
  • AI agents / MCP servers / marketplace bundles use Mode 2
  • Streaming refactor at the engine boundary fixes the OOM bug for Mode 1 and improves Mode 2 (which is naturally per-file streaming)
  • Schema-as-primitive preserved (Tier 1 is still THE contract; the “structured input” formats are just file types the engine knows how to parse, no new schema)

Cons:

  • Doc surface has to clearly explain both modes (mitigated by the updated etl-and-import concept page)
  • Wizard has to handle “small file → load and preview” + “large file → stream and progress-bar” cases (manageable)

This is the chosen option.

Investigation areas (preserved for posterity)


1. What structured formats should Mode 1 accept?

| Format | v0.1 ships with | Streaming primitive | Notes |
|---|---|---|---|
| CSV | ✅ Yes (PapaParse) | step callback → AsyncIterable | Most common ETL output |
| JSON (records array) | ✅ Yes | streaming JSON parser (e.g. stream-json) | Common from SQL warehouses |
| JSON (single object with nested arrays) | Partial | recursive descent (sync OK for small, requires streaming JSON for large) | OSCAL bundles fit this shape |
| XLSX | Partial (not yet integrated) | xlsx package; reads sheet-by-sheet | Frameworks like NIST CSF often arrive as XLSX |
| XML | ❌ Not v0.1 | sax-style streaming | RDF/OWL ontologies; defer to v0.2+ |
| RDF/Turtle | ❌ Not v0.1 | rdflib streaming | Defer to v0.2+; covered by Path B (external tool emits CSV) for now |
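
The "step callback → AsyncIterable" entry for CSV implies a small adapter between a push-style parser and the pull-style loop the engine wants. A minimal sketch of such a bridge might look like the following. `PushSource` and `rowsFromPushSource` are hypothetical names used only for illustration, and the callback contract is an assumption, not PapaParse's actual signature.

```typescript
type Row = Record<string, string>;

// Hypothetical push-style parser contract (illustrative only; this is
// not PapaParse's real signature).
interface PushSource {
  start(handlers: { onRow: (row: Row) => void; onEnd: () => void }): void;
}

// Adapts a push-style source into a pull-style AsyncIterable<Row>.
function rowsFromPushSource(source: PushSource): AsyncIterable<Row> {
  return {
    [Symbol.asyncIterator](): AsyncIterator<Row> {
      const queue: Row[] = []; // rows delivered but not yet consumed
      let ended = false;
      let started = false;
      let wake: (() => void) | null = null;

      // Wake the consumer if it is currently waiting for data.
      const notify = (): void => {
        const w = wake;
        wake = null;
        if (w) w();
      };

      return {
        async next(): Promise<IteratorResult<Row>> {
          if (!started) {
            // Start parsing lazily, on the first pull.
            started = true;
            source.start({
              onRow: (row) => { queue.push(row); notify(); },
              onEnd: () => { ended = true; notify(); },
            });
          }
          // Sleep until a row arrives or the source signals completion.
          while (queue.length === 0 && !ended) {
            await new Promise<void>((resolve) => { wake = () => resolve(); });
          }
          const row = queue.shift();
          if (row !== undefined) return { value: row, done: false };
          return { value: undefined, done: true };
        },
      };
    },
  };
}
```

This sketch buffers any rows the consumer has not yet pulled, so on its own it does not bound memory. A real adapter would presumably also pause the parser while the queue is non-empty and resume it when the consumer catches up.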

2. How does the bundled engine consume a streaming row source?

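interface StreamingRowSource {
    columns: string[];                              // header row
    rows: AsyncIterable<Row> | Iterable<Row>;        // streaming friendly
    rowCountHint?: number;                           // optional, for progress reporting
    sourceMeta?: { file?: string; bytes?: number };  // recorded under _crosswalker
}

async function generateFromRecipeStreaming(
    app: App,
    rowSource: StreamingRowSource,
    recipe: Recipe,
    options: RecipeImportOptions,
    debug?: DebugLog,
): Promise<GenerationResult> {
    // Assumed GenerationResult shape (field names are illustrative).
    const result: GenerationResult = { written: 0, skipped: 0, errors: [] };
    let rowNum = 0;
    for await (const row of rowSource.rows) {
        rowNum += 1;
        const identity = identityFromRow(row, recipe, rowNum);
        const address = render(recipe, identity);
        // sourceMeta lives on the row source, not in local scope
        const frontmatter = composeFrontmatter(address, row, recipe, rowSource.sourceMeta);
        const validation = validateTier1Frontmatter(frontmatter);
        if (!validation.valid && options.strict) {
            // strict mode: collect the error, skip the row
            result.errors.push({ rowNum, validation });
            result.skipped += 1;
            continue;
        }
        // renderBody is a placeholder for however the note body is built
        const body = renderBody(row, recipe);
        await mergeAndWrite(app, address.primary.path, frontmatter, body);
        result.written += 1;
        // row goes out of scope; GC reclaims
    }
    return result;
}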

Memory ceiling: O(1) in the number of rows, regardless of input size, because only the current row is live at any time.
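
The constant-memory claim rests on the producer and consumer running in lockstep: a row is produced only when the loop asks for the next one. The sketch below illustrates that behavior with an async generator. `produceRows` and `consumeRows` are illustrative stand-ins, not engine functions, and the `events` log exists only to make the ordering visible.

```typescript
type Row = Record<string, string>;

// Lazily yields rows one at a time, recording when each one is produced.
async function* produceRows(count: number, events: string[]): AsyncGenerator<Row> {
  for (let i = 1; i <= count; i++) {
    events.push(`produce ${i}`);
    yield { id: String(i) };
  }
}

// Consumes rows with `for await`. The parameter accepts sync or async
// sources, matching the union used for StreamingRowSource.rows.
async function consumeRows(
  rows: AsyncIterable<Row> | Iterable<Row>,
  events: string[],
): Promise<number> {
  let rowNum = 0;
  for await (const row of rows) {
    rowNum += 1;
    events.push(`consume ${row["id"]}`);
    // `row` is not retained after this iteration.
  }
  return rowNum;
}
```

Running `consumeRows(produceRows(3, events), events)` records the events in alternating order (produce 1, consume 1, produce 2, consume 2, and so on), which shows that no row is generated ahead of the consumer. A plain array is also accepted, since `for await` iterates sync iterables as well.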

3. How does the wizard handle small vs. large files?

  • For files ≤ 5MB: parse fully, show preview, run import via in-memory ParsedData (existing path).
  • For files > 5MB: parse a sample (first 50 rows) for preview only; on import-run, stream the full file row-by-row.
  • Progress reporting: use rowCountHint if available (CSV with known size); otherwise show “row N processed” without a percentage.
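
The size rule and the progress fallback described in these bullets can be sketched as two small pure functions. All names here (`planImport`, `progressLabel`, `ImportPlan`) are hypothetical, and reading "5MB" as 5 × 1024 × 1024 bytes is an assumption.

```typescript
// Assumption: "5MB" is read as 5 * 1024 * 1024 bytes.
const STREAMING_THRESHOLD_BYTES = 5 * 1024 * 1024;
const PREVIEW_SAMPLE_ROWS = 50;

type ImportPlan =
  | { kind: "in-memory" }                          // parse fully, preview, import
  | { kind: "streaming"; previewRows: number };    // sample for preview, stream on run

// Chooses the import path from the file size alone.
function planImport(fileBytes: number): ImportPlan {
  return fileBytes <= STREAMING_THRESHOLD_BYTES
    ? { kind: "in-memory" }
    : { kind: "streaming", previewRows: PREVIEW_SAMPLE_ROWS };
}

// Shows a percentage only when a usable row count hint is available.
function progressLabel(rowNum: number, rowCountHint?: number): string {
  if (rowCountHint !== undefined && rowCountHint > 0) {
    const pct = Math.min(100, Math.floor((rowNum / rowCountHint) * 100));
    return `row ${rowNum} of ${rowCountHint} (${pct}%)`;
  }
  return `row ${rowNum} processed`;
}
```

For example, `progressLabel(50, 200)` yields `row 50 of 200 (25%)`, while `progressLabel(7)` falls back to `row 7 processed`.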

4. How does composition with ChunkyCSV / JSONaut work in practice?


Three composition patterns:

(a) File handoff (simplest):

ChunkyCSV reads messy XLSX → writes cleaned.csv on disk
User opens Crosswalker wizard → picks cleaned.csv → runs import

(b) CLI pipe (v0.5+):

chunkycsv messy.xlsx --clean | crosswalker-cli import --recipe nist.json --vault ./my-vault
(deferred — v0.1 ships in-Obsidian wizard only)

(c) Programmatic (advanced):

JS/TS program runs ChunkyCSV → produces AsyncIterable<Row> → calls plugin.runImportFromRecipe directly

Pattern (a) is fully supported by v0.1.4.5. Patterns (b) and (c) are v0.5+ work.
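
As a rough illustration of pattern (c), an external cleaning stage can be written as an async generator and wrapped in the row-source shape the engine consumes. This is a sketch under assumptions: `cleanRows` stands in for an external tool such as ChunkyCSV (whose API is not shown here), and `collectIds` stands in for the plugin entry point, since the real `runImportFromRecipe` signature is not given in this page.

```typescript
type Row = Record<string, string>;

// Mirrors the StreamingRowSource shape shown earlier in this page.
interface StreamingRowSource {
  columns: string[];
  rows: AsyncIterable<Row> | Iterable<Row>;
  rowCountHint?: number;
}

// Hypothetical cleaning stage: trims keys and values, and drops rows
// whose cells are all empty. Rows pass through one at a time.
async function* cleanRows(
  raw: AsyncIterable<Row> | Iterable<Row>,
): AsyncGenerator<Row> {
  for await (const row of raw) {
    const cleaned: Row = {};
    for (const [key, value] of Object.entries(row)) {
      cleaned[key.trim()] = value.trim();
    }
    if (Object.values(cleaned).some((v) => v !== "")) yield cleaned;
  }
}

// Wraps a row stream in the structure the engine consumes.
function toRowSource(columns: string[], rows: AsyncIterable<Row>): StreamingRowSource {
  return { columns, rows };
}

// Stand-in consumer: drains the source and collects the "id" column.
async function collectIds(source: StreamingRowSource): Promise<string[]> {
  const ids: string[] = [];
  for await (const row of source.rows) ids.push(row["id"] ?? "");
  return ids;
}
```

Because each stage is a generator, chaining them keeps the whole pipeline lazy: the cleaner reads a raw row only when the consumer requests the next cleaned one.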

5. How does this interact with the streaming bug fix?


The streaming bug fix IS the implementation of Option C. They’re the same work. v0.1.4.5 ships:

  • ParsedData.rows: Row[] | AsyncIterable<Row> — streaming-friendly
  • parseCSVStream(file) returns AsyncIterable<Row> directly (no array accumulation)
  • generateFromRecipe iterates lazily via for await
  • E2E test for >5MB CSV file confirming RAM doesn’t grow with input size
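
To make "no array accumulation" concrete, here is a deliberately naive sketch of a chunk-to-row parser that holds only the current partial line between chunks. It is not the PapaParse-based `parseCSVStream`: the name `parseCsvChunks` is hypothetical, and the splitting logic assumes plain comma-separated text with no quoted fields.

```typescript
type Row = Record<string, string>;

// Naive sketch: splits on "\n" and ",", with no support for quoted
// fields or embedded newlines. Only `carry` and the header persist
// across chunks, so memory does not grow with the number of rows.
async function* parseCsvChunks(
  chunks: AsyncIterable<string> | Iterable<string>,
): AsyncGenerator<Row> {
  let carry = "";                       // partial line from the previous chunk
  let header: string[] | null = null;   // column names from the first line

  const toRow = (line: string, columns: string[]): Row => {
    const cells = line.split(",");
    const row: Row = {};
    columns.forEach((col, i) => { row[col] = cells[i] ?? ""; });
    return row;
  };

  for await (const chunk of chunks) {
    const lines = (carry + chunk).split("\n");
    carry = lines.pop() ?? "";          // last piece may be an incomplete line
    for (const rawLine of lines) {
      const line = rawLine.replace(/\r$/, "");
      if (line === "") continue;        // skip blank lines
      if (header === null) header = line.split(",");
      else yield toRow(line, header);
    }
  }

  // Flush a final line that had no trailing newline.
  const last = carry.replace(/\r$/, "");
  if (last !== "" && header !== null) yield toRow(last, header);
}
```

The chunk boundaries are arbitrary, so a line split across two chunks is reassembled through `carry` before it is parsed.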

The user was already sitting on the answer (their own ChunkyCSV / JSONaut work points directly at Mode 1 composition). The “earlier framing” gap was an unintentional narrowing of the schema-as-primitive commitment that I introduced mid-session, not a researched architectural decision. Once the user articulated their concern, the resolution was just “yes, both modes; here’s the streaming refactor that makes it real.”

No external research deliverable is needed. The decision is captured in the synthesis log and operationalized via the v0.1.4.5 streaming refactor milestone.