
ETL pipeline clarification — ParsedData is not a tier; Tier 1 is the only intermediate format

Mid-v0.1.5 planning (right after v0.1.4 shipped), the user asked about the import process and sketched their mental model:

ontology → transformed into → some probably nested data format with schema or rules that work + ImportRecipe (config file → translates into vault primitives)

They had a related project (folder-tag-sync) where the templating language is complex because it requires bidirectional bijectivity, and that work was bleeding into their mental model of Crosswalker. The specific confusion: they thought Crosswalker had a static intermediate file format between source and Tier 1 — a “normalized format Crosswalker excels at” that external ETL would produce and the bundled engine would consume.

A diagram I drew in chat introduced the term ParsedData and labeled it as an implementation detail, but the labeling was insufficient: it read like a tier in the architecture. This log clears that up and connects the architectural reframe (schema-as-primitive, Ch 22 + Ch 23 + Ch 24) to the v0.1 implementation milestones.

| Concept | What it IS | What it ISN’T |
| --- | --- | --- |
| ParsedData | An in-memory TypeScript interface ({columns, rows, rowCount}) used inside Path A’s bundled engine during a single import session. Implementation detail. | A persisted intermediate file format. NOT a tier. NOT something external producers consume or emit. |
| Tier 1 | The canonical Markdown vault format on disk, conforming to spec/tier1.schema.json. The load-bearing contract. The “normalized format Crosswalker excels at.” | A serialization-only artifact. It’s the shared vocabulary every producer (bundled engine, external CLI, AI agent) must produce. |

The “normalized format” the user was imagining IS Tier 1 itself. External ETL writes Tier 1 directly. There is no extra intermediate format, because adding one would mean specifying and maintaining a second contract alongside Tier 1.
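To make the distinction concrete, here is a minimal sketch of what an in-memory ParsedData value might look like. Only the three field names ({columns, rows, rowCount}) come from this log. The element types, the `parseCsv` helper, and its naive comma splitting are assumptions standing in for the real parser (PapaParse in Path A).

```typescript
// Field names come from this log; the element types are assumptions.
interface ParsedData {
  columns: string[];
  rows: Record<string, string>[];
  rowCount: number;
}

// Hypothetical stand-in for the real parser (PapaParse / xlsx / JSON.parse).
// Naive on purpose: splits on commas and does not handle quoted fields.
function parseCsv(text: string): ParsedData {
  const lines = text.split(/\r?\n/).filter((line) => line.trim() !== "");
  const [header = "", ...body] = lines;
  const columns = header.split(",").map((c) => c.trim());
  const rows = body.map((line) => {
    const cells = line.split(",");
    const row: Record<string, string> = {};
    columns.forEach((col, i) => {
      row[col] = (cells[i] ?? "").trim();
    });
    return row;
  });
  return { columns, rows, rowCount: rows.length };
}

// The value exists only as a local variable; nothing is serialized to disk.
const parsed = parseCsv("id,title\nAC-1,Access Control Policy\nAC-2,Account Management");
```

Nothing in this sketch writes a file, which is the point: ParsedData is a value that lives for the duration of one import call and is then discarded.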

The three producer paths (re-stated explicitly)

Per the schema-as-primitive commitment, three independent ways to land valid Tier 1 — all first-class, all interchangeable:

┌─ PATH A — Bundled plugin engine (v0.1 ships this) ───────┐
│   User CSV / XLSX / JSON                                 │
│       │                                                  │
│       ▼  PapaParse / xlsx / JSON.parse                   │
│   ParsedData (in-memory, ephemeral)                      │
│   ────────────────────────────────                       │
│   {columns, rows, rowCount}                              │
│   TypeScript interface; lives in RAM during one          │
│   runImport() call. NOT a tier. NOT persisted.           │
│       │                                                  │
│       ▼  + Recipe → render() → Address                   │
│   Tier 1 Markdown ◄──── canonical contract               │
└──────────────────────────────────────────────────────────┘

┌─ PATH B — External CLI producer (deferred to v0.5+) ─────┐
│   User source data (any shape)                           │
│       │                                                  │
│       ▼  External tool (Python + Polars + DuckDB,        │
│          dbt, custom scripts, ChunkyCSV, JSONaut)        │
│   Tier 1 Markdown ◄──── canonical contract               │
│   (writes directly; never goes through ParsedData)       │
└──────────────────────────────────────────────────────────┘

┌─ PATH C — AI agent / MCP server ─────────────────────────┐
│   Anything (browser, API, scraped page, conversation)    │
│       │                                                  │
│       ▼  Whatever the agent does                         │
│   Tier 1 Markdown ◄──── canonical contract               │
└──────────────────────────────────────────────────────────┘

Path A is what v0.1 ships. Paths B and C are first-class architectural citizens — anyone emitting valid Tier 1 is a Crosswalker producer — but Crosswalker’s repo doesn’t ship a Path B or Path C producer in v0.1. That’s deferred to v0.5+ (Ch 23 synthesis).
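As an illustration of what "writes Tier 1 directly" could mean for a Path B or Path C producer, the sketch below builds a Markdown note with YAML frontmatter as a plain string. Everything specific here is an assumption: the `ConceptInput` shape, the `curie` and `title` frontmatter keys, the `concepts/` folder, and the file-naming rule are invented for illustration. The real required keys live in spec/tier1.schema.json, which this log does not reproduce.

```typescript
// Hypothetical input shape for an external producer.
interface ConceptInput {
  curie: string; // illustrative identifier, e.g. "nist:AC-1"
  title: string;
  body: string;
}

interface Tier1File {
  path: string;
  content: string;
}

// A JSON string literal is also a valid YAML 1.2 double-quoted scalar,
// so JSON.stringify is a cheap way to quote values safely.
const yamlString = (value: string): string => JSON.stringify(value);

// Produces the note text directly; there is no intermediate file in between.
function toTier1Note(input: ConceptInput): Tier1File {
  const frontmatter = [
    "---",
    "kind: concept-note",
    `curie: ${yamlString(input.curie)}`,
    `title: ${yamlString(input.title)}`,
    "---",
  ].join("\n");
  // Assumed naming rule: replace characters that are awkward in file names.
  const fileName = input.curie.replace(/[^A-Za-z0-9._-]+/g, "_");
  return {
    path: `concepts/${fileName}.md`,
    content: `${frontmatter}\n\n# ${input.title}\n\n${input.body}\n`,
  };
}

const note = toTier1Note({
  curie: "nist:AC-1",
  title: "Access Control Policy",
  body: "Develop, document, and disseminate an access control policy.",
});
```

A producer written in Python, or an AI agent, would do the equivalent in its own environment. What makes it a Crosswalker producer is that the output validates against the Tier 1 schema, not which tool wrote it.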

Two reasons one would consider an intermediate format, and why each is rejected:

| Reason to add an intermediate | Why it’s rejected |
| --- | --- |
| “External producers need a normalized format the engine can ingest” | The engine doesn’t ingest Tier 1; it produces Tier 1. External producers also produce Tier 1. There’s nothing to ingest. The schema is the contract; the format is the contract. |
| “An intermediate format would let us validate before writing files” | Tier 1 frontmatter validation already happens pre-write inside Path A (since v0.1.4). External producers run their own validation against the same schema. The validator (AJV + spec/tier1.schema.json) is the shared validation surface. |

Adding an intermediate format would double the contract surface — schema for the intermediate + schema for Tier 1 — without buying anything. The schema-as-primitive commitment specifically rejects this.
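The "one schema, many producers" idea can be sketched without any dependencies. The block below is not AJV and not the real schema: `MiniSchema` is a deliberately tiny stand-in that checks only required keys and enum values, and the `curie` key is an assumption. The three `kind` values are the Tier 1 shapes named in this log; treating them as literal frontmatter values is also an assumption.

```typescript
// A tiny subset of JSON Schema semantics: required keys plus enum checks.
interface MiniSchema {
  required: string[];
  enums: Record<string, string[]>;
}

// Hypothetical excerpt; the real contract is spec/tier1.schema.json.
const tier1Excerpt: MiniSchema = {
  required: ["kind", "curie"],
  enums: { kind: ["concept-note", "junction-note", "crosswalk-edge"] },
};

type Frontmatter = Record<string, unknown>;

// Returns human-readable errors; an empty list means the frontmatter is valid.
function validateAgainst(schema: MiniSchema, fm: Frontmatter): string[] {
  const errors: string[] = [];
  for (const key of schema.required) {
    if (fm[key] === undefined) errors.push(`missing required key: ${key}`);
  }
  for (const key of Object.keys(schema.enums)) {
    const allowed = schema.enums[key] ?? [];
    const value = fm[key];
    if (value !== undefined && (typeof value !== "string" || allowed.indexOf(value) === -1)) {
      errors.push(`invalid value for ${key}: ${String(value)}`);
    }
  }
  return errors;
}

// Two different producers, one validator, one schema.
const fromBundledEngine = validateAgainst(tier1Excerpt, { kind: "concept-note", curie: "nist:AC-1" });
const fromExternalCli = validateAgainst(tier1Excerpt, { kind: "staging-row" });
```

Here `fromBundledEngine` is empty and `fromExternalCli` reports two problems. A second, intermediate format would need its own schema and its own version of this function, which is the doubled contract surface described above.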

The Path A pipeline (per-row, fully spelled out)

After v0.1.4, the bundled engine’s per-row processing:

ParsedData.rows[i]                  ◄── one row at a time
        │
        ▼
ConceptIdentity {curie, scope}       ◄── scope = the row itself; CURIE
        │                                derived from row.id or recipe template
        ▼  + Recipe
render(Recipe, ConceptIdentity)      ◄── pure function (Ch 22 §3)
        │                                single coupling point
        ▼
Address {                            ◄── what render() returns
  primary: {path, anchor?},
  wikilinkTarget,
  tags[], aliases[],
  frontmatter (managed)
}
        │
        ▼  + buildProvenance() + body
validateTier1Frontmatter()           ◄── pre-write gate (v0.1.4)
        │                                STRM predicate enforcement here;
        │                                strict mode blocks invalid rows
        ▼
mergeFrontmatter(existing, new)      ◄── user_preserve survives (Ch 22 §8.4)
        │                                review_status, reviewer, custom keys
        ▼
app.vault.create() / .modify()       ◄── Tier 1 file written
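
The render() step in the diagram can be sketched as a pure function. The `ConceptIdentity` and `Address` field names follow the diagram; their concrete types, the `Recipe` fields (`folder`, `titleColumn`, `tag`), and the path-slug rule are assumptions made up for this example and are not the engine's actual recipe format.

```typescript
// Shapes follow the diagram; concrete field types are assumptions.
interface ConceptIdentity {
  curie: string;
  scope: Record<string, string>; // the source row
}

// Hypothetical recipe fields, invented for this sketch.
interface Recipe {
  folder: string;
  titleColumn: string;
  tag: string;
}

interface Address {
  primary: { path: string; anchor?: string };
  wikilinkTarget: string;
  tags: string[];
  aliases: string[];
  frontmatter: Record<string, unknown>;
}

// Pure: no I/O, no clock, no globals. The same inputs give the same Address.
function render(recipe: Recipe, identity: ConceptIdentity): Address {
  const slug = identity.curie.replace(/[^A-Za-z0-9._-]+/g, "_");
  const title = identity.scope[recipe.titleColumn] ?? identity.curie;
  const target = `${recipe.folder}/${slug}`;
  return {
    primary: { path: `${target}.md` },
    wikilinkTarget: target,
    tags: [recipe.tag],
    aliases: title === identity.curie ? [] : [title],
    frontmatter: { curie: identity.curie, title },
  };
}

const recipe: Recipe = { folder: "frameworks/nist", titleColumn: "title", tag: "framework/nist" };
const row = { id: "AC-1", title: "Access Control Policy" };
const address = render(recipe, { curie: `nist:${row.id}`, scope: row });
```

Because the function touches nothing outside its arguments, an external producer could reimplement or copy it without depending on the rest of the plugin.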

Things that look like intermediate formats but aren’t:

  • ConceptIdentity {curie, scope}: a function-call argument; not persisted; not a contract. Just the input shape render() requires.
  • Address: render()’s return type; transient; gets composed with provenance + body and serialized to YAML+Markdown. Not persisted.
  • ParsedData: already covered above. RAM only.

The only persisted thing in this pipeline is the final Tier 1 file. Everything else is in-memory transient state of one runImport() call.
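
A compact sketch of the back half of the loop (validate, merge, write) shows where persistence actually happens. All names here are simplified stand-ins: `validate` is a one-rule placeholder for validateTier1Frontmatter(), the `USER_PRESERVE_KEYS` list is an assumption based on the two keys named in the diagram, and a `Map` stands in for the Obsidian vault.

```typescript
type Frontmatter = Record<string, unknown>;

// Assumed list of user-owned keys; the diagram names these two as examples.
const USER_PRESERVE_KEYS = ["review_status", "reviewer"];

// Placeholder for validateTier1Frontmatter(); the real gate uses AJV + the schema.
function validate(fm: Frontmatter): string[] {
  const title = fm["title"];
  return typeof title === "string" && title !== "" ? [] : ["title is required"];
}

// Incoming managed values win, except for user-preserved keys already on disk.
function mergeFrontmatter(existing: Frontmatter, incoming: Frontmatter): Frontmatter {
  const merged: Frontmatter = { ...existing, ...incoming };
  for (const key of USER_PRESERVE_KEYS) {
    if (key in existing) merged[key] = existing[key];
  }
  return merged;
}

interface Row {
  id: string;
  title: string;
}

// The Map is the only thing that outlives the call; every other value is a local.
function runImport(rows: Row[], vault: Map<string, Frontmatter>, strict: boolean) {
  const written: string[] = [];
  let skipped = 0;
  for (const row of rows) {
    const fm: Frontmatter = { curie: `nist:${row.id}`, title: row.title };
    if (strict && validate(fm).length > 0) {
      skipped += 1; // strict mode blocks the invalid row before any write
      continue;
    }
    const path = `frameworks/nist/${row.id}.md`;
    vault.set(path, mergeFrontmatter(vault.get(path) ?? {}, fm));
    written.push(path);
  }
  return { written, skipped };
}

const vault = new Map<string, Frontmatter>([
  ["frameworks/nist/AC-1.md", { curie: "nist:AC-1", title: "Old title", review_status: "approved" }],
]);
const result = runImport(
  [{ id: "AC-1", title: "Access Control Policy" }, { id: "AC-2", title: "" }],
  vault,
  true,
);
```

After the call, the existing note has the new title but keeps its `review_status`, and the row with the empty title never reaches the vault.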

How the v0.1 milestones map to the architecture

This was the bridging gap — the architecture pages talked about three producer paths, the milestone pages talked about implementation steps, but nothing connected them explicitly. Doing so now:

| Milestone | What it builds | Architectural piece |
| --- | --- | --- |
| v0.1.1 | AJV + spec/*.schema.json wired into runtime | The validator that all three producer paths use |
| v0.1.2 | render(Recipe, ConceptIdentity) → Address | The single coupling point Path A uses; reference implementation Paths B/C may copy |
| v0.1.3 | render() wired through Path A’s per-row loop + _crosswalker provenance + user_preserve merge | Path A’s bundled engine becomes spec-conformant |
| v0.1.4 | kind discriminator + native-recipe entry point + STRM enforcement gate | All 3 Tier 1 shapes (concept-note / junction-note / crosswalk-edge) producible via Path A |
| v0.1.5 (next) | .crosswalker.sqlite sidecar projection from Tier 1 | The Tier 1 → Tier 2 projector (deletable, recoverable) |
| v0.1.7 | Tier 1 → STRM TSV / OSCAL JSON / SSSOM TSV | The round-trip-determinism boundary; round-trip via the schema, not a separate format |

Path B (external Python CLI) is not in v0.1 — it’s deferred to v0.5+ per Ch 23. When it ships, it will write Tier 1 directly via the same spec/tier1.schema.json. The architectural shape doesn’t change; just another producer joins the ecosystem.
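
The "deletable, recoverable" property of the Tier 2 projection can be illustrated with a small sketch. This is not the planned SQLite implementation: an in-memory array stands in for the .crosswalker.sqlite sidecar, a `Map` stands in for the vault, and `readFrontmatter` is a naive reader for flat `key: value` pairs rather than a YAML parser. The `kind` and `curie` keys are assumptions.

```typescript
// Tier 1 stand-in: path -> Markdown text with flat "key: value" frontmatter.
type Vault = Map<string, string>;

interface IndexRow {
  path: string;
  kind: string;
  curie: string;
}

// Naive frontmatter reader for flat pairs only (not a real YAML parser).
function readFrontmatter(markdown: string): Record<string, string> {
  const fields: Record<string, string> = {};
  const match = /^---\n([\s\S]*?)\n---/.exec(markdown);
  if (!match) return fields;
  for (const line of (match[1] ?? "").split("\n")) {
    const idx = line.indexOf(":");
    if (idx > 0) fields[line.slice(0, idx).trim()] = line.slice(idx + 1).trim();
  }
  return fields;
}

// The projection is a pure function of Tier 1, so it can always be rebuilt.
function projectToIndex(vault: Vault): IndexRow[] {
  const rows: IndexRow[] = [];
  vault.forEach((markdown, path) => {
    const fm = readFrontmatter(markdown);
    rows.push({ path, kind: fm["kind"] ?? "", curie: fm["curie"] ?? "" });
  });
  // Sort so the result does not depend on vault iteration order.
  return rows.sort((a, b) => (a.path < b.path ? -1 : a.path > b.path ? 1 : 0));
}

const tier1: Vault = new Map([
  ["frameworks/nist/AC-2.md", "---\nkind: concept-note\ncurie: nist:AC-2\n---\n\n# Account Management\n"],
  ["frameworks/nist/AC-1.md", "---\nkind: concept-note\ncurie: nist:AC-1\n---\n\n# Access Control Policy\n"],
]);

const firstBuild = projectToIndex(tier1);
// Throwing the sidecar away loses nothing: a rebuild from Tier 1 alone matches.
const rebuilt = projectToIndex(tier1);
const identical = JSON.stringify(firstBuild) === JSON.stringify(rebuilt);
```

The design consequence is that Tier 2 never needs its own backup or migration story, since Tier 1 remains the only source of truth.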

| Surface | Change |
| --- | --- |
| concepts/etl-and-import | New section “The three producer paths — and what’s NOT a tier” inserted after the schema-as-primitive reframe; new subsection “How Path A works inside the plugin (v0.1 implementation)” with the per-row pipeline diagram; Related section beefed up with all 6 milestones + this log + Ch 23 synthesis + research deliverables |
| Future user-facing surfaces | Avoid using ParsedData in any user-facing prose. It’s an implementation term internal to Path A. User docs talk about “your CSV/XLSX file” and “Tier 1 vault”, never about an intermediate format |

The mental model “ETL produces a normalized format the engine consumes” is the standard mental model in industry. Most ETL tools work that way:

  • dbt: source → staging models → marts → reporting
  • Airbyte: source → normalized JSON → destination
  • Singer: tap → JSON messages (SCHEMA / RECORD / STATE, per the Singer spec) → target

Crosswalker deliberately doesn’t. The schema-as-primitive commitment skips the intermediate-format step. This is unusual and the doc surface should accommodate readers who arrive with the standard mental model — calling out explicitly “we don’t have an intermediate format; here’s why” rather than letting them assume one exists and feel confused when it doesn’t appear.

The user’s complementary project, folder-tag-sync, primed this expectation: there, bidirectional bijectivity requires a complex named-slot path-template language. In folder-tag-sync, the templating IS the contract because each side has to be recoverable from the other. In Crosswalker, the templating is a one-way projection mechanism inside Path A; the contract is Tier 1 itself. Different problem.
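
The difference between the two problems can be shown with a toy named-slot template. The `{name}` syntax and both functions below are hypothetical and are not folder-tag-sync's actual template language. The sketch only illustrates the round-trip property that a bijective design has to guarantee and that a one-way projection does not.

```typescript
type Slots = Record<string, string>;

// Forward direction: fill named slots. This is all a one-way projection needs.
function renderTemplate(template: string, slots: Slots): string {
  return template.replace(/\{(\w+)\}/g, (_m, name: string) => slots[name] ?? "");
}

// Inverse direction: recover the slots from a path. Only a bijective design
// needs this, and it constrains what the template language may express.
function parseTemplate(template: string, path: string): Slots | null {
  const names: string[] = [];
  const pattern = template
    .split(/(\{\w+\})/)
    .map((part) => {
      const slot = /^\{(\w+)\}$/.exec(part);
      if (slot) {
        names.push(slot[1] ?? "");
        return "([^/]+)"; // a slot matches one path segment
      }
      return part.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"); // escape literal text
    })
    .join("");
  const match = new RegExp(`^${pattern}$`).exec(path);
  if (!match) return null;
  const slots: Slots = {};
  names.forEach((name, i) => {
    slots[name] = match[i + 1] ?? "";
  });
  return slots;
}

const template = "{framework}/{id}.md";
const original: Slots = { framework: "nist", id: "AC-1" };
const path = renderTemplate(template, original);
const recovered = parseTemplate(template, path);
```

In folder-tag-sync the equivalent of `parseTemplate(renderTemplate(x))` returning `x` has to hold for every input, which is where the template complexity comes from. Path A only ever calls the forward direction, so it carries no such obligation.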

Concept pages updated:

  • ETL and import — new section with the three producer paths + ParsedData clarification + per-row pipeline diagram + comprehensive Related section bridging to milestones

Implementation milestones (Path A):

  • v0.1.1 — validation foundation
  • v0.1.2 — render() v1
  • v0.1.3 — engine integration
  • v0.1.4 — kind discriminator + STRM enforcement
  • v0.1.5 — Tier 2 sidecar (next milestone)
  • v0.1.7 — round-trip exporters
