ETL pipeline clarification — ParsedData is not a tier; Tier 1 is the only intermediate format
What triggered this
Mid-v0.1.5 planning (right after v0.1.4 shipped), the user asked about the import process and sketched their mental model:
ontology → transformed into → some probably nested data format with schema or rules that work + ImportRecipe (config file → translates into vault primitives)
They had a related project (folder-tag-sync) where the templating language is complex because it requires bidirectional bijectivity, and that work was bleeding into their mental model of Crosswalker. The specific confusion: they thought Crosswalker had a static intermediate file format between source and Tier 1 — a “normalized format Crosswalker excels at” that external ETL would produce and the bundled engine would consume.
A diagram I drew in chat introduced the term ParsedData and labeled it as an implementation detail, but the labeling was insufficient — it read like a tier in the architecture. This log clears that up + bridges the architectural reframe (schema-as-primitive, Ch 22 + Ch 23 + Ch 24) to the v0.1 implementation milestones.
The clarification
| Concept | What it IS | What it ISN’T |
|---|---|---|
| ParsedData | An in-memory TypeScript interface ({columns, rows, rowCount}) used inside Path A’s bundled engine during a single import session. Implementation detail. | A persisted intermediate file format. NOT a tier. NOT something external producers consume or emit. |
| Tier 1 | The canonical Markdown vault format on disk, conforming to spec/tier1.schema.json. The load-bearing contract. The “normalized format Crosswalker excels at.” | A serialization-only artifact — it’s the shared vocabulary every producer (bundled engine, external CLI, AI agent) must produce. |
The “normalized format” the user was imagining IS Tier 1 itself. External ETL writes Tier 1 directly. There’s no extra intermediate format because adding one would also require maintaining one.
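To make the ParsedData row concrete, here is a minimal TypeScript sketch of such an in-memory shape. Only the three field names (columns, rows, rowCount) come from the table above; the `Cell` type, the row representation, and the `toParsedData` helper are assumptions for illustration, not the engine's actual code.

```typescript
// Assumed cell type; the real engine may type cells differently.
type Cell = string | number | boolean | null;

// Field names taken from the table above: {columns, rows, rowCount}.
interface ParsedData {
  columns: string[];
  rows: Record<string, Cell>[];
  rowCount: number;
}

// Hypothetical helper: turns a header plus raw records (already read from a
// CSV/XLSX file) into the in-memory shape. Nothing is written to disk.
function toParsedData(header: string[], records: Cell[][]): ParsedData {
  const rows = records.map((record) => {
    const row: Record<string, Cell> = {};
    header.forEach((column, i) => {
      // Short records are padded with null rather than undefined.
      row[column] = record[i] ?? null;
    });
    return row;
  });
  return { columns: header, rows, rowCount: rows.length };
}

const parsed = toParsedData(
  ["id", "title"],
  [
    ["AC-1", "Access control policy"],
    ["AC-2", "Account management"],
  ],
);
```

The object lives only for the duration of one import call, which is why it carries no schema and is not something an external producer would ever see.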
The three producer paths (re-stated explicitly)
Per the schema-as-primitive commitment, there are three independent ways to land valid Tier 1. All are first-class and all are interchangeable.
Path A is what v0.1 ships. Paths B and C are first-class architectural citizens — anyone emitting valid Tier 1 is a Crosswalker producer — but Crosswalker’s repo doesn’t ship a Path B or Path C producer in v0.1. That’s deferred to v0.5+ (Ch 23 synthesis).
Why no static intermediate format
Two reasons one would consider an intermediate format, and why each is rejected:
| Reason to add an intermediate | Why it’s rejected |
|---|---|
| “External producers need a normalized format the engine can ingest” | The engine doesn’t ingest Tier 1; it produces Tier 1. External producers also produce Tier 1. There’s nothing to ingest. The schema is the contract; the format is the contract. |
| “An intermediate format would let us validate before writing files” | Tier 1 frontmatter validation already happens pre-write inside Path A (since v0.1.4). External producers run their own validation against the same schema. The validator (AJV + spec/tier1.schema.json) is the shared validation surface. |
Adding an intermediate format would double the contract surface — schema for the intermediate + schema for Tier 1 — without buying anything. The schema-as-primitive commitment specifically rejects this.
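The "shared validation surface" idea can be sketched as a single validator function that every producer calls before writing. This is an illustration under stated assumptions: the stand-in validator below only checks for a known kind value and a _crosswalker block, whereas the real rules live in spec/tier1.schema.json and would be enforced by an AJV-compiled validator. The names `Tier1Validator`, `validateTier1`, and `writeIfValid` are hypothetical.

```typescript
type Frontmatter = Record<string, unknown>;

interface ValidationResult {
  ok: boolean;
  errors: string[];
}

type Tier1Validator = (frontmatter: Frontmatter) => ValidationResult;

// The three Tier 1 shapes named in the milestone table.
const KINDS = ["concept-note", "junction-note", "crosswalk-edge"];

// Stand-in for a schema-compiled validator (assumption: the real checks come
// from spec/tier1.schema.json, not from this hand-written function).
const validateTier1: Tier1Validator = (fm) => {
  const errors: string[] = [];
  const kind = fm["kind"];
  if (typeof kind !== "string" || KINDS.indexOf(kind) === -1) {
    errors.push("kind must be one of: " + KINDS.join(", "));
  }
  const provenance = fm["_crosswalker"];
  if (typeof provenance !== "object" || provenance === null) {
    errors.push("_crosswalker provenance block is required");
  }
  return { ok: errors.length === 0, errors };
};

// Any producer (bundled engine or external tool) gates its write on the same
// validator, so no second schema is needed for an intermediate format.
function writeIfValid(
  fm: Frontmatter,
  validate: Tier1Validator,
  write: (fm: Frontmatter) => void,
): ValidationResult {
  const result = validate(fm);
  if (result.ok) {
    write(fm);
  }
  return result;
}

const written: Frontmatter[] = [];
const sink = (fm: Frontmatter): void => {
  written.push(fm);
};

const good = writeIfValid(
  { kind: "concept-note", _crosswalker: { source: "example.csv" } },
  validateTier1,
  sink,
);
const bad = writeIfValid({ kind: "staging-row" }, validateTier1, sink);
```

Because the check runs on the frontmatter object that is about to be serialized, validation happens pre-write without any persisted staging artifact.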
The Path A pipeline (per-row, fully spelled out)
After v0.1.4, the bundled engine’s per-row processing:
Things that look like intermediate formats but aren’t:
- ConceptIdentity {curie, scope}: a function-call argument; not persisted; not a contract. Just the input shape render() requires.
- Address: render()’s return type; transient; gets composed with provenance + body and serialized to YAML+Markdown. Not persisted.
- ParsedData: already covered above. RAM only.
The only persisted thing in this pipeline is the final Tier 1 file. Everything else is in-memory transient state of one runImport() call.
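As a rough illustration of the transient values described above, the following TypeScript sketch walks one row through identity construction, render(), and serialization. Only the ConceptIdentity fields (curie, scope) and the render(Recipe, ConceptIdentity) → Address signature come from this log; the `Recipe` and `Address` fields, the frontmatter keys other than kind and _crosswalker, and the `processRow` helper are assumptions made for the example.

```typescript
interface ConceptIdentity {
  curie: string;
  scope: string;
}

// Assumed shape: the log only says Address is render()'s return type.
interface Address {
  path: string;
}

// Assumed shape: a real recipe conforms to spec/recipe.schema.json.
interface Recipe {
  prefix: string;
  scope: string;
  folder: string;
  idColumn: string;
  bodyColumn: string;
}

type Row = Record<string, string>;

// One-way projection from identity to a vault location (illustrative rule).
function render(recipe: Recipe, identity: ConceptIdentity): Address {
  const colon = identity.curie.indexOf(":");
  const localId = colon === -1 ? identity.curie : identity.curie.slice(colon + 1);
  return { path: recipe.folder + "/" + identity.scope + "/" + localId + ".md" };
}

// Hypothetical per-row step: everything here is in-memory until the caller
// writes `content` to `address.path` as the final Tier 1 file.
function processRow(
  row: Row,
  recipe: Recipe,
  sourceFile: string,
): { address: Address; content: string } {
  const identity: ConceptIdentity = {
    curie: recipe.prefix + ":" + row[recipe.idColumn],
    scope: recipe.scope,
  };
  const address = render(recipe, identity);
  const content = [
    "---",
    'curie: "' + identity.curie + '"',
    "kind: concept-note",
    "_crosswalker:",
    '  source: "' + sourceFile + '"',
    "---",
    "",
    row[recipe.bodyColumn],
    "",
  ].join("\n");
  return { address, content };
}

const exampleRecipe: Recipe = {
  prefix: "nist",
  scope: "nist-800-53",
  folder: "Frameworks",
  idColumn: "id",
  bodyColumn: "text",
};

const result = processRow(
  { id: "AC-1", text: "Access control policy" },
  exampleRecipe,
  "controls.csv",
);
```

The identity and the address are plain function arguments and return values here, which matches the point that neither is persisted or treated as a contract.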
How the v0.1 milestones map to the architecture
This was the bridging gap: the architecture pages talked about three producer paths and the milestone pages talked about implementation steps, but nothing connected them explicitly. The table below does so:
| Milestone | What it builds | Architectural piece |
|---|---|---|
| v0.1.1 | AJV + spec/*.schema.json wired into runtime | The validator that all three producer paths use |
| v0.1.2 | render(Recipe, ConceptIdentity) → Address | The single coupling point Path A uses; reference implementation Paths B/C may copy |
| v0.1.3 | render() wired through Path A’s per-row loop + _crosswalker provenance + user_preserve merge | Path A’s bundled engine becomes spec-conformant |
| v0.1.4 | kind discriminator + native-recipe entry point + STRM enforcement gate | All 3 Tier 1 shapes (concept-note / junction-note / crosswalk-edge) producible via Path A |
| v0.1.5 (next) | .crosswalker.sqlite sidecar projection from Tier 1 | The Tier 1 → Tier 2 projector (deletable, recoverable) |
| v0.1.7 | Tier 1 → STRM TSV / OSCAL JSON / SSSOM TSV | The round-trip-determinism boundary; round-trip via the schema, not a separate format |
Path B (external Python CLI) is not in v0.1 — it’s deferred to v0.5+ per Ch 23. When it ships, it will write Tier 1 directly via the same spec/tier1.schema.json. The architectural shape doesn’t change; just another producer joins the ecosystem.
What this clarification changes
| Surface | Change |
|---|---|
| concepts/etl-and-import | New section “The three producer paths — and what’s NOT a tier” inserted after the schema-as-primitive reframe; new subsection “How Path A works inside the plugin (v0.1 implementation)” with the per-row pipeline diagram; Related section beefed up with all 6 milestones + this log + Ch 23 synthesis + research deliverables |
| Future user-facing surfaces | Avoid using ParsedData in any user-facing prose. It’s an implementation term internal to Path A. User docs talk about “your CSV/XLSX file” and “Tier 1 vault” — never about an intermediate format |
Why the user’s confusion was reasonable
The mental model “ETL produces a normalized format the engine consumes” is the standard one in industry. Most ETL tools work that way:
- dbt: source → staging models → marts → reporting
- Airbyte: source → normalized JSON → destination
- Singer: tap → JSON messages defined by the Singer spec (SCHEMA, RECORD, STATE) → target
Crosswalker deliberately doesn’t. The schema-as-primitive commitment skips the intermediate-format step. This is unusual and the doc surface should accommodate readers who arrive with the standard mental model — calling out explicitly “we don’t have an intermediate format; here’s why” rather than letting them assume one exists and feel confused when it doesn’t appear.
The user’s complementary project, folder-tag-sync, where bidirectional bijectivity requires a complex named-slot path-template language, primed this expectation. In folder-tag-sync, the templating IS the contract because each side has to be recoverable from the other. In Crosswalker, the templating is a one-way projection mechanism inside Path A; the contract is Tier 1 itself. Different problem.
Related
Concept pages updated:
- ETL and import — new section with the three producer paths + ParsedData clarification + per-row pipeline diagram + comprehensive Related section bridging to milestones
Concept pages cross-referenced:
- Hierarchy primitives — the 5 mechanisms render() dispatches over
- Terminology — Tier 1, recipe, render(), CURIE, ParsedData (Path A only), provenance
- What makes Crosswalker unique — schema-as-primitive is the differentiator
- Embedded vs server substrates
- Ontology evolution
Agent context:
Implementation milestones (Path A):
- v0.1.1 — validation foundation
- v0.1.2 — render() v1
- v0.1.3 — engine integration
- v0.1.4 — kind discriminator + STRM enforcement
- v0.1.5 — Tier 2 sidecar (next milestone)
- v0.1.7 — round-trip exporters
Design decisions (synthesis logs):
- 2026-05-04 import-engine design log — six architectural commitments
- Ch 22 synthesis (target-structure expressivity) — recipe grammar + render() signature
- Ch 23 synthesis (bundle/engine/language) — Path A v0.1 / Path C v0.5+; runtime-agnostic recipe schema
- Ch 24 synthesis (Tier 2 substrate) — sqlite-wasm
- Ch 20 synthesis (formal transformation algebra)
Delivery logs:
Research deliverables:
Spec files:
- spec/tier1.schema.json: the contract
- spec/recipe.schema.json: recipe shape Path A consumes