
ETL and import — schema as primitive, ETL as convenience

Updated Jun 1, 2026

The import side is the hardest part of Crosswalker. Everything a user wants to do (query, crosswalk, evidence, version, share) starts from whatever data they already have, in whatever shape it arrives, and depends on that data landing in the Tier 1 schema. Get the import boundary right and the rest of the architecture works. Get it wrong and no amount of cleverness downstream rescues it.

This page frames how Crosswalker thinks about that boundary. For the system-wide picture (all six layers, not just import), see the system architecture page.

The reframe — schema is the primitive, ETL is convenience

Most ingestion tools start with the engine and treat the output schema as a configuration knob. Crosswalker inverts that:

The Tier 1 schema is the primitive. Crosswalker is, before anything else, an ingestion target — a precise, machine-readable contract that says what canonical Markdown + frontmatter + folder layout + wikilinks looks like. The engine that produces conforming output is convenience. Anyone can produce conforming output any way they like.

Architectural precedents for this stance: HTML, JSON Schema, OpenAPI, SBOM (SPDX/CycloneDX), CSVW, Markdown itself. Each is a receiving format — the spec is the load-bearing artifact; the producer ecosystem grew around it.

Translated to Crosswalker:

Layer	Status
Tier 1 schema (the contract)	Load-bearing primitive. Machine-readable JSON Schema.
Bundled ETL engine (the convenience)	Optional implementation that produces Tier 1 from common source shapes.
External producers (custom code, dbt, Python, scrapers, MCP servers)	First-class citizens. Anything that emits valid Tier 1 is welcome.
Community marketplace (pre-transformed bundles)	Once an ontology is shaped to Tier 1, it stays shaped. Communities share `.zip`s of Tier 1 directories.

This reframe collapses several open architectural questions (build vs buy, DSL choice, in-plugin vs external runtime, recipe location) into one commitment. The engine becomes swappable; the schema does not.

The two-mode architecture — bundled projector or direct emission

Updated 2026-05-05 to clarify how external ETL tools (ChunkyCSV, JSONaut, Polars/DuckDB scripts, dbt, etc.) compose with the bundled engine. See the 2026-05-05 two-mode architecture decision log for the full rationale.

The bundled engine has one job: take structured rows + a recipe, and emit Tier 1 Markdown. That’s it. Any producer — including users with no external tooling, ChunkyCSV pipelines, dbt projects, AI agents, MCP servers — chooses between two entry points:

Mode	Entry point	What the producer does	Streaming responsibility
Mode 1 — Bundled projector	Hand structured rows (CSV/JSON/XML/XLSX) + recipe to the bundled engine	Produces a structured input the engine knows how to iterate	Bundled engine streams per-row through `render()` → write → discard. Producer can also stream upstream (ChunkyCSV → CSV file) — both layers stream.
Mode 2 — Direct emission	Bypass the bundled engine; write Tier 1 Markdown files into the vault	Produces Tier 1 Markdown end-to-end	Producer’s responsibility entirely (typical: per-concept Markdown emission already streams naturally)

Both modes are first-class architectural citizens per the schema-as-primitive commitment. The schema (Tier 1) is the contract; the engine is convenience.

                            ┌─────────────────────┐
   USER SOURCE DATA  ───►   │  Producer of choice │
   (any shape, any size)    │  (external or inline) │
                            └──────────┬──────────┘
                                       │
                ┌──────────────────────┴──────────────────────┐
                │                                             │
                ▼                                             ▼
    ┌─ MODE 1 — Bundled projector ─────┐    ┌─ MODE 2 — Direct emission ─────┐
    │  Producer emits:                  │    │  Producer emits:                │
    │   structured rows                 │    │   Tier 1 Markdown directly      │
    │   (CSV / JSON / XML / XLSX)       │    │                                 │
    │                                   │    │  Examples:                      │
    │  Examples:                        │    │   • AI agent writes per-concept │
    │   • User uploads raw CSV in       │    │     .md files                   │
    │     wizard (engine parses)        │    │   • Python script renders YAML  │
    │   • ChunkyCSV streams huge XLSX → │    │     + Markdown end-to-end       │
    │     emits cleaned CSV             │    │   • Pre-built vault bundle      │
    │   • JSONaut transforms messy JSON │    │     downloaded from marketplace │
    │     → emits normalized JSON       │    │                                 │
    │   • dbt models project SQL → CSV  │    │  Bundled engine: not invoked    │
    │                                   │    │                                 │
    │  Bundled engine then:             │    │                                 │
    │    parses input (streaming) →     │    │                                 │
    │    iterates rows →                │    │                                 │
    │    render() per row →             │    │                                 │
    │    writes Tier 1 file →           │    │                                 │
    │    discards row from RAM          │    │                                 │
    └──────────────────┬────────────────┘    └─────────────────┬───────────────┘
                       │                                       │
                       └─────────────────┬─────────────────────┘
                                         │
                                         ▼
                        ┌────────────────────────────────────┐
                        │         TIER 1 VAULT               │
                        │  Markdown + YAML frontmatter        │
                        │  conforming to                      │
                        │  spec/tier1.schema.json             │
                        │  ────────────────────────           │
                        │  THE LOAD-BEARING CONTRACT          │
                        └────────────────────────────────────┘

Why both modes matter

Mode 1 lets ChunkyCSV / JSONaut / dbt / Polars scripts compose naturally. They already produce CSV/JSON. Crosswalker just consumes that. The user doesn’t have to teach their existing ETL tools how to emit Markdown — they keep doing what they’re good at, and the bundled engine handles the projection layer.

Mode 2 lets AI agents and marketplace bundle publishers emit Tier 1 end-to-end. When an agent extracts an ontology from a corpus, it produces concepts directly; there is no natural intermediate “rows” representation. Skipping the bundled engine is the right call.

Streaming is at the engine boundary, not before it

The bundled engine is streaming-by-design (per v0.1.4.5 streaming refactor): it accepts rows as an iterator/async-iterator, calls render() per row, writes the file, and discards the row before reading the next one. The full source dataset never exists in RAM at once.

This is why Mode 1 + ChunkyCSV / JSONaut composes well: ChunkyCSV streams from a multi-gigabyte source and emits a smaller CSV that is itself consumed as a stream; the bundled engine streams through that CSV without accumulating rows; and the resulting Tier 1 vault is materialized on disk file by file. The pipeline streams end to end.
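The per-row loop described above can be sketched roughly as follows. This is a minimal illustration, not the plugin's actual code: `projectRows`, `renderRow`, `writeFile`, and `exampleRows` are hypothetical names, and the real engine's `render()` takes a recipe and an identity, not a bare row.

```typescript
// Hypothetical sketch of the per-row streaming loop; not the plugin's actual API.
type Row = Record<string, unknown>;

interface RenderedNote {
  path: string;
  content: string;
}

// Pulls one row at a time, renders it, writes it, and keeps no reference to it.
async function projectRows(
  rows: Iterable<Row> | AsyncIterable<Row>,
  renderRow: (row: Row) => RenderedNote,
  writeFile: (note: RenderedNote) => Promise<void>,
): Promise<number> {
  let written = 0;
  // `for await` accepts both sync and async iterables.
  for await (const row of rows) {
    const note = renderRow(row);
    await writeFile(note);
    written += 1;
    // `row` and `note` go out of scope here, so they can be garbage-collected
    // before the next row is pulled from the source.
  }
  return written;
}

// Stand-in for an upstream streaming producer (for example a CSV parser).
async function* exampleRows(): AsyncGenerator<Row> {
  yield { control_id: "AC-1", title: "Policy and Procedures" };
  yield { control_id: "AC-2", title: "Account Management" };
}
```

Because the loop only ever holds one row, memory use stays flat regardless of source size, provided the upstream iterator is itself lazy.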

What “structured” actually means — the Mode 1 input contract

A common follow-up question once the two-mode architecture clicks: how does the bundled engine know an input is “clean enough”? Is there an encoding requirement? A specific JSON shape? An XML schema?

Honest answer: the contract is shape-level, not format-level. The engine consumes an iterable of records. Each record is {columnName: value}. Columns are referenced by name in the recipe’s template strings ({control_id}, {family.title}). That’s the entire contract.

The bundled engine doesn’t validate the input against a “Tier 0.5 schema” because no Tier 0.5 schema exists (2026-05-05 two-mode architecture decision §“No Tier 0.5” confirms this). The engine just calls render(recipe, identityFromRow) on each record and writes the resulting Tier 1 file. Per-row failures (missing template variable, malformed encoding, validation rejection) become per-row errors; the import continues.

Shape requirements

// The engine consumes this:
interface RowSource {
    columns: string[];                              // header — column names referenced by recipe
    rows: Iterable<Row> | AsyncIterable<Row>;       // the records
    rowCount?: number;                              // optional; -1 or omitted if streaming
}

interface Row {
    [columnName: string]: string | number | boolean | null | undefined | object;
    // Nested values OK — recipe templates use dotted access: {obj.field.subfield}
}

Required:

Stable columns across records: every row should carry the same set of keys (a missing key reads as undefined and renders as an empty value)
Recipe-referenced columns are present: if a template uses {control_id}, the source must have a control_id column
Values addressable by recipe templates: flat strings are simplest; nested objects work via dotted access; arrays need iterator declarations in the recipe
Encoding is UTF-8: every text format (CSV, JSON, XML, XLSX-as-CSV-export) should arrive as UTF-8

That’s it. No strict schema validation upstream of the engine.
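To make the shape-level contract concrete, here is a small sketch of two checks a producer could run before handing rows to the engine: which columns a template references, and how dotted access resolves against a nested row. The helper names (`referencedColumns`, `missingColumns`, `resolvePath`) are assumptions for illustration and are not part of the engine's API.

```typescript
// Hypothetical helpers illustrating the shape-level contract; names are assumptions.
type Row = Record<string, unknown>;

// Collects the top-level column names referenced by {placeholders} in a template.
function referencedColumns(template: string): string[] {
  const names: string[] = [];
  const pattern = /\{([A-Za-z_][\w.]*)\}/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(template)) !== null) {
    const column = match[1].split(".")[0]; // "family.title" -> "family"
    if (!names.includes(column)) names.push(column);
  }
  return names;
}

// Returns the referenced columns that the source header does not provide.
function missingColumns(template: string, columns: string[]): string[] {
  return referencedColumns(template).filter((name) => !columns.includes(name));
}

// Resolves dotted access such as "family.title" against a (possibly nested) row.
function resolvePath(row: Row, path: string): unknown {
  let current: unknown = row;
  for (const key of path.split(".")) {
    if (current === null || typeof current !== "object") return undefined;
    current = (current as Record<string, unknown>)[key];
  }
  return current;
}
```

For a template like `{family.title}/{control_id}.md` and a header of `["control_id"]`, `missingColumns` would report `family`, which is the kind of mismatch that otherwise surfaces as per-row errors during import.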

Format-by-format guidance for Mode 1 producers

Format	When to use	What “clean enough” looks like
CSV	Tabular, flat, common ETL output, Excel exports, ChunkyCSV pipes	UTF-8; first row is header; consistent delimiter (PapaParse auto-detects `,`, `;`, `\t`, `|`); RFC 4180 quoting for embedded commas/quotes; one logical record per row; no merged cells
JSON (array of records)	Structured records, optionally nested, JSONaut output, dbt JSON exports	UTF-8; top-level array of objects; consistent record shape; nested objects addressable via `{a.b.c}` templates
JSON (single object with array property)	OSCAL bundles, deeply nested ontologies	UTF-8; recipe declares the iterator path (e.g., `source.iterator: $.catalog.controls[*]` — JSONata syntax in v0.1+)
XLSX	Excel-native sources (NIST 800-53 catalog as published)	First row is header; one sheet (recipe specifies which); flat cells — no merged cells, no formula references in cell values, no embedded sub-tables
XML / RDF / OWL	RDF/OWL ontologies, ISO XML feeds	Not v0.1. Pre-convert to CSV or JSON via external tooling (the user’s SEACOW / JSONaut handle this). Native XML/RDF parsing deferred to v0.2+.

Is JSON “the way to go” over CSV?

Depends on shape. CSV is fine — and often better — when:

The source is naturally tabular (compliance frameworks: control_id, family, title, baseline, … fit a CSV grid perfectly)
You’re already producing CSV via Excel export, ChunkyCSV pipes, or SQL-warehouse export
Streaming matters (CSV streams trivially line-by-line; JSON streaming requires stream-json or similar)
File size is moderate-to-large (CSV is denser than JSON for tabular data — no key repetition per record)

JSON is better when:

Records have nested objects ({ control: { id, title, subcontrols: [...] } })
One source contains multiple record types (OSCAL bundle = catalog + controls + groups + parameters + back-matter — recipes pick which to iterate)
Values are naturally arrays (tags, parent CURIEs, related-control lists)
The source’s authoritative published format is JSON (OSCAL, JSON-LD ontologies, MITRE STIX exports)

For NIST 800-53 r5: CSV is fine (columns: control_id, family, title, baseline, …). For NIST OSCAL JSON catalogs: JSON is native. For ISO 27001 controls: CSV if you have a tabular export; XLSX otherwise. For MITRE ATT&CK: JSON (STIX) is native; CSV exports also exist.

The recipe references columns by name regardless of format. The bundled engine has parsers for each. Format choice is a producer-side decision; the engine adapts.

Why no upfront schema validation?

Two reasons:

Adding a schema for the input format would create a Tier 0.5 contract that producers have to satisfy. That violates the schema-as-primitive commitment (Tier 1 is the only contract). External producers would have to learn two schemas instead of one.
Per-row error handling is cleaner than upfront rejection. If a 50,000-row CSV has 3 malformed rows, the engine processes 49,997 rows and reports the 3 failures. An upfront schema would either reject the whole file or silently filter — both worse outcomes.

The schema validation that does happen is at the output boundary: every row’s rendered Tier 1 frontmatter is validated against spec/tier1.schema.json before being written (v0.1.4 strict-mode validation). Bad rows produce errors; good rows produce files. The contract is enforced where it matters.
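The "bad rows produce errors; good rows produce files" behavior can be sketched as a loop that catches failures per row and keeps going. This is an illustrative sketch under assumptions: `importRows`, `RowError`, and `ImportReport` are hypothetical names, and the `validate` callback stands in for the real check against spec/tier1.schema.json.

```typescript
// Hypothetical sketch of per-row error handling; `render` and `validate` are
// placeholders for the engine's real render and Tier 1 validation steps.
type Row = Record<string, unknown>;

interface RowError {
  rowIndex: number;
  message: string;
}

interface ImportReport {
  written: number;
  errors: RowError[];
}

function importRows(
  rows: Iterable<Row>,
  render: (row: Row) => Record<string, unknown>,
  validate: (frontmatter: Record<string, unknown>) => string[],
  write: (frontmatter: Record<string, unknown>) => void,
): ImportReport {
  const report: ImportReport = { written: 0, errors: [] };
  let rowIndex = 0;
  for (const row of rows) {
    try {
      const frontmatter = render(row);
      // Output-boundary gate: nothing is written unless validation passes.
      const problems = validate(frontmatter);
      if (problems.length > 0) throw new Error(problems.join("; "));
      write(frontmatter);
      report.written += 1;
    } catch (err) {
      // A failing row is recorded and skipped; the import continues.
      const message = err instanceof Error ? err.message : String(err);
      report.errors.push({ rowIndex, message });
    }
    rowIndex += 1;
  }
  return report;
}
```

The report gives the caller both counts, so a 50,000-row import with 3 bad rows can surface exactly those 3 row indices.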

Why ChunkyCSV / JSONaut are the right partners for messy sources

The user’s existing tools fit this gap precisely. They specialize in input cleanup — turning a multi-GB messy XLSX into a streaming UTF-8 CSV with stable columns and proper RFC 4180 quoting. The bundled engine specializes in projection — turning structured rows into Tier 1 vault notes per a recipe. Each layer does what it’s good at; the boundary between them is “an iterable of records with stable columns.” That’s the entire contract — and it’s intentionally narrow so external tools don’t need Crosswalker-specific knowledge.

Transform engine depth — where the in-plugin work stops

A natural follow-on question once Mode 1’s input contract is clear: how much of the messy-source-cleanup problem does Crosswalker try to solve in-plugin? Decided 2026-05-05 (see transform-engine-depth and input-formats decision log + Ch 26):

NARROW                                                                 BROAD
  │                                                                       │
  ▼                                                                       ▼
v0.1 closed     v0.2 wizard     v0.3 inline       v0.5 port jsonaut/    v1.0 full
filter set      adds basic      JSONata expr     chunkycsv as opt-in    in-plugin
(7 filters,     transforms      sub-language     transform engine       transform IDE
shipped)        (rename,        (already                                 (REJECTED —
                trim,           committed in                             scope drift;
                regex,          Ch 23 §6;                                could exceed
                split)          just wires)                              Crosswalker
                                                                         itself)

                                       ▲                                    ▲
                                       │                                    │
                                  CHOSEN                              REJECTED
                                  STOPPING                            (likely
                                  POINT                               permanently)

Phase	Wizard offers	Recipe author authors	External tool handles
v0.1 (shipped)	Column-role config	Closed 7-filter templates	Everything else
v0.2	+ Column rename, value trim, regex extract, simple split	Same templates	Conditional logic, joins, lookups, flat-to-tree, fuzzy matching
v0.3	+ JSONata expression cells for advanced users	+ JSONata 2.x expressions inline (`{(baseline = "HIGH") ? "high/" : "mod/"}{control_id}.md`)	Multi-source joins, time-series, complex pipelines

Why we don’t port JSONaut / ChunkyCSV: JSONata 2.x is a TS-native, well-maintained library covering ~80% of JSONaut’s declarative-transformation feature set. JSONata is the Ch 23 §6 commitment. ChunkyCSV’s CSV-streaming + multi-GB-cleanup features stay external — that’s literally what ChunkyCSV is for, and forcing it into the plugin would be redundant. Both stay first-class Mode 1 feeders.

Why we don’t build a transform IDE: an in-plugin transform IDE would try to bundle live preview, debugging, profiling, output diff, joins UI, conditional logic UI into one experience. Each piece has a simpler v0.2/v0.3/external alternative (wizard preview, JSONata playground, generation-engine error reports, git diff, JSONata lookup{} syntax). Building the IDE is huge scope drift — could exceed Crosswalker itself.

v0.1 → v0.3+ input format roster

The bundled engine’s input formats and the rationale per format (decided 2026-05-05):

Format	Phase	Why	Stream?
CSV	v0.1 (shipped)	Universal; PapaParse handles encoding + delimiter auto-detect; ChunkyCSV’s natural output	✅ via PapaParse step + v0.1.4.5 streaming refactor
JSONL (newline-delimited JSON)	v0.2	Stream-friendly (line-by-line); native types + nesting; produced by BigQuery, Spark, Databricks, dbt, JSONaut natively	✅ trivial — split on `\n`, JSON.parse per line
JSON with iterator path	v0.2	OSCAL bundles, deeply-nested ontologies; recipe declares `source.iterator: $.catalog.controls[*]` (JSONata path)	✅ via stream-json
XLSX	v0.3+ (already partial)	Excel-native sources (NIST 800-53 r5 catalog ships as XLSX)	⚠ sheet-by-sheet via xlsx package
XML / RDF / OWL	v0.3+ if demand justifies	RDF/OWL ontologies, ISO XML feeds	⚠ requires sax-style streaming; defer to v0.3+

JSONL is the v0.2 priority because it is a genuine midpoint between “raw user CSV” and “fully-cleaned-input-ready-for-the-bundled-engine”:

Format	Streamable	Schema fidelity	ETL ecosystem	Complexity to integrate
CSV	✅ Trivial	❌ Strings only	✅ Universal	Done
JSONL	✅ Trivial	✅ Native types + nesting	✅ Common (modern data warehouses)	Low (~50 LOC)
Plain JSON array	❌ Whole-array parse	✅ Native	✅ Common	Medium
JSON with iterator	✅ via stream-json	✅ Native + multi-iterator	Limited	Medium
XLSX	⚠ Sheet-by-sheet	⚠ Type-mixed	✅ Excel-native	Medium

JSONL beats CSV on nested objects, native types, and nullable fields. It beats a plain JSON array on streaming, since no whole-file parse is needed. It also aligns with the ETL ecosystem (modern data warehouses emit JSONL natively), and the implementation is estimated at ~50 LOC, which makes it a large composability win for little effort.
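As a rough illustration of why the JSONL reader is small, the sketch below turns a stream of text chunks into rows by splitting on `\n` and calling `JSON.parse` per line. `readJsonl` is a hypothetical name, and the chunk source is left abstract (any sync or async iterable of strings), so this is not the planned v0.2 implementation.

```typescript
// A minimal JSONL reader sketch. `readJsonl` is a hypothetical name; the input
// is any stream of text chunks, whose boundaries need not align with lines.
type Row = Record<string, unknown>;

async function* readJsonl(
  chunks: AsyncIterable<string> | Iterable<string>,
): AsyncGenerator<Row> {
  let buffer = "";
  for await (const chunk of chunks) {
    buffer += chunk;
    let newline = buffer.indexOf("\n");
    while (newline !== -1) {
      // trim() also removes a trailing "\r" from CRLF line endings.
      const line = buffer.slice(0, newline).trim();
      buffer = buffer.slice(newline + 1);
      if (line.length > 0) yield JSON.parse(line) as Row;
      newline = buffer.indexOf("\n");
    }
  }
  // Handle a final line that has no trailing newline.
  const last = buffer.trim();
  if (last.length > 0) yield JSON.parse(last) as Row;
}
```

In this sketch a malformed line throws and ends the stream; a reader that follows the per-row error model would more likely report the bad line and continue.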

For JSON with iterator path (single object containing an array, like OSCAL): support via recipe source.iterator: $.catalog.controls[*] (JSONata-shaped path expression). Engine uses stream-json to navigate lazily.

What `ParsedData` is (and isn’t)

The bundled engine’s TypeScript code uses a small interface called ParsedData for the wizard’s preview step:

interface ParsedData {
    columns: string[];
    rows: Row[] | AsyncIterable<Row>;  // streaming-friendly
    rowCount?: number;
}

Concept	What it IS	What it ISN’T
`ParsedData`	An in-memory TypeScript interface used by the wizard for preview + by the engine as the “structured rows” input shape. May wrap a complete array (small data) OR an async iterator (large data).	A persisted intermediate file format. NOT a tier. NOT something external producers consume or emit. (External producers emit CSV/JSON/XML/XLSX — formats the engine knows how to parse — or Tier 1 Markdown directly.)
Tier 1	The canonical Markdown vault format on disk, conforming to `spec/tier1.schema.json`. The load-bearing contract.	A serialization-only artifact — it’s the shared vocabulary every producer (bundled engine, external CLI, AI agent) must produce.

ParsedData is just how the engine internally represents “the rows to iterate.” It’s not part of the architectural contract.

How Mode 1 works inside the plugin (v0.1 implementation)

The bundled engine’s per-row pipeline (after Ch 22 + Ch 23 + Ch 24 design phase, implemented in v0.1.2 + v0.1.3 + v0.1.4, with streaming wired in v0.1.4.5):

Row iterator                    ◄── streaming source
(AsyncIterable<Row>)                  ChunkyCSV pipe / PapaParse step /
   │                                   AsyncIterator<JSON record> / etc.
   │  (one row pulled at a time)
   ▼
Row n
   │
   ▼
ConceptIdentity {curie, scope}  ◄── scope = the row itself
   │
   ▼  + Recipe
render(Recipe, Identity)        ◄── pure function (Ch 22 §3)
   │                                 single coupling point
   ▼
Address {                       ◄── what render() returns
  primary: {path, anchor?},
  wikilinkTarget,
  tags[], aliases[],
  frontmatter (managed)
}
   │
   ▼  + buildProvenance() + body
validateTier1Frontmatter()      ◄── pre-write gate (v0.1.4)
   │                                STRM enforcement here
   ▼
mergeFrontmatter(existing, new) ◄── user_preserve survives (Ch 22 §8.4)
   │
   ▼
app.vault.create() / .modify()  ◄── Tier 1 file written
   │
   ▼
Row n discarded from RAM        ◄── streaming complete; back for row n+1

render() is the single coupling point between recipe and vault layout. Pass 1 is vault-independent (deterministic, hashable, replayable). Per Ch 22 synthesis, this purity is what makes canonical-state hashing work.
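The purity claim is easier to see in code. The sketch below models `render()` as a function of the recipe and the identity only, returning an `Address` shaped like the one in the diagram. The `Recipe` fields (`pathTemplate`, `tagTemplates`), the `fill` helper, and the choice to put the CURIE into aliases and frontmatter are assumptions for illustration, not the shipped types.

```typescript
// Hypothetical shapes modeled on the pipeline diagram; recipe fields are assumptions.
interface ConceptIdentity {
  curie: string;
  scope: Record<string, unknown>; // the source row
}

interface Recipe {
  pathTemplate: string; // e.g. "Frameworks/{family}/{control_id}.md"
  tagTemplates: string[];
}

interface Address {
  primary: { path: string; anchor?: string };
  wikilinkTarget: string;
  tags: string[];
  aliases: string[];
  frontmatter: Record<string, unknown>;
}

// Replaces {name} placeholders with values from the row; missing values render empty.
function fill(template: string, scope: Record<string, unknown>): string {
  return template.replace(/\{(\w+)\}/g, (_match, name: string) => {
    const value = scope[name];
    return value === undefined || value === null ? "" : String(value);
  });
}

// Pure: the result depends only on the recipe and the identity, never on vault state.
function render(recipe: Recipe, identity: ConceptIdentity): Address {
  const path = fill(recipe.pathTemplate, identity.scope);
  return {
    primary: { path },
    wikilinkTarget: path.replace(/\.md$/, ""),
    tags: recipe.tagTemplates.map((t) => fill(t, identity.scope)),
    aliases: [identity.curie],
    frontmatter: { curie: identity.curie },
  };
}
```

Since nothing here reads the vault, calling `render()` twice with the same inputs yields the same `Address`, which is the property that hashing and replay rely on.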

Why import is the hardest part

What import has to get right, in approximate order of how often it goes wrong:

Format diversity — CSV, XLSX, JSON, YAML, OSCAL, RDF, MCP server, scraped HTML, OneNote, Notion export. Each has its own tabular-vs-tree shape and its own dirtiness.
Tabular-to-tree depth crossing — most messy sources are flat tables that encode a tree (parent-id columns, dotted IDs, prefix conventions). Recovering the tree is non-trivial and source-specific.
Identity stability — the same concept may appear in three sources with three IDs. Tier 1 needs a single canonical identity (CURIE + sha256 CID) per concept; recipe authors decide the canonicalization.
Polysemic target choice — even after the source is parsed, the recipe author has to choose how it lands: folder hierarchy? heading hierarchy? tag hierarchy? wikilink-graph? Some composition? See hierarchy primitives.
Depth control — the same source supports different vault shapes depending on how deep the user wants the hierarchy materialized. NIST CSF 2.0 imported “two layers deep” produces one vault; “all the way to subcategories” produces a different one. Both are valid.
Provenance — Tier 1 needs to record where every concept came from, when, with what version, and (eventually) signed by whom.
Re-import — version bumps must produce a diff against the prior import, not a silent overwrite or a duplicate vault.
Round-trip — exporting Tier 1 back to source-shape (or to OSCAL, or to SSSOM) must work without information loss for the round-trip subset.

No off-the-shelf ETL tool does all of these. Tools come close on subsets — dbt nails reproducibility-and-tabular-transform; RML nails schema-mapping-to-graphs; MCP nails external-source-integration. None target Markdown-vault polyhierarchy with provenance and round-trip as a first-class destination.

This is why the import boundary needs first-principles thinking, not a “pick the popular ETL framework and hope” approach.
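The re-import requirement from the list above (a diff against the prior import, not a silent overwrite) can be sketched as a comparison of two snapshots keyed by CURIE. This assumes each concept carries a content hash; `Snapshot`, `ImportDiff`, and `diffImports` are hypothetical names, and the real diff atoms may be richer than added/removed/changed.

```typescript
// Hypothetical sketch: diff two imports keyed by CURIE, comparing content hashes.
type Snapshot = Map<string, string>; // CURIE -> content hash (CID)

interface ImportDiff {
  added: string[];
  removed: string[];
  changed: string[];
}

function diffImports(previous: Snapshot, next: Snapshot): ImportDiff {
  const diff: ImportDiff = { added: [], removed: [], changed: [] };
  next.forEach((cid, curie) => {
    const old = previous.get(curie);
    if (old === undefined) diff.added.push(curie);
    else if (old !== cid) diff.changed.push(curie); // same identity, new content
  });
  previous.forEach((_cid, curie) => {
    if (!next.has(curie)) diff.removed.push(curie);
  });
  return diff;
}
```

Keying on identity and comparing on hash is what lets a version bump show up as a small set of changed concepts instead of a duplicate vault.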

The four architectural pieces

Crosswalker’s import space breaks into four orthogonal pieces. They compose; users adopt as much or as little as they want.

1. The Tier 1 schema (machine-readable contract)

Single source of truth for what canonical Crosswalker output looks like. Spelled as JSON Schema (or equivalent — CUE, Dhall) so any external tool can validate against it without bespoke parsers. Contains:

File-naming rules
Frontmatter shape (required + optional fields, types, allowed values)
Folder layout conventions (when applicable)
Wikilink target shape and resolution rules
Provenance fields (source ref, ingestion timestamp, recipe used, content hash)
Identity rules (CURIE format, sha256 CID computation)

Anything that emits valid Tier 1 — produced however — is a Crosswalker-conformant import. The schema is the entire interface.
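As an illustration of the identity rules, the sketch below builds a CURIE from a prefix and a local ID and computes a sha256 hash over a key-sorted JSON serialization. Both the normalization and the choice of hashed bytes are assumptions made for this example; the authoritative rules belong to the Tier 1 schema. `createHash` is Node's standard `node:crypto` API.

```typescript
import { createHash } from "node:crypto";

// Hypothetical identity helpers. The CURIE pattern and the exact bytes that feed
// the hash are assumptions; the real rules live in the Tier 1 schema.
function curieFromPattern(prefix: string, localId: string): string {
  // Strip whitespace so "AC- 1" and "AC-1" do not become two identities.
  const normalized = localId.trim().replace(/\s+/g, "");
  return `${prefix}:${normalized}`;
}

// Serializes a value with sorted object keys so key order does not change the hash.
function canonicalJson(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map((item) => canonicalJson(item)).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const record = value as Record<string, unknown>;
    const keys = Object.keys(record).sort();
    const parts = keys.map((k) => `${JSON.stringify(k)}:${canonicalJson(record[k])}`);
    return `{${parts.join(",")}}`;
  }
  // JSON.stringify(undefined) yields undefined at runtime; treat it as null.
  return JSON.stringify(value) ?? "null";
}

// Hex-encoded SHA-256 over the canonical serialization.
function sha256Cid(content: unknown): string {
  return createHash("sha256").update(canonicalJson(content), "utf8").digest("hex");
}
```

Canonicalizing before hashing matters because two producers may emit the same fields in different orders; without it, identical concepts would hash differently.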

2. The bundled ETL engine (convenience)

A lightweight recipe runtime built into the plugin (or accessible via CLI) that handles the common 80% of imports without leaving Obsidian:

Common formats out of the box (CSV, XLSX, JSON, YAML)
Closed primitive vocabulary for transforms
Recipe authoring through the import wizard UI
Schema validation on output before write

This is what most users will use most of the time. The engine is not the primitive — it’s just a producer of valid Tier 1, like any other producer. Power users who outgrow it can step into option 3 below without changing how anything downstream of Tier 1 works.

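The exact engine implementation (in-plugin TS vs external Python vs hybrid) was open research under Challenge 23. It was resolved on 2026-05-04 in favor of pure TypeScript in-plugin for v0.1, with an opt-in external Python producer (hybrid) deferred to v0.5+. See the Ch 23 synthesis log and the open design questions table at the end of this page.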

3. External producers (custom transforms)

When the bundled engine isn’t enough — messy source, unusual transform, domain-specific cleanup — users run their own toolchain (Python + Polars, dbt, Jupyter notebooks, custom scrapers, MCP servers, AI-agent-driven extraction) and emit Tier 1 directly. They use the schema as the contract.

This path is first-class, not a fallback. Crosswalker explicitly does not try to be the universal ETL engine. It tries to be the universal target.

The user’s own ChunkyCSV and JSONaut are examples of external producers — purpose-built ETL tools that could be configured to emit Tier 1 without being absorbed into Crosswalker’s codebase.

4. The community marketplace (pre-transformed bundles)

Once an ontology is transformed to Tier 1, it stays transformed. There is no reason every NIST 800-53 user should re-do the import. Crosswalker treats community-shared, pre-transformed Tier 1 bundles as a load-bearing piece of the architecture, not a nice-to-have:

Someone (the maintainer of an ontology, or an early adopter) does the transformation work once.
They publish the resulting Tier 1 directory as a .zip or git repo.
Other users download it and have a working vault in seconds, with no recipe authoring at all.
When the upstream ontology updates, the maintainer publishes a new bundle; downstream users pull a diff.

This pattern is the answer to the messy-source problem. The bundled engine handles tree-shaped sources well (JSON, YAML, OSCAL); the marketplace fills the gap for messy tabular sources (the original NIST 800-53 XLSX, MITRE ATT&CK matrices) by community-shared pre-transformation. A user with NIST CSF 2.0 doesn’t need to fight the spreadsheet — they download the pre-transformed bundle.

Mechanism	Where it lives	Notes
In-repo registry	Subfolder of Crosswalker repo	Tight integration; CI validates every bundle on merge; PRs from community mix with engine PRs
Companion repo	Separate `crosswalker-recipes` (or `crosswalker-bundles`) repo	Cleaner separation; independent versioning; easier for non-engine contributors

Either works; the choice is deferrable. Both are GitHub-based, both are copy-paste-or-clone usable.

The five-axis recipe selection

Even after a source is parsed, every recipe makes five orthogonal choices about how the source lands as Tier 1:

Axis	Question	Examples
Depth	How many levels of the source hierarchy materialize as vault structure?	NIST CSF 2.0: “two layers deep” vs “all the way to subcategories”
Mechanism	Which of the four hierarchy primitives carry the structure?	Folder, heading, tag, wikilink-graph, or composition
Filter	Which subset of the source is imported?	“Only AC family controls” / “only HIGH-baseline controls” / “all”
Granularity	One file per leaf concept, or one file per group with leaves as headings?	Per-control file vs per-family file
Projection	Which fields of each concept land as frontmatter, body, wikilink, or are dropped?	Control text → body; control ID → frontmatter; baseline → tag

These five axes are why “the recipe” is a non-trivial declarative artifact, not a single transform expression. They’re also the reason a single source ontology can produce many legitimately-different vaults — and why Crosswalker can’t just “auto-import” a source without recipe author input.
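One way to picture the five axes is as fields of a single declarative object. The typing below is a sketch only: the field names, the value sets, and the equality-based filter are assumptions derived from the table, not the shipped recipe schema.

```typescript
// Hypothetical typing of the five axes; names and value sets are assumptions.
type Mechanism = "folder" | "heading" | "tag" | "wikilink-graph";
type Sink = "frontmatter" | "body" | "wikilink" | "tag" | "drop";
type Row = Record<string, unknown>;

interface RecipeAxes {
  depth: number | "all"; // levels of hierarchy materialized as structure
  mechanism: Mechanism[]; // more than one entry means composition
  filter: Record<string, unknown>; // field -> required value; empty imports all
  granularity: "file-per-leaf" | "file-per-group";
  projection: Record<string, Sink>; // source field -> where it lands
}

const highBaselineAc: RecipeAxes = {
  depth: 2,
  mechanism: ["folder", "tag"],
  filter: { family: "AC", baseline: "HIGH" },
  granularity: "file-per-leaf",
  projection: { control_text: "body", control_id: "frontmatter", baseline: "tag" },
};

// Filter axis: a row is imported only if every filter field matches exactly.
function passesFilter(row: Row, axes: RecipeAxes): boolean {
  return Object.keys(axes.filter).every((key) => row[key] === axes.filter[key]);
}

// Projection axis: splits one row's fields by sink; unmapped fields are dropped.
function projectRow(row: Row, axes: RecipeAxes): Record<Sink, Row> {
  const out: Record<Sink, Row> = { frontmatter: {}, body: {}, wikilink: {}, tag: {}, drop: {} };
  for (const field of Object.keys(row)) {
    const sink: Sink = axes.projection[field] ?? "drop";
    out[sink][field] = row[field];
  }
  return out;
}
```

Changing any one field of `highBaselineAc` (say, `granularity`) yields a different but equally valid vault from the same source, which is the point of keeping the axes orthogonal.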

The ~40-primitive transformation catalog

Underneath the recipe, the bundled engine implements a closed set of transformation primitives. Comprehensive but small. Drawn from RML/YARRRML, FNML, JSONata, dbt, and the user’s prior tooling. Grouped by category:

Category	Primitives (illustrative)
Source iteration	iterate-rows, iterate-records, iterate-tree-nodes, iterate-paths, iterate-grouped
Identity / ID synthesis	curie-from-pattern, sha256-cid, uuid-v7, slugify, normalize-id
Field projection	project, rename, drop, default, coerce-type, parse-date
String transforms	trim, lowercase, uppercase, replace, split, join, regex-extract, regex-replace
Tree-from-flat	parent-id-to-tree, dotted-id-to-tree, prefix-to-tree, indent-to-tree
Joins / lookups	inner-join, left-join, lookup-table, fuzzy-match, alias-resolve
Address rendering	folder-path, heading-anchor, tag-path, wikilink-target
Validation / guard	require-field, allowed-values, type-check, schema-validate
Provenance	record-source-ref, record-version, record-timestamp, record-hash

A recipe is a pipeline (or DAG) of these primitives. Composition is what makes the engine complete enough — Macro Tree Transducer theory (per Ch 20 deliverable A) tells us roughly 5–6 algebraic operators are sufficient; the ~40 above unfold those operators into named, recipe-author-friendly forms.

The catalog is not finalized. It is the active design surface.
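To give a feel for what one catalog entry might do, here is a sketch of a tree-from-flat primitive in the spirit of `dotted-id-to-tree`: it takes flat IDs such as `1.2.1` and recovers the parent/child structure, creating missing ancestors along the way. The function name, the `TreeNode` shape, and the ancestor-synthesis behavior are assumptions; the real primitive's semantics are still being designed.

```typescript
// Hypothetical sketch of a "dotted-id-to-tree" style primitive.
interface TreeNode {
  id: string;
  children: TreeNode[];
}

function dottedIdsToTree(ids: string[], separator = "."): TreeNode[] {
  const roots: TreeNode[] = [];
  const byId = new Map<string, TreeNode>();

  // Returns the node for `id`, creating it (and any missing ancestors) on demand.
  const ensure = (id: string): TreeNode => {
    const existing = byId.get(id);
    if (existing) return existing;
    const node: TreeNode = { id, children: [] };
    byId.set(id, node);
    const cut = id.lastIndexOf(separator);
    if (cut === -1) {
      roots.push(node); // no separator: top-level node
    } else {
      ensure(id.slice(0, cut)).children.push(node); // attach under the parent prefix
    }
    return node;
  };

  ids.forEach((id) => ensure(id));
  return roots;
}
```

Given `["1.1", "1.2.1", "2"]`, the sketch produces roots `1` and `2`, with `1.1` and `1.2` under `1` and `1.2.1` under `1.2`, even though `1` and `1.2` never appeared in the input.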

YARRRML, explained simply

Several of the Ch 20 deliverables recommended YARRRML as the surface DSL for recipes. This is a real choice the user has flagged for further evaluation. Here’s the plain-English version:

YARRRML is a YAML format for writing data-mapping recipes. You write a YAML file that says “this column in my CSV maps to this property in the output, with this transform.” That’s it.

The complicated-looking name comes from “YAML for RDF Mapping Language” — its original use was generating RDF triples. But the structure (sources / mappings / transforms / sinks) is widely useful even if you ignore the RDF heritage and retarget to Markdown notes.

A toy YARRRML-shaped recipe (not real syntax — illustrative):

sources:
  controls: { type: csv, file: NIST_800-53_r5.csv }

mappings:
  control:
    iterator: $.controls.rows
    target:
      file: "Frameworks/NIST 800-53 r5/{{family}}/{{control_id}}.md"
      frontmatter:
        title: "{{control_id}} — {{control_name}}"
        family: "{{family}}"
        baseline: "{{baseline}}"
        tags:
          - "framework/nist-800-53-r5"
          - "framework/nist-800-53-r5/{{family|lowercase}}"
      body: |
        ## Control text
        {{control_text}}

Five things to notice:

Sources — declare inputs once, by type and location.
Mappings — for each row/record/node, declare the output target.
Templates — {{...}} interpolates fields with optional filters (|lowercase).
Sinks — file, frontmatter, body, tags. Closed vocabulary; same vocabulary covers any domain.
No code — this is data, not Python. An AI agent can read, generate, modify, and validate it.
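The template mechanic in the toy recipe can be sketched in a few lines. This interpolator handles the illustrative `{{field|filter}}` form only; the function name `interpolate` and the three filters are assumptions, not the engine's shipped filter set and not real YARRRML syntax.

```typescript
// Hypothetical interpolation for the toy {{field|filter}} syntax shown above.
// The filter set here is illustrative, not the engine's shipped filter list.
const filters: Record<string, (value: string) => string> = {
  lowercase: (v) => v.toLowerCase(),
  uppercase: (v) => v.toUpperCase(),
  trim: (v) => v.trim(),
};

function interpolate(template: string, row: Record<string, unknown>): string {
  // Group 1: field name. Group 2: zero or more "|filter" segments.
  const pattern = /\{\{\s*([^}|]+?)\s*((?:\|\s*\w+\s*)*)\}\}/g;
  return template.replace(pattern, (_match, field: string, chain: string) => {
    const raw = row[field];
    let value = raw === undefined || raw === null ? "" : String(raw);
    const filterNames = chain
      .split("|")
      .map((s) => s.trim())
      .filter((s) => s.length > 0);
    for (const filterName of filterNames) {
      const fn = filters[filterName];
      if (!fn) throw new Error(`Unknown filter: ${filterName}`);
      value = fn(value); // filters apply left to right
    }
    return value;
  });
}
```

Rejecting unknown filters, instead of passing the value through, keeps the vocabulary closed: a recipe that uses a filter the runtime lacks fails loudly.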

Why it’s a strong candidate: years of refinement, building on the older RML lineage; mature tooling (Matey editor, parsers); tree-shaped recipes are easier to validate, diff, version, and share than imperative scripts. Anyone (author, agent, validator) can reason about a YAML file mechanically.

Why it’s not a foregone conclusion: YARRRML’s RDF heritage may bring vocabulary baggage; retargeting to Markdown sinks needs a dialect; a custom Crosswalker-native DSL might fit better; or JSON Schema + plain JSONata might be enough without the YARRRML wrapper.

A dedicated explainer page lives at agent-context/agent-tooling/yarrrml-explained (TODO — coming as the Ch 23 research and recipe-DSL design firm up).

How this connects to the rest of Crosswalker

Edge-level concerns (STRM predicates, SSSOM rows, junction notes, ontology diff atoms) operate on concept identity and don’t care which import mechanism produced the vault. The single coupling point is the address-rendering function that turns a concept identity into a wikilink target string given the recipe’s hierarchy-primitive choices. See hierarchy primitives for that separation.

This means the import-side architecture (this page) is strictly orthogonal to the edge-side architecture. A user could import the same NIST 800-53 catalog with three completely different recipes and three completely different vault layouts; the resulting STRM/SSSOM/junction-note structure would be semantically identical.

Where this came from — broader portfolio context

Crosswalker’s stance on import (“schema as primitive, ETL as convenience, marketplace as community escape hatch”) emerged from a portfolio of related tools the author has built over time, all aimed at applied information science with Obsidian as the platform:

ChunkyCSV — CSV-shaped tabular ETL with tree transducer flavor. Precedent for the “tabular sources need first-class tree-recovery” axis.
JSONaut — declarative JSON transformation. Precedent for the “external producers should be able to emit Tier 1 without bespoke Crosswalker code” axis.
SEACOW — meta-framework for knowledge organization in filesystem primitives. Precedent for the “primary folder + parallel tag hierarchy” composition pattern.
folder-tag-sync — Obsidian plugin that bidirectionally syncs folder hierarchy with tag hierarchy. Precedent for live composition rules between two of the four hierarchy primitives.

Crosswalker is not these tools rebranded — it is the general ingestion target their patterns point at. The marketplace pattern in particular is the answer to the question that ChunkyCSV and JSONaut individually pose: “we keep doing this transform every time someone imports the same source — why?”

Open design questions

Items the architecture has named, with the current lean and commitment status of each:

Question	Default lean	Status
Bundle engine implementation language	Pure TypeScript in-plugin (Path A) for v0.1; Hybrid (Path C — opt-in external Python producer) for v0.5+	✅ RESOLVED 2026-05-04 — see Ch 23 synthesis log. Mobile-Obsidian portability + small-OSS contributor pool forced the verdict
Surface DSL flavor	YARRRML-shaped (declarative); v0.1 may ship a simpler subset	Open
Marketplace mechanism	Either in-repo registry or companion repo; both viable	Deferrable
Hardcoded vs declarative dialect spec	Declarative (the dialect IS data even when only one ships)	Provisionally settled
`target_schema` as JSON Schema (machine-readable) vs prose	Machine-readable	Settled
Recipe schema runtime-agnostic	JSON Schema + AJV; JSONata 2.x as expression sub-language; engine implementations swappable without breaking user files	✅ Settled 2026-05-04 per Ch 23 §4 — the single most important architectural commitment of the design phase
External-producer protocol surface (push-into-Crosswalker via MCP)	Defer until Tier 1 schema lands	Open

See the 2026-05-04 import engine design log for the canonical state of these decisions.

Concept pages:

Hierarchy primitives — folder, file, heading, tag, wikilink and how recipes compose them via the 5-mechanism grammar
The problem — hierarchy-vs-graph tension; why ingestion target matters
What makes Crosswalker unique — Spec / Library / Integrations pillar that this page activates
Embedded vs server substrates — Tier 2 sidecar projection from Tier 1
Ontology evolution — re-import semantics + provenance
Terminology — definitions: Tier 1, recipe, render(), CURIE, ParsedData (Path A only), provenance

Agent context:

v0.1 schema spec — current Tier 1 contract; the load-bearing primitive this page describes
Vision — schema-as-primitive is the central architectural commitment
Tradeoffs — convenience-vs-canonical, bundled-engine-vs-external-producer

Implementation milestones (Path A):

v0.1.1 — Type system + validation foundation — AJV wires the schema into runtime
v0.1.2 — render() v1 — the single coupling point ships
v0.1.3 — Generation engine integration — render() wired through the bundled engine
v0.1.4 — Junction notes + crosswalk edges — kind discriminator + STRM enforcement at the validation gate
v0.1.5 — Tier 2 sqlite-wasm sidecar projector — Tier 1 → Tier 2 projection (queries layer)
v0.1.7 — Exporters — Tier 1 → STRM TSV / OSCAL JSON / SSSOM TSV (round-trip determinism)

Design decisions (synthesis logs):

2026-05-05 ETL pipeline clarification — what ParsedData is + isn’t; the three producer paths drawn explicitly
2026-05-04 import-engine design log — canonical state of import-engine commitments
Ch 22 synthesis (target-structure expressivity) — closed 5-mechanism recipe grammar; render() signature
Ch 23 synthesis (bundle/engine/language) — Path A for v0.1, Path C for v0.5+; runtime-agnostic recipe schema (the most important modularity commitment)
Ch 24 synthesis (Tier 2 substrate) — sqlite-wasm + sqlite-vec
Ch 20 synthesis log (formal transformation algebra) — the wargaming setup
v0.1.3 delivery log
v0.1.4 delivery log

Research challenges (resolved → archive):

Ch 21 — Build vs buy ETL engine — meta-question above engine implementation
Ch 23 — Bundle engine language (archived) — resolved 2026-05-04: Path A for v0.1, Path C for v0.5+

Research deliverables:

Ch 20 deliverable a (T1TMA primitives)
Ch 20 deliverable c (RML retargeted) — R2RML lineage of the {var|filter} template grammar
Ch 22 deliverable (target-structure expressivity)
Ch 23 deliverable (bundle/engine/language)

Spec files:

spec/tier1.schema.json — the contract this page describes
spec/recipe.schema.json — recipe shape Path A consumes