ETL and import — schema as primitive, ETL as convenience

The import side is the hardest part of Crosswalker. Everything a user wants to do — query, crosswalk, evidence, version, share — flows through whatever they already have, in whatever shape it arrives, and lands in the Tier 1 schema. Get the import boundary right and the rest of the architecture works. Get it wrong and no amount of cleverness downstream rescues it.

This page frames how Crosswalker thinks about that boundary. For the system-wide picture (all six layers, not just import), see the system architecture page.

The reframe — schema is the primitive, ETL is convenience

Most ingestion tools start with the engine and treat the output schema as a configuration knob. Crosswalker inverts that:

The Tier 1 schema is the primitive. Crosswalker is, before anything else, an ingestion target — a precise, machine-readable contract that says what canonical Markdown + frontmatter + folder layout + wikilinks looks like. The engine that produces conforming output is convenience. Anyone can produce conforming output any way they like.

Architectural precedents for this stance: HTML, JSON Schema, OpenAPI, SBOM (SPDX/CycloneDX), CSVW, Markdown itself. Each is a receiving format — the spec is the load-bearing artifact; the producer ecosystem grew around it.

Translated to Crosswalker:

| Layer | Status |
| --- | --- |
| Tier 1 schema (the contract) | Load-bearing primitive. Machine-readable JSON Schema. |
| Bundled ETL engine (the convenience) | Optional implementation that produces Tier 1 from common source shapes. |
| External producers (custom code, dbt, Python, scrapers, MCP servers) | First-class citizens. Anything that emits valid Tier 1 is welcome. |
| Community marketplace (pre-transformed bundles) | Once an ontology is shaped to Tier 1, it stays shaped. Communities share .zips of Tier 1 directories. |

This reframe collapses several open architectural questions (build vs buy, DSL choice, in-plugin vs external runtime, recipe location) into one commitment. The engine becomes swappable; the schema does not.

The two-mode architecture — bundled projector or direct emission

Updated 2026-05-05 to clarify how external ETL tools (ChunkyCSV, JSONaut, Polars/DuckDB scripts, dbt, etc.) compose with the bundled engine. See the 2026-05-05 two-mode architecture decision log for the full rationale.

The bundled engine has one job: take structured rows + a recipe, and emit Tier 1 Markdown. That’s it. Any producer — including users with no external tooling, ChunkyCSV pipelines, dbt projects, AI agents, MCP servers — chooses between two entry points:

| Mode | Entry point | What the producer does | Streaming responsibility |
| --- | --- | --- | --- |
| Mode 1: Bundled projector | Hand structured rows (CSV/JSON/XML/XLSX) + recipe to the bundled engine | Produces a structured input the engine knows how to iterate | Bundled engine streams per-row through render() → write → discard. Producer can also stream upstream (ChunkyCSV → CSV file); both layers stream. |
| Mode 2: Direct emission | Bypass the bundled engine; write Tier 1 Markdown files into the vault | Produces Tier 1 Markdown end-to-end | Producer’s responsibility entirely (typical: per-concept Markdown emission already streams naturally) |

Both modes are first-class architectural citizens per the schema-as-primitive commitment. The schema (Tier 1) is the contract; the engine is convenience.

                            ┌───────────────────────┐
   USER SOURCE DATA  ───►   │  Producer of choice   │
   (any shape, any size)    │  (external or inline) │
                            └──────────┬────────────┘
                                       │
                ┌──────────────────────┴──────────────────────┐
                │                                             │
                ▼                                             ▼
    ┌─ MODE 1 — Bundled projector ─────┐    ┌─ MODE 2 — Direct emission ─────┐
    │  Producer emits:                  │    │  Producer emits:                │
    │   structured rows                 │    │   Tier 1 Markdown directly      │
    │   (CSV / JSON / XML / XLSX)       │    │                                 │
    │                                   │    │  Examples:                      │
    │  Examples:                        │    │   • AI agent writes per-concept │
    │   • User uploads raw CSV in       │    │     .md files                   │
    │     wizard (engine parses)        │    │   • Python script renders YAML  │
    │   • ChunkyCSV streams huge XLSX → │    │     + Markdown end-to-end       │
    │     emits cleaned CSV             │    │   • Pre-built vault bundle      │
    │   • JSONaut transforms messy JSON │    │     downloaded from marketplace │
    │     → emits normalized JSON       │    │                                 │
    │   • dbt models project SQL → CSV  │    │  Bundled engine: not invoked    │
    │                                   │    │                                 │
    │  Bundled engine then:             │    │                                 │
    │    parses input (streaming) →     │    │                                 │
    │    iterates rows →                │    │                                 │
    │    render() per row →             │    │                                 │
    │    writes Tier 1 file →           │    │                                 │
    │    discards row from RAM          │    │                                 │
    └──────────────────┬────────────────┘    └─────────────────┬───────────────┘
                       │                                       │
                       └─────────────────┬─────────────────────┘
                                         │
                                         ▼
                        ┌────────────────────────────────────┐
                        │         TIER 1 VAULT               │
                        │  Markdown + YAML frontmatter        │
                        │  conforming to                      │
                        │  spec/tier1.schema.json             │
                        │  ────────────────────────           │
                        │  THE LOAD-BEARING CONTRACT          │
                        └────────────────────────────────────┘

Mode 1 lets ChunkyCSV / JSONaut / dbt / Polars scripts compose naturally. They already produce CSV/JSON. Crosswalker just consumes that. The user doesn’t have to teach their existing ETL tools how to emit Markdown — they keep doing what they’re good at, and the bundled engine handles the projection layer.

Mode 2 lets AI agents and bundled-marketplace publishers emit Tier 1 end-to-end. When an agent extracts an ontology from a corpus, it produces concepts directly — there’s no natural intermediate “rows” representation. Skipping the bundled engine is the right call.

Streaming is at the engine boundary, not before it

The bundled engine is streaming-by-design (per v0.1.4.5 streaming refactor): it accepts rows as an iterator/async-iterator, calls render() per row, writes the file, and discards the row before reading the next one. The full source dataset never exists in RAM at once.

This is why Mode 1 + ChunkyCSV / JSONaut composes well: ChunkyCSV streams from a multi-gigabyte source, emits a smaller-but-still-streaming CSV, the bundled engine streams through that without accumulating, and the resulting Tier 1 vault is materialized on disk file-by-file. End-to-end streaming pipeline.
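The pull-one, render, write, discard loop described above can be sketched roughly as follows. This is a minimal illustration, not the plugin's code: `streamImport`, `Rendered`, `fakeSource`, and the `write` callback are hypothetical names, and a synchronous generator stands in for the streaming source to keep the example short.

```typescript
// Hypothetical types for illustration; not the plugin's real API.
type Row = Record<string, unknown>;
interface Rendered { path: string; body: string }

// Pull one row at a time, render it, hand it to the writer, then let it go.
// Nothing accumulates between iterations, so memory use stays flat.
function streamImport(
    rows: Iterable<Row>,
    render: (row: Row) => Rendered,
    write: (file: Rendered) => void,
): number {
    let written = 0;
    for (const row of rows) {
        write(render(row)); // the row is unreachable after this iteration
        written++;
    }
    return written;
}

// A lazy source: rows are produced only when the loop asks for them.
let produced = 0;
function* fakeSource(n: number): Generator<Row> {
    for (let i = 1; i <= n; i++) {
        produced++;
        yield { control_id: `AC-${i}` };
    }
}

const paths: string[] = [];
const producedAtWrite: number[] = []; // how many rows existed at each write
const count = streamImport(
    fakeSource(3),
    (row) => ({ path: `${String(row["control_id"])}.md`, body: "" }),
    (file) => { paths.push(file.path); producedAtWrite.push(produced); },
);
```

The `producedAtWrite` log records that each row is written before the next one is generated, which is the property that lets a multi-gigabyte source pass through without being held in RAM.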

What “structured” actually means — the Mode 1 input contract

A common follow-up question once the two-mode architecture clicks: how does the bundled engine know an input is “clean enough”? Is there an encoding requirement? A specific JSON shape? An XML schema?

Honest answer: the contract is shape-level, not format-level. The engine consumes an iterable of records. Each record is {columnName: value}. Columns are referenced by name in the recipe’s template strings ({control_id}, {family.title}). That’s the entire contract.

The bundled engine doesn’t validate the input against a “Tier 0.5 schema” because no Tier 0.5 schema exists (2026-05-05 two-mode architecture decision §“No Tier 0.5” confirms this). The engine just calls render(recipe, identityFromRow) on each record and writes the resulting Tier 1 file. Per-row failures (missing template variable, malformed encoding, validation rejection) become per-row errors; the import continues.

// The engine consumes this:
interface RowSource {
    columns: string[];                              // header — column names referenced by recipe
    rows: Iterable<Row> | AsyncIterable<Row>;       // the records
    rowCount?: number;                              // optional; -1 or omitted if streaming
}

interface Row {
    [columnName: string]: string | number | boolean | null | undefined | object;
    // Nested values OK — recipe templates use dotted access: {obj.field.subfield}
}

Required:

  1. Stable columns across records — every row has the same set of keys (missing keys default to undefined, which produces empty values)
  2. Recipe-referenced columns are present — if a template uses {control_id}, the source must have a control_id column
  3. Values addressable by recipe templates — flat strings are simplest; nested objects work via dotted access; arrays need iterator declarations in the recipe
  4. Encoding: UTF-8 — every text format (CSV, JSON, XML, XLSX-as-CSV-export) should arrive UTF-8

That’s it. No strict schema validation upstream of the engine.
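Requirements 2 and 3 can be checked mechanically before an import starts. The sketch below is an assumption about how such a check could look, not the engine's implementation: `referencedColumns`, `missingColumns`, and `resolve` are hypothetical helpers, and only plain `{column}` and `{a.b.c}` references are handled (no filters or JSONata expressions).

```typescript
type Row = Record<string, unknown>;

// Collect the top-level column names a template refers to,
// e.g. "{family.title}/{control_id}.md" -> ["family", "control_id"].
function referencedColumns(template: string): string[] {
    const names = new Set<string>();
    const pattern = /\{([^{}]+)\}/g;
    let match: RegExpExecArray | null;
    while ((match = pattern.exec(template)) !== null) {
        names.add(match[1].trim().split(".")[0]);
    }
    return Array.from(names);
}

// Requirement 2: every referenced column must exist in the source header.
function missingColumns(template: string, columns: string[]): string[] {
    return referencedColumns(template).filter((name) => !columns.includes(name));
}

// Requirement 3: resolve dotted access against one row.
// Returns undefined as soon as any step of the path is absent.
function resolve(row: Row, path: string): unknown {
    let current: unknown = row;
    for (const key of path.split(".")) {
        if (current === null || typeof current !== "object") return undefined;
        current = (current as Record<string, unknown>)[key];
    }
    return current;
}

const template = "{family.title}/{control_id}.md";
const missing = missingColumns(template, ["control_id", "title"]);
const title = resolve({ family: { title: "Access Control" } }, "family.title");
```

Here `missing` lists `family`, because the header has no such column, so the problem can be reported once up front instead of as one error per row.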

Format-by-format guidance for Mode 1 producers

| Format | When to use | What “clean enough” looks like |
| --- | --- | --- |
| CSV | Tabular, flat, common ETL output, Excel exports, ChunkyCSV pipes | UTF-8; first row is header; consistent delimiter (PapaParse auto-detects comma, semicolon, tab, and pipe); RFC 4180 quoting for embedded commas/quotes; one logical record per row; no merged cells |
| JSON (array of records) | Structured records, optionally nested, JSONaut output, dbt JSON exports | UTF-8; top-level array of objects; consistent record shape; nested objects addressable via {a.b.c} templates |
| JSON (single object with array property) | OSCAL bundles, deeply nested ontologies | UTF-8; recipe declares the iterator path (e.g., source.iterator: $.catalog.controls[*], JSONata syntax in v0.1+) |
| XLSX | Excel-native sources (NIST 800-53 catalog as published) | First row is header; one sheet (recipe specifies which); flat cells: no merged cells, no formula references in cell values, no embedded sub-tables |
| XML / RDF / OWL | RDF/OWL ontologies, ISO XML feeds | Not v0.1. Pre-convert to CSV or JSON via external tooling (the user’s SEACOW / JSONaut handle this). Native XML/RDF parsing deferred to v0.2+. |

Whether CSV or JSON is the better Mode 1 input depends on the shape of the source. CSV is fine, and often better, when:

  • The source is naturally tabular (compliance frameworks: control_id, family, title, baseline, … fit a CSV grid perfectly)
  • You’re already producing CSV via Excel export, ChunkyCSV pipes, or SQL-warehouse export
  • Streaming matters (CSV streams trivially line-by-line; JSON streaming requires stream-json or similar)
  • File size is moderate-to-large (CSV is denser than JSON for tabular data — no key repetition per record)

JSON is better when:

  • Records have nested objects ({ control: { id, title, subcontrols: [...] } })
  • One source contains multiple record types (OSCAL bundle = catalog + controls + groups + parameters + back-matter — recipes pick which to iterate)
  • Values are naturally arrays (tags, parent CURIEs, related-control lists)
  • The source’s authoritative published format is JSON (OSCAL, JSON-LD ontologies, MITRE STIX exports)

For NIST 800-53 r5: CSV is fine (columns: control_id, family, title, baseline, …). For NIST OSCAL JSON catalogs: JSON is native. For ISO 27001 controls: CSV if you have a tabular export; XLSX otherwise. For MITRE ATT&CK: JSON (STIX) is native; CSV exports also exist.

The recipe references columns by name regardless of format. The bundled engine has parsers for each. Format choice is a producer-side decision; the engine adapts.

The engine skips upstream schema validation for two reasons:

  1. Adding a schema for the input format would create a Tier 0.5 contract that producers have to satisfy. That violates the schema-as-primitive commitment (Tier 1 is the only contract). External producers would have to learn two schemas instead of one.
  2. Per-row error handling is cleaner than upfront rejection. If a 50,000-row CSV has 3 malformed rows, the engine processes 49,997 rows and reports the 3 failures. An upfront schema would either reject the whole file or silently filter — both worse outcomes.

The schema validation that does happen is at the output boundary: every row’s rendered Tier 1 frontmatter is validated against spec/tier1.schema.json before being written (v0.1.4 strict-mode validation). Bad rows produce errors; good rows produce files. The contract is enforced where it matters.
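A rough sketch of that output-boundary gate, where bad rows become errors and good rows become files. The `validate` function below is a stand-in for validation against spec/tier1.schema.json; the two required fields it checks are invented for the example, and `importWithGate` and `RowError` are hypothetical names.

```typescript
type Frontmatter = Record<string, unknown>;
interface RowError { rowIndex: number; message: string }

// Stand-in for schema validation. The required fields are assumptions
// made for this example, not the actual Tier 1 schema.
function validate(fm: Frontmatter): string | null {
    if (typeof fm["curie"] !== "string" || fm["curie"] === "") return "missing curie";
    if (typeof fm["title"] !== "string") return "missing title";
    return null;
}

// Validate each rendered row just before writing. A failure is recorded
// and the loop moves on, so one bad row never aborts the import.
function importWithGate(
    rendered: Iterable<Frontmatter>,
    write: (fm: Frontmatter) => void,
): { written: number; errors: RowError[] } {
    const errors: RowError[] = [];
    let written = 0;
    let rowIndex = 0;
    for (const fm of rendered) {
        const problem = validate(fm);
        if (problem === null) {
            write(fm);
            written++;
        } else {
            errors.push({ rowIndex, message: problem });
        }
        rowIndex++;
    }
    return { written, errors };
}

const writtenFiles: Frontmatter[] = [];
const result = importWithGate(
    [
        { curie: "nist:AC-1", title: "Policy and Procedures" },
        { title: "No identity" },
        { curie: "nist:AC-2", title: "Account Management" },
    ],
    (fm) => writtenFiles.push(fm),
);
```

With the three sample rows, two files are written and the middle row is reported as a single error at index 1.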

Why ChunkyCSV / JSONaut are the right partners for messy sources

The user’s existing tools fit this gap precisely. They specialize in input cleanup — turning a multi-GB messy XLSX into a streaming UTF-8 CSV with stable columns and proper RFC 4180 quoting. The bundled engine specializes in projection — turning structured rows into Tier 1 vault notes per a recipe. Each layer does what it’s good at; the boundary between them is “an iterable of records with stable columns.” That’s the entire contract — and it’s intentionally narrow so external tools don’t need Crosswalker-specific knowledge.

Transform engine depth — where the in-plugin work stops

A natural follow-on question once Mode 1’s input contract is clear: how much of the messy-source-cleanup problem does Crosswalker try to solve in-plugin? Decided 2026-05-05 (see transform-engine-depth and input-formats decision log + Ch 26):

NARROW                                                                 BROAD
  │                                                                       │
  ▼                                                                       ▼
v0.1 closed     v0.2 wizard     v0.3 inline       v0.5 port jsonaut/    v1.0 full
filter set      adds basic      JSONata expr     chunkycsv as opt-in    in-plugin
(7 filters,     transforms      sub-language     transform engine       transform IDE
shipped)        (rename,        (already                                 (REJECTED —
                trim,           committed in                             scope drift;
                regex,          Ch 23 §6;                                could exceed
                split)          just wires)                              Crosswalker
                                                                         itself)

                                       ▲                                    ▲
                                       │                                    │
                                  CHOSEN                              REJECTED
                                  STOPPING                            (likely
                                  POINT                               permanently)

| Phase | Wizard offers | Recipe author authors | External tool handles |
| --- | --- | --- | --- |
| v0.1 (shipped) | Column-role config | Closed 7-filter templates | Everything else |
| v0.2+ | Column rename, value trim, regex extract, simple split | Same templates | Conditional logic, joins, lookups, flat-to-tree, fuzzy matching |
| v0.3+ | JSONata expression cells for advanced users | + JSONata 2.x expressions inline ({(baseline = "HIGH") ? "high/" : "mod/"}{control_id}.md) | Multi-source joins, time-series, complex pipelines |

Why we don’t port JSONaut / ChunkyCSV: JSONata 2.x is a TS-native, well-maintained library covering ~80% of JSONaut’s declarative-transformation feature set. JSONata is the Ch 23 §6 commitment. ChunkyCSV’s CSV-streaming + multi-GB-cleanup features stay external — that’s literally what ChunkyCSV is for, and forcing it into the plugin would be redundant. Both stay first-class Mode 1 feeders.

Why we don’t build a transform IDE: an in-plugin transform IDE would try to bundle live preview, debugging, profiling, output diff, joins UI, conditional logic UI into one experience. Each piece has a simpler v0.2/v0.3/external alternative (wizard preview, JSONata playground, generation-engine error reports, git diff, JSONata lookup{} syntax). Building the IDE is huge scope drift — could exceed Crosswalker itself.

The bundled engine’s input formats and the rationale per format (decided 2026-05-05):

| Format | Phase | Why | Stream? |
| --- | --- | --- | --- |
| CSV | v0.1 (shipped) | Universal; PapaParse handles encoding + delimiter auto-detect; ChunkyCSV’s natural output | ✅ via PapaParse step + v0.1.4.5 streaming refactor |
| JSONL (newline-delimited JSON) | v0.2 | Stream-friendly (line-by-line); native types + nesting; produced by BigQuery, Spark, Databricks, dbt, JSONaut natively | ✅ trivial: split on \n, JSON.parse per line |
| JSON with iterator path | v0.2 | OSCAL bundles, deeply-nested ontologies; recipe declares source.iterator: $.catalog.controls[*] (JSONata path) | ✅ via stream-json |
| XLSX | v0.3+ (already partial) | Excel-native sources (NIST 800-53 r5 catalog ships as XLSX) | ⚠ sheet-by-sheet via xlsx package |
| XML / RDF / OWL | v0.3+ if demand justifies | RDF/OWL ontologies, ISO XML feeds | ⚠ requires sax-style streaming; defer to v0.3+ |

JSONL is the v0.2 priority because it’s the genuine midway between “raw user CSV” and “fully-cleaned-input-ready-for-the-bundled-engine”:

| Format | Streamable | Schema fidelity | ETL ecosystem | Complexity to integrate |
| --- | --- | --- | --- | --- |
| CSV | ✅ Trivial | ❌ Strings only | ✅ Universal | Done |
| JSONL | Trivial | Native types + nesting | Common (modern data warehouses) | Low (~50 LOC) |
| Plain JSON array | ❌ Whole-array parse | ✅ Native | ✅ Common | Medium |
| JSON with iterator | ✅ via stream-json | ✅ Native + multi-iterator | Limited | Medium |
| XLSX | ⚠ Sheet-by-sheet | ⚠ Type-mixed | ✅ Excel-native | Medium |

JSONL beats CSV on nested objects, native types, and nullable fields, and it beats plain JSON on streaming because no whole-file parse is needed. It also aligns with the ETL ecosystem (modern data warehouses emit JSONL natively) and costs roughly 50 LOC to implement, which makes it a large composability win for little work.
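To make the "split on \n, JSON.parse per line" idea concrete, here is one possible shape for a JSONL reader. It is a sketch under stated assumptions: `parseJsonl`, `parseLine`, and `LineError` are hypothetical names, input arrives as text chunks, and a synchronous generator is used for brevity where the engine would more likely consume an async stream.

```typescript
type Row = Record<string, unknown>;
interface LineError { line: number; message: string }

// Parse one line; blank lines are skipped, bad lines are recorded.
function parseLine(line: string, lineNo: number, errors: LineError[]): Row | undefined {
    const text = line.trim(); // also strips a trailing \r from CRLF files
    if (text === "") return undefined;
    try {
        const value: unknown = JSON.parse(text);
        if (value !== null && typeof value === "object" && !Array.isArray(value)) {
            return value as Row;
        }
        errors.push({ line: lineNo, message: "not a JSON object" });
    } catch {
        errors.push({ line: lineNo, message: "invalid JSON" });
    }
    return undefined;
}

// Turn text chunks into records, one per line. A line may be split across
// chunk boundaries, so the unfinished tail is carried into the next chunk.
function* parseJsonl(chunks: Iterable<string>, errors: LineError[] = []): Generator<Row> {
    let tail = "";
    let lineNo = 0;
    for (const chunk of chunks) {
        const lines = (tail + chunk).split("\n");
        tail = lines.pop() ?? ""; // last piece may be incomplete
        for (const line of lines) {
            lineNo++;
            const row = parseLine(line, lineNo, errors);
            if (row !== undefined) yield row;
        }
    }
    lineNo++;
    const last = parseLine(tail, lineNo, errors); // final line without newline
    if (last !== undefined) yield last;
}

const jsonlErrors: LineError[] = [];
const jsonlRows = Array.from(parseJsonl(
    ['{"id":"AC-1","n":1}\n{"id":', '"AC-2","tags":["a"]}\nnot json\n', '{"id":"AC-3"}'],
    jsonlErrors,
));
```

The second record is split across two chunks and still parses, the malformed third line becomes a per-line error instead of aborting the run, and `n` arrives as a number, not a string.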

For JSON with iterator path (single object containing an array, like OSCAL): support via recipe source.iterator: $.catalog.controls[*] (JSONata-shaped path expression). Engine uses stream-json to navigate lazily.

The bundled engine’s TypeScript code uses a small interface called ParsedData for the wizard’s preview step:

interface ParsedData {
    columns: string[];
    rows: Row[] | AsyncIterable<Row>;  // streaming-friendly
    rowCount?: number;
}

| Concept | What it IS | What it ISN’T |
| --- | --- | --- |
| ParsedData | An in-memory TypeScript interface used by the wizard for preview + by the engine as the “structured rows” input shape. May wrap a complete array (small data) OR an async iterator (large data). | A persisted intermediate file format. NOT a tier. NOT something external producers consume or emit. (External producers emit CSV/JSON/XML/XLSX, formats the engine knows how to parse, or Tier 1 Markdown directly.) |
| Tier 1 | The canonical Markdown vault format on disk, conforming to spec/tier1.schema.json. The load-bearing contract. | A serialization-only artifact; it’s the shared vocabulary every producer (bundled engine, external CLI, AI agent) must produce. |

ParsedData is just how the engine internally represents “the rows to iterate.” It’s not part of the architectural contract.
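Because `rows` may be either a complete array or an async iterable, code that touches ParsedData has to tell the two apart. The helpers below are a hypothetical sketch of that distinction (`isFullyLoaded` and `knownRowCount` are not names from the plugin), and they assume the "-1 or omitted" convention from the RowSource comment also applies here.

```typescript
type Row = Record<string, unknown>;
interface ParsedData {
    columns: string[];
    rows: Row[] | AsyncIterable<Row>;
    rowCount?: number;
}

// True when the whole dataset is already in memory (small data, wizard preview).
function isFullyLoaded(data: ParsedData): data is ParsedData & { rows: Row[] } {
    return Array.isArray(data.rows);
}

// Best-known row count, or undefined when the source is streaming
// and no usable count was supplied (-1 or omitted).
function knownRowCount(data: ParsedData): number | undefined {
    if (isFullyLoaded(data)) return data.rows.length;
    return data.rowCount !== undefined && data.rowCount >= 0 ? data.rowCount : undefined;
}

// A streaming source whose size is not known in advance.
async function* lazyRows(): AsyncGenerator<Row> {
    yield { control_id: "AC-1" };
}

const small: ParsedData = {
    columns: ["control_id"],
    rows: [{ control_id: "AC-1" }, { control_id: "AC-2" }],
};
const large: ParsedData = { columns: ["control_id"], rows: lazyRows(), rowCount: -1 };
```

A preview UI could use `knownRowCount` to decide between showing a progress bar with a total and a simple running counter.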

How Mode 1 works inside the plugin (v0.1 implementation)

The bundled engine’s per-row pipeline (after Ch 22 + Ch 23 + Ch 24 design phase, implemented in v0.1.2 + v0.1.3 + v0.1.4, with streaming wired in v0.1.4.5):

Row iterator                    ◄── streaming source
(AsyncIterable<Row>)                ChunkyCSV pipe / PapaParse step /
   │                                AsyncIterator<JSON record> / etc.
   │  (one row pulled at a time)
   ▼
Row n
   │
   ▼
ConceptIdentity {curie, scope}  ◄── scope = the row itself
   │
   ▼  + Recipe
render(Recipe, Identity)        ◄── pure function (Ch 22 §3)
   │                                single coupling point
   ▼
Address {                       ◄── what render() returns
  primary: {path, anchor?},
  wikilinkTarget,
  tags[], aliases[],
  frontmatter (managed)
}
   │
   ▼  + buildProvenance() + body
validateTier1Frontmatter()      ◄── pre-write gate (v0.1.4)
   │                                STRM enforcement here
   ▼
mergeFrontmatter(existing, new) ◄── user_preserve survives (Ch 22 §8.4)
   │
   ▼
app.vault.create() / .modify()  ◄── Tier 1 file written
   │
   ▼
Row n discarded from RAM        ◄── streaming complete; back for row n+1

render() is the single coupling point between recipe and vault layout. Pass 1 is vault-independent (deterministic, hashable, replayable). Per Ch 22 synthesis, this purity is what makes canonical-state hashing work.
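As an illustration of why purity matters here, the sketch below shows a heavily simplified render() whose output depends only on its two arguments. The `Recipe` fields (`root`, `groupBy`), the reduced `Address` shape, and the `prefix:local` CURIE layout are assumptions made for the example; the real types carry more.

```typescript
interface ConceptIdentity { curie: string; scope: Record<string, unknown> }
interface Recipe { root: string; groupBy?: string } // hypothetical, heavily simplified
interface Address { path: string; wikilinkTarget: string; tags: string[] }

// Pure: the result depends only on (recipe, identity). No vault reads,
// no clock, no randomness, so equal inputs always give an equal address.
function render(recipe: Recipe, identity: ConceptIdentity): Address {
    const separator = identity.curie.indexOf(":");
    if (separator < 0) throw new Error(`not a CURIE: ${identity.curie}`);
    const prefix = identity.curie.slice(0, separator);
    const localId = identity.curie.slice(separator + 1);
    const group = recipe.groupBy !== undefined ? identity.scope[recipe.groupBy] : undefined;
    const folder = typeof group === "string" && group !== ""
        ? `${recipe.root}/${group}`
        : recipe.root;
    const target = `${folder}/${localId}`;
    return { path: `${target}.md`, wikilinkTarget: target, tags: [`framework/${prefix}`] };
}

const identity: ConceptIdentity = { curie: "nist80053:AC-2", scope: { family: "AC" } };
const nested = render({ root: "Frameworks/NIST 800-53 r5", groupBy: "family" }, identity);
const flat = render({ root: "Controls" }, identity);
```

The same identity lands at two different paths under two recipes, while repeated calls with the same recipe return identical addresses, which is the property canonical-state hashing relies on.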

Import has to get all of the following right. The list is in approximate order of how often each item goes wrong:

  • Format diversity — CSV, XLSX, JSON, YAML, OSCAL, RDF, MCP server, scraped HTML, OneNote, Notion export. Each has its own tabular-vs-tree shape and its own dirtiness.
  • Tabular-to-tree depth crossing — most messy sources are flat tables that encode a tree (parent-id columns, dotted IDs, prefix conventions). Recovering the tree is non-trivial and source-specific.
  • Identity stability — the same concept may appear in three sources with three IDs. Tier 1 needs a single canonical identity (CURIE + sha256 CID) per concept; recipe authors decide the canonicalization.
  • Polysemic target choice — even after the source is parsed, the recipe author has to choose how it lands: folder hierarchy? heading hierarchy? tag hierarchy? wikilink-graph? Some composition? See hierarchy primitives.
  • Depth control — the same source supports different vault shapes depending on how deep the user wants the hierarchy materialized. NIST CSF 2.0 imported “two layers deep” produces one vault; “all the way to subcategories” produces a different one. Both are valid.
  • Provenance — Tier 1 needs to record where every concept came from, when, with what version, and (eventually) signed by whom.
  • Re-import — version bumps must produce a diff against the prior import, not a silent overwrite or a duplicate vault.
  • Round-trip — exporting Tier 1 back to source-shape (or to OSCAL, or to SSSOM) must work without information loss for the round-trip subset.
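The re-import requirement in the list above boils down to comparing two snapshots keyed by concept identity. A minimal sketch, assuming each import can be summarized as a map from CURIE to content hash (the `Snapshot` shape and the `diffImports` name are hypothetical):

```typescript
// curie -> content hash, as recorded at import time (hypothetical snapshot shape)
type Snapshot = Map<string, string>;
interface ImportDiff { added: string[]; removed: string[]; changed: string[] }

// Compare the prior import with the new one. Nothing is overwritten here;
// the caller decides what to do with each bucket.
function diffImports(previous: Snapshot, next: Snapshot): ImportDiff {
    const diff: ImportDiff = { added: [], removed: [], changed: [] };
    next.forEach((hash, curie) => {
        const before = previous.get(curie);
        if (before === undefined) diff.added.push(curie);
        else if (before !== hash) diff.changed.push(curie);
    });
    previous.forEach((_hash, curie) => {
        if (!next.has(curie)) diff.removed.push(curie);
    });
    return diff;
}

const previousImport: Snapshot = new Map([["x:A", "h1"], ["x:B", "h2"], ["x:C", "h3"]]);
const nextImport: Snapshot = new Map([["x:A", "h1"], ["x:B", "h9"], ["x:D", "h4"]]);
const importDiff = diffImports(previousImport, nextImport);
```

Unchanged concepts (`x:A` here) fall into no bucket, so a version bump touches only the files that actually differ.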

No off-the-shelf ETL tool does all of these. Tools come close on subsets — dbt nails reproducibility-and-tabular-transform; RML nails schema-mapping-to-graphs; MCP nails external-source-integration. None target Markdown-vault polyhierarchy with provenance and round-trip as a first-class destination.

This is why the import boundary needs first-principles thinking, not a “pick the popular ETL framework and hope” approach.

Crosswalker’s import space breaks into four orthogonal pieces. They compose; users adopt as much or as little as they want.

1. The Tier 1 schema (machine-readable contract)

Single source of truth for what canonical Crosswalker output looks like. Spelled as JSON Schema (or equivalent — CUE, Dhall) so any external tool can validate against it without bespoke parsers. Contains:

  • File-naming rules
  • Frontmatter shape (required + optional fields, types, allowed values)
  • Folder layout conventions (when applicable)
  • Wikilink target shape and resolution rules
  • Provenance fields (source ref, ingestion timestamp, recipe used, content hash)
  • Identity rules (CURIE format, sha256 CID computation)

Anything that emits valid Tier 1 — produced however — is a Crosswalker-conformant import. The schema is the entire interface.
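For external producers, the identity rules are likely the part most worth prototyping early. The sketch below shows one way a producer might derive a CURIE and a sha256 content identifier using Node's built-in crypto module. The CURIE pattern, the sorted-key canonicalization, and the `sha256:` prefix are all assumptions for illustration; the authoritative rules live in the Tier 1 schema.

```typescript
import { createHash } from "node:crypto";

// Build a CURIE from a prefix and a local ID (the pattern is an assumption).
function curieFromPattern(prefix: string, localId: string): string {
    return `${prefix}:${localId.trim()}`;
}

// Serialize with sorted keys so that key order never changes the hash.
function canonicalJson(value: unknown): string {
    if (value === undefined) return "null";
    if (Array.isArray(value)) {
        return `[${value.map((item) => canonicalJson(item)).join(",")}]`;
    }
    if (value !== null && typeof value === "object") {
        const record = value as Record<string, unknown>;
        const entries = Object.keys(record)
            .sort()
            .map((key) => `${JSON.stringify(key)}:${canonicalJson(record[key])}`);
        return `{${entries.join(",")}}`;
    }
    return JSON.stringify(value);
}

// Hash the canonical form. The "sha256:" prefix is a hypothetical convention.
function sha256Cid(content: unknown): string {
    const digest = createHash("sha256").update(canonicalJson(content), "utf8").digest("hex");
    return `sha256:${digest}`;
}

const curie = curieFromPattern("nist80053", " AC-2 ");
const cidA = sha256Cid({ title: "Account Management", family: "AC" });
const cidB = sha256Cid({ family: "AC", title: "Account Management" });
```

Because the serialization sorts keys, `cidA` and `cidB` are equal even though the two objects list their fields in a different order.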

2. The bundled ETL engine (the convenience)

The bundled ETL engine is a lightweight recipe runtime built into the plugin (or accessible via CLI) that handles the common 80% of imports without leaving Obsidian:

  • Common formats out of the box (CSV, XLSX, JSON, YAML)
  • Closed primitive vocabulary for transforms
  • Recipe authoring through the import wizard UI
  • Schema validation on output before write

This is what most users will use most of the time. The engine is not the primitive — it’s just a producer of valid Tier 1, like any other producer. Power users who outgrow it can step into option 3 below without changing how anything downstream of Tier 1 works.

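The engine implementation question (in-plugin TS vs external Python vs hybrid) was resolved on 2026-05-04: pure in-plugin TypeScript for v0.1, with an opt-in external Python producer as the hybrid path for v0.5+. See Challenge 23.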

3. External producers (first-class citizens)

When the bundled engine isn’t enough, for example with a messy source, an unusual transform, or domain-specific cleanup, users run their own toolchain (Python + Polars, dbt, Jupyter notebooks, custom scrapers, MCP servers, AI-agent-driven extraction) and emit Tier 1 directly. They use the schema as the contract.

This path is first-class, not a fallback. Crosswalker explicitly does not try to be the universal ETL engine. It tries to be the universal target.

The user’s own ChunkyCSV and JSONaut are examples of external producers — purpose-built ETL tools that could be configured to emit Tier 1 without being absorbed into Crosswalker’s codebase.

4. The community marketplace (transform once, share forever)

Once an ontology is transformed to Tier 1, it stays transformed. There is no reason every NIST 800-53 user should re-do the import. Crosswalker treats community-shared, pre-transformed Tier 1 bundles as a load-bearing piece of the architecture, not a nice-to-have:

  • Someone (the maintainer of an ontology, or an early adopter) does the transformation work once.
  • They publish the resulting Tier 1 directory as a .zip or git repo.
  • Other users download it and have a working vault in seconds, with no recipe authoring at all.
  • When the upstream ontology updates, the maintainer publishes a new bundle; downstream users pull a diff.

This pattern is the answer to the messy-source problem. The bundled engine handles tree-shaped sources well (JSON, YAML, OSCAL); the marketplace fills the gap for messy tabular sources (the original NIST 800-53 XLSX, MITRE ATT&CK matrices) by community-shared pre-transformation. A user with NIST CSF 2.0 doesn’t need to fight the spreadsheet — they download the pre-transformed bundle.

| Mechanism | Where it lives | Notes |
| --- | --- | --- |
| In-repo registry | Subfolder of Crosswalker repo | Tight integration; CI validates every bundle on merge; PRs from community mix with engine PRs |
| Companion repo | Separate crosswalker-recipes (or crosswalker-bundles) repo | Cleaner separation; independent versioning; easier for non-engine contributors |

Either works; the choice is deferrable. Both are GitHub-based, both are copy-paste-or-clone usable.

Even after a source is parsed, every recipe makes five orthogonal choices about how the source lands as Tier 1:

| Axis | Question | Examples |
| --- | --- | --- |
| Depth | How many levels of the source hierarchy materialize as vault structure? | NIST CSF 2.0: “two layers deep” vs “all the way to subcategories” |
| Mechanism | Which of the four hierarchy primitives carry the structure? | Folder, heading, tag, wikilink-graph, or composition |
| Filter | Which subset of the source is imported? | “Only AC family controls” / “only HIGH-baseline controls” / “all” |
| Granularity | One file per leaf concept, or one file per group with leaves as headings? | Per-control file vs per-family file |
| Projection | Which fields of each concept land as frontmatter, body, wikilink, or are dropped? | Control text → body; control ID → frontmatter; baseline → tag |

These five axes are why “the recipe” is a non-trivial declarative artifact, not a single transform expression. They’re also the reason a single source ontology can produce many legitimately-different vaults — and why Crosswalker can’t just “auto-import” a source without recipe author input.
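One way to picture the five axes is as fields of a declarative recipe object. The type below is a hypothetical sketch, not the recipe schema: field names, the single-field equality filter, and the sink vocabulary are invented for illustration.

```typescript
type Mechanism = "folder" | "heading" | "tag" | "wikilink-graph";
type Sink = "frontmatter" | "body" | "wikilink" | "tag" | "drop";

// One field per axis. Everything is plain data, so a recipe can be
// validated, diffed, and shared like any other document.
interface RecipeAxes {
    depth: number | "all";
    mechanisms: Mechanism[];                          // one or a composition
    filter: "all" | { field: string; equals: string };
    granularity: "file-per-leaf" | "file-per-group";
    projection: Record<string, Sink>;                 // source field -> where it lands
}

const perControl: RecipeAxes = {
    depth: "all",
    mechanisms: ["folder", "tag"],
    filter: { field: "family", equals: "AC" },
    granularity: "file-per-leaf",
    projection: { control_text: "body", control_id: "frontmatter", baseline: "tag" },
};

// Same source, different choices on four of the five axes.
const perFamily: RecipeAxes = {
    ...perControl,
    depth: 2,
    mechanisms: ["heading"],
    filter: "all",
    granularity: "file-per-group",
};

// Apply the filter axis to one concept.
function matchesFilter(recipe: RecipeAxes, concept: Record<string, unknown>): boolean {
    return recipe.filter === "all" || concept[recipe.filter.field] === recipe.filter.equals;
}
```

Both objects describe valid imports of the same catalog; neither is derivable from the source alone, which is why the recipe author has to supply them.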

Underneath the recipe, the bundled engine implements a closed set of transformation primitives. Comprehensive but small. Drawn from RML/YARRRML, FNML, JSONata, dbt, and the user’s prior tooling. Grouped by category:

| Category | Primitives (illustrative) |
| --- | --- |
| Source iteration | iterate-rows, iterate-records, iterate-tree-nodes, iterate-paths, iterate-grouped |
| Identity / ID synthesis | curie-from-pattern, sha256-cid, uuid-v7, slugify, normalize-id |
| Field projection | project, rename, drop, default, coerce-type, parse-date |
| String transforms | trim, lowercase, uppercase, replace, split, join, regex-extract, regex-replace |
| Tree-from-flat | parent-id-to-tree, dotted-id-to-tree, prefix-to-tree, indent-to-tree |
| Joins / lookups | inner-join, left-join, lookup-table, fuzzy-match, alias-resolve |
| Address rendering | folder-path, heading-anchor, tag-path, wikilink-target |
| Validation / guard | require-field, allowed-values, type-check, schema-validate |
| Provenance | record-source-ref, record-version, record-timestamp, record-hash |

A recipe is a pipeline (or DAG) of these primitives. Composition is what makes the engine complete enough — Macro Tree Transducer theory (per Ch 20 deliverable A) tells us roughly 5–6 algebraic operators are sufficient; the ~40 above unfold those operators into named, recipe-author-friendly forms.

The catalog is not finalized. It is the active design surface.
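To give a feel for what one of these primitives involves, here is a possible reading of dotted-id-to-tree: recover a hierarchy from flat IDs whose dots encode nesting. The behaviour shown (ancestors that never appear in the input are synthesized) is an assumption, since the catalog does not yet specify it, and `TreeNode`, `parentOf`, and `dottedIdToTree` are hypothetical names.

```typescript
interface TreeNode { id: string; children: TreeNode[] }

// "1.1.1" hangs under "1.1", which hangs under "1".
function parentOf(id: string): string | undefined {
    const cut = id.lastIndexOf(".");
    return cut === -1 ? undefined : id.slice(0, cut);
}

// Build the tree in one pass; input order does not matter.
function dottedIdToTree(ids: string[]): TreeNode[] {
    const nodes = new Map<string, TreeNode>();
    const roots: TreeNode[] = [];
    const ensure = (id: string): TreeNode => {
        const existing = nodes.get(id);
        if (existing !== undefined) return existing;
        const node: TreeNode = { id, children: [] };
        nodes.set(id, node);
        const parentId = parentOf(id);
        if (parentId === undefined) roots.push(node);
        else ensure(parentId).children.push(node); // missing ancestors are synthesized
        return node;
    };
    ids.forEach((id) => ensure(id));
    return roots;
}

const tree = dottedIdToTree(["1.1.1", "1", "1.2", "2", "1.1"]);
```

Real sources rarely use dots this cleanly, which is why the catalog lists parent-id, prefix, and indent variants alongside this one.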

Several of the Ch 20 deliverables recommended YARRRML as the surface DSL for recipes. This is a real choice the user has flagged for further evaluation. Here’s the plain-English version:

YARRRML is a YAML format for writing data-mapping recipes. You write a YAML file that says “this column in my CSV maps to this property in the output, with this transform.” That’s it.

The complicated-looking name comes from “YAML for RDF Mapping Language” — its original use was generating RDF triples. But the structure (sources / mappings / transforms / sinks) is widely useful even if you ignore the RDF heritage and retarget to Markdown notes.

A toy YARRRML-shaped recipe (not real syntax — illustrative):

sources:
  controls: { type: csv, file: NIST_800-53_r5.csv }

mappings:
  control:
    iterator: $.controls.rows
    target:
      file: "Frameworks/NIST 800-53 r5/{{family}}/{{control_id}}.md"
      frontmatter:
        title: "{{control_id}} — {{control_name}}"
        family: "{{family}}"
        baseline: "{{baseline}}"
        tags:
          - "framework/nist-800-53-r5"
          - "framework/nist-800-53-r5/{{family|lowercase}}"
      body: |
        ## Control text
        {{control_text}}

Five things to notice:

  1. Sources: declare inputs once, by type and location.
  2. Mappings: for each row/record/node, declare the output target.
  3. Templates: {{...}} interpolates fields with optional filters (|lowercase).
  4. Sinks: file, frontmatter, body, tags. Closed vocabulary; same vocabulary covers any domain.
  5. No code: this is data, not Python. An AI agent can read, generate, modify, and validate it.
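The template mechanism in point 3 is small enough to sketch. The following is an illustrative interpreter for the `{{field|filter}}` form used in the toy recipe, not YARRRML's actual semantics and not the bundled engine's filter set: `interpolate` is a hypothetical name and the three filters are examples only.

```typescript
type Row = Record<string, unknown>;

// Closed filter vocabulary; the three names here are examples only.
const filters: Record<string, (value: string) => string> = {
    lowercase: (value) => value.toLowerCase(),
    uppercase: (value) => value.toUpperCase(),
    trim: (value) => value.trim(),
};

// Replace each {{field|filter|...}} with the row's value piped through the filters.
// Missing fields and unknown filters raise errors so they surface per row.
function interpolate(template: string, row: Row): string {
    return template.replace(/\{\{([^{}]+)\}\}/g, (_whole: string, expr: string) => {
        const [field, ...names] = expr.split("|").map((part) => part.trim());
        const raw = row[field];
        if (raw === undefined || raw === null) throw new Error(`missing field: ${field}`);
        return names.reduce((value, name) => {
            const filter = filters[name];
            if (filter === undefined) throw new Error(`unknown filter: ${name}`);
            return filter(value);
        }, String(raw));
    });
}

const sampleRow: Row = { family: "AC", control_id: "AC-2", control_name: "Account Management" };
const tag = interpolate("framework/nist-800-53-r5/{{family|lowercase}}", sampleRow);
const fileName = interpolate("{{family}}/{{control_id}}.md", sampleRow);
```

Because the filter table is closed, a validator can reject a recipe that names an unknown filter before any data is read.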

Why it’s a strong candidate: 12+ years of refinement; mature tooling (Matey editor, parsers); tree-shaped recipes are easier to validate, diff, version, and share than imperative scripts. Anyone — author, agent, validator — can reason about a YAML file mechanically.

Why it’s not a foregone conclusion: YARRRML’s RDF heritage may bring vocabulary baggage; retargeting to Markdown sinks needs a dialect; a custom Crosswalker-native DSL might fit better; or JSON Schema + plain JSONata might be enough without the YARRRML wrapper.

A dedicated explainer page lives at agent-context/agent-tooling/yarrrml-explained (TODO — coming as the Ch 23 research and recipe-DSL design firm up).

How this connects to the rest of Crosswalker

Edge-level concerns (STRM predicates, SSSOM rows, junction notes, ontology diff atoms) operate on concept identity and don’t care which import mechanism produced the vault. The single coupling point is the address-rendering function that turns a concept identity into a wikilink target string given the recipe’s hierarchy-primitive choices. See hierarchy primitives for that separation.

This means the import-side architecture (this page) is strictly orthogonal to the edge-side architecture. A user could import the same NIST 800-53 catalog with three completely different recipes and three completely different vault layouts; the resulting STRM/SSSOM/junction-note structure would be semantically identical.

Where this came from — broader portfolio context

Crosswalker’s stance on import — “schema as primitive, ETL as convenience, marketplace as community escape hatch” — emerged from a portfolio of related tools the user has built over time, all aimed at applied information science with Obsidian as the platform:

  • ChunkyCSV — CSV-shaped tabular ETL with tree transducer flavor. Precedent for the “tabular sources need first-class tree-recovery” axis.
  • JSONaut — declarative JSON transformation. Precedent for the “external producers should be able to emit Tier 1 without bespoke Crosswalker code” axis.
  • SEACOW — meta-framework for knowledge organization in filesystem primitives. Precedent for the “primary folder + parallel tag hierarchy” composition pattern.
  • folder-tag-sync — Obsidian plugin that bidirectionally syncs folder hierarchy with tag hierarchy. Precedent for live composition rules between two of the four hierarchy primitives.

Crosswalker is not these tools rebranded — it is the general ingestion target their patterns point at. The marketplace pattern in particular is the answer to the question that ChunkyCSV and JSONaut individually pose: “we keep doing this transform every time someone imports the same source — why?”

Items the architecture has named but not yet committed:

| Question | Default lean | Status |
| --- | --- | --- |
| Bundle engine implementation language | Pure TypeScript in-plugin (Path A) for v0.1; Hybrid (Path C: opt-in external Python producer) for v0.5+ | RESOLVED 2026-05-04; see Ch 23 synthesis log. Mobile-Obsidian portability + small-OSS contributor pool forced the verdict |
| Surface DSL flavor | YARRRML-shaped (declarative); v0.1 may ship a simpler subset | Open |
| Marketplace mechanism | Either in-repo registry or companion repo; both viable | Deferrable |
| Hardcoded vs declarative dialect spec | Declarative (the dialect IS data even when only one ships) | Provisionally settled |
| target_schema as JSON Schema (machine-readable) vs prose | Machine-readable | Settled |
| Recipe schema runtime-agnostic | JSON Schema + AJV; JSONata 2.x as expression sub-language; engine implementations swappable without breaking user files | Settled 2026-05-04 per Ch 23 §4; the single most important architectural commitment of the design phase |
| External-producer protocol surface (push-into-Crosswalker via MCP) | Defer until Tier 1 schema lands | Open |

See the 2026-05-04 import engine design log for the canonical state of these decisions.

Agent context:

  • v0.1 schema spec — current Tier 1 contract; the load-bearing primitive this page describes
  • Vision — schema-as-primitive is the central architectural commitment
  • Tradeoffs — convenience-vs-canonical, bundled-engine-vs-external-producer
