ETL and import — schema as primitive, ETL as convenience
The import side is the hardest part of Crosswalker. Everything a user wants to do — query, crosswalk, evidence, version, share — flows through whatever they already have, in whatever shape it arrives, and lands in the Tier 1 schema. Get the import boundary right and the rest of the architecture works. Get it wrong and no amount of cleverness downstream rescues it.
This page frames how Crosswalker thinks about that boundary. For the system-wide picture (all six layers, not just import), see the system architecture page.
The reframe — schema is the primitive, ETL is convenience
Most ingestion tools start with the engine and treat the output schema as a configuration knob. Crosswalker inverts that:
The Tier 1 schema is the primitive. Crosswalker is, before anything else, an ingestion target — a precise, machine-readable contract that says what canonical Markdown + frontmatter + folder layout + wikilinks looks like. The engine that produces conforming output is convenience. Anyone can produce conforming output any way they like.
Architectural precedents for this stance: HTML, JSON Schema, OpenAPI, SBOM (SPDX/CycloneDX), CSVW, Markdown itself. Each is a receiving format — the spec is the load-bearing artifact; the producer ecosystem grew around it.
Translated to Crosswalker:
| Layer | Status |
|---|---|
| Tier 1 schema (the contract) | Load-bearing primitive. Machine-readable JSON Schema. |
| Bundled ETL engine (the convenience) | Optional implementation that produces Tier 1 from common source shapes. |
| External producers (custom code, dbt, Python, scrapers, MCP servers) | First-class citizens. Anything that emits valid Tier 1 is welcome. |
| Community marketplace (pre-transformed bundles) | Once an ontology is shaped to Tier 1, it stays shaped. Communities share .zips of Tier 1 directories. |
This reframe collapses several open architectural questions (build vs buy, DSL choice, in-plugin vs external runtime, recipe location) into one commitment. The engine becomes swappable; the schema does not.
The two-mode architecture — bundled projector or direct emission
Updated 2026-05-05 to clarify how external ETL tools (ChunkyCSV, JSONaut, Polars/DuckDB scripts, dbt, etc.) compose with the bundled engine. See the 2026-05-05 two-mode architecture decision log for the full rationale.
The bundled engine has one job: take structured rows + a recipe, and emit Tier 1 Markdown. That’s it. Any producer — including users with no external tooling, ChunkyCSV pipelines, dbt projects, AI agents, MCP servers — chooses between two entry points:
| Mode | Entry point | What the producer does | Streaming responsibility |
|---|---|---|---|
| Mode 1 — Bundled projector | Hand structured rows (CSV/JSON/XML/XLSX) + recipe to the bundled engine | Produces a structured input the engine knows how to iterate | Bundled engine streams per-row through render() → write → discard. Producer can also stream upstream (ChunkyCSV → CSV file) — both layers stream. |
| Mode 2 — Direct emission | Bypass the bundled engine; write Tier 1 Markdown files into the vault | Produces Tier 1 Markdown end-to-end | Producer’s responsibility entirely (typical: per-concept Markdown emission already streams naturally) |
Both modes are first-class architectural citizens per the schema-as-primitive commitment. The schema (Tier 1) is the contract; the engine is convenience.
Why both modes matter
Mode 1 lets ChunkyCSV / JSONaut / dbt / Polars scripts compose naturally. They already produce CSV/JSON. Crosswalker just consumes that. The user doesn’t have to teach their existing ETL tools how to emit Markdown — they keep doing what they’re good at, and the bundled engine handles the projection layer.
Mode 2 lets AI agents and bundled-marketplace publishers emit Tier 1 end-to-end. When an agent extracts an ontology from a corpus, it produces concepts directly — there’s no natural intermediate “rows” representation. Skipping the bundled engine is the right call.
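As a rough sketch of what Mode 2 involves, a producer can assemble the note text itself and write it wherever the vault lives. The `Concept` type and the frontmatter keys used here (`curie`, `title`, `source_ref`, `ingested_at`) are hypothetical placeholders; the fields a conforming note actually needs are whatever spec/tier1.schema.json defines.

```typescript
// Hypothetical concept shape; the authoritative field list lives in the Tier 1 schema.
interface Concept {
  curie: string; // assumed identity field
  title: string;
  sourceRef: string; // assumed provenance field
  body: string;
}

// Quote a string as a YAML-safe scalar (a JSON string is a valid YAML double-quoted scalar).
function yamlString(value: string): string {
  return JSON.stringify(value);
}

// Build one note: YAML frontmatter between --- fences, then the Markdown body.
function emitNote(concept: Concept, ingestedAt: string): string {
  const frontmatter = [
    `curie: ${yamlString(concept.curie)}`,
    `title: ${yamlString(concept.title)}`,
    `source_ref: ${yamlString(concept.sourceRef)}`,
    `ingested_at: ${yamlString(ingestedAt)}`,
  ].join("\n");
  return `---\n${frontmatter}\n---\n\n${concept.body}\n`;
}

const note = emitNote(
  {
    curie: "nist80053:AC-2",
    title: "Account Management",
    sourceRef: "agent-extraction",
    body: "Manage system accounts.",
  },
  "2026-05-05T00:00:00Z",
);
console.log(note);
```

Nothing in this path touches the bundled engine; the producer only has to satisfy the schema.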
Streaming is at the engine boundary, not before it
The bundled engine is streaming-by-design (per v0.1.4.5 streaming refactor): it accepts rows as an iterator/async-iterator, calls render() per row, writes the file, and discards the row before reading the next one. The full source dataset never exists in RAM at once.
This is why Mode 1 + ChunkyCSV / JSONaut composes well: ChunkyCSV streams from a multi-gigabyte source, emits a smaller-but-still-streaming CSV, the bundled engine streams through that without accumulating, and the resulting Tier 1 vault is materialized on disk file-by-file. End-to-end streaming pipeline.
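The render → write → discard loop can be sketched as below. This is a synchronous illustration under stated assumptions: `sourceRows` stands in for a streaming parser, `render` is a simplified placeholder for the real render() signature, and the `write` callback replaces actual vault I/O. With an async source the same loop shape would use `for await` over an `AsyncIterable`.

```typescript
type Row = Record<string, string>;

// Stand-in for a streaming parser: yields rows lazily and counts how many were pulled.
let pulled = 0;
function* sourceRows(total: number): Generator<Row> {
  for (let i = 1; i <= total; i++) {
    pulled++;
    yield { control_id: `AC-${i}`, title: `Control ${i}` };
  }
}

// Simplified placeholder for the render step: row -> { path, content }.
function render(row: Row): { path: string; content: string } {
  return { path: `${row.control_id}.md`, content: `# ${row.title}\n` };
}

// Project rows one at a time: render, write, then let the row go before pulling the next.
function project(rows: Iterable<Row>, write: (path: string, content: string) => void): number {
  let written = 0;
  for (const row of rows) {
    const file = render(row);
    write(file.path, file.content);
    written++;
  }
  return written;
}

// Record how many rows had been pulled from the source at the moment each file was written.
const pulledAtWrite: number[] = [];
const count = project(sourceRows(3), () => pulledAtWrite.push(pulled));
console.log(count, pulledAtWrite);
```

Because the source is a generator, each write happens when exactly one more row has been pulled, which is the property that keeps memory flat regardless of source size.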
What “structured” actually means — the Mode 1 input contract
A common follow-up question once the two-mode architecture clicks: how does the bundled engine know an input is “clean enough”? Is there an encoding requirement? A specific JSON shape? An XML schema?
Honest answer: the contract is shape-level, not format-level. The engine consumes an iterable of records. Each record is {columnName: value}. Columns are referenced by name in the recipe’s template strings ({control_id}, {family.title}). That’s the entire contract.
The bundled engine doesn’t validate the input against a “Tier 0.5 schema” because no Tier 0.5 schema exists (2026-05-05 two-mode architecture decision §“No Tier 0.5” confirms this). The engine just calls render(recipe, identityFromRow) on each record and writes the resulting Tier 1 file. Per-row failures (missing template variable, malformed encoding, validation rejection) become per-row errors; the import continues.
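A minimal sketch of name-based template resolution, assuming single-brace placeholders with dotted access as in the examples above. The helper names (`resolvePath`, `renderTemplate`) are illustrative and are not the engine's actual API; filters are left out for brevity.

```typescript
type RecordValue = string | number | { [key: string]: RecordValue };
type SourceRecord = { [column: string]: RecordValue };

// Resolve "family.title" against a record by walking one key at a time.
function resolvePath(record: SourceRecord, path: string): RecordValue | undefined {
  let current: RecordValue | undefined = record;
  for (const key of path.split(".")) {
    if (current === undefined || typeof current !== "object") return undefined;
    current = current[key];
  }
  return current;
}

// Replace each {name} placeholder; a missing variable fails this record only.
function renderTemplate(template: string, record: SourceRecord): string {
  return template.replace(/\{([\w.]+)\}/g, (_match: string, path: string) => {
    const value = resolvePath(record, path);
    if (value === undefined || typeof value === "object") {
      throw new Error(`template variable {${path}} not found in record`);
    }
    return String(value);
  });
}

const sample: SourceRecord = { control_id: "AC-2", family: { title: "Access Control" } };
console.log(renderTemplate("{family.title}/{control_id}.md", sample)); // Access Control/AC-2.md
```

The thrown error is what a caller would catch and record as a per-row failure while the import continues.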
Shape requirements
Required:
- Stable columns across records — every row has the same set of keys (missing keys default to undefined, which produces empty values)
- Recipe-referenced columns are present — if a template uses {control_id}, the source must have a control_id column
- Values addressable by recipe templates — flat strings are simplest; nested objects work via dotted access; arrays need iterator declarations in the recipe
- Encoding: UTF-8 — every text format (CSV, JSON, XML, XLSX-as-CSV-export) should arrive UTF-8
That’s it. No strict schema validation upstream of the engine.
Format-by-format guidance for Mode 1 producers
| Format | When to use | What “clean enough” looks like |
|---|---|---|
| CSV | Tabular, flat, common ETL output, Excel exports, ChunkyCSV pipes | UTF-8; first row is header; consistent delimiter (PapaParse auto-detects comma, semicolon, tab, and pipe); RFC 4180 quoting for embedded commas/quotes; one logical record per row; no merged cells |
| JSON (array of records) | Structured records, optionally nested, JSONaut output, dbt JSON exports | UTF-8; top-level array of objects; consistent record shape; nested objects addressable via {a.b.c} templates |
| JSON (single object with array property) | OSCAL bundles, deeply nested ontologies | UTF-8; recipe declares the iterator path (e.g., source.iterator: $.catalog.controls[*] — JSONata syntax in v0.1+) |
| XLSX | Excel-native sources (NIST 800-53 catalog as published) | First row is header; one sheet (recipe specifies which); flat cells — no merged cells, no formula references in cell values, no embedded sub-tables |
| XML / RDF / OWL | RDF/OWL ontologies, ISO XML feeds | Not v0.1. Pre-convert to CSV or JSON via external tooling (the user’s SEACOW / JSONaut handle this). Native XML/RDF parsing deferred to v0.2+. |
Is JSON “the way to go” over CSV?
Depends on shape. CSV is fine — and often better — when:
- The source is naturally tabular (compliance frameworks: control_id, family, title, baseline, … fit a CSV grid perfectly)
- You’re already producing CSV via Excel export, ChunkyCSV pipes, or SQL-warehouse export
- Streaming matters (CSV streams trivially line-by-line; JSON streaming requires stream-json or similar)
- File size is moderate-to-large (CSV is denser than JSON for tabular data — no key repetition per record)
JSON is better when:
- Records have nested objects ({ control: { id, title, subcontrols: [...] } })
- One source contains multiple record types (OSCAL bundle = catalog + controls + groups + parameters + back-matter — recipes pick which to iterate)
- Values are naturally arrays (tags, parent CURIEs, related-control lists)
- The source’s authoritative published format is JSON (OSCAL, JSON-LD ontologies, MITRE STIX exports)
For NIST 800-53 r5: CSV is fine (columns: control_id, family, title, baseline, …). For NIST OSCAL JSON catalogs: JSON is native. For ISO 27001 controls: CSV if you have a tabular export; XLSX otherwise. For MITRE ATT&CK: JSON (STIX) is native; CSV exports also exist.
The recipe references columns by name regardless of format. The bundled engine has parsers for each. Format choice is a producer-side decision; the engine adapts.
Why no upfront schema validation?
Two reasons:
- Adding a schema for the input format would create a Tier 0.5 contract that producers have to satisfy. That violates the schema-as-primitive commitment (Tier 1 is the only contract). External producers would have to learn two schemas instead of one.
- Per-row error handling is cleaner than upfront rejection. If a 50,000-row CSV has 3 malformed rows, the engine processes 49,997 rows and reports the 3 failures. An upfront schema would either reject the whole file or silently filter — both worse outcomes.
The schema validation that does happen is at the output boundary: every row’s rendered Tier 1 frontmatter is validated against spec/tier1.schema.json before being written (v0.1.4 strict-mode validation). Bad rows produce errors; good rows produce files. The contract is enforced where it matters.
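The "report failures, keep going" behavior can be sketched as follows. `validateFrontmatter` is a deliberately tiny stand-in for the real output gate (which validates against spec/tier1.schema.json), and the required `id` field is an assumption made only for this illustration.

```typescript
type Row = Record<string, string | undefined>;
interface RowError { rowNumber: number; reason: string }

// Stand-in for the output-boundary check; returns a reason on failure, null on success.
function validateFrontmatter(frontmatter: Record<string, string | undefined>): string | null {
  return frontmatter.id ? null : "missing required field: id";
}

// Process every row; failures are recorded and the import keeps going.
function importRows(rows: Iterable<Row>): { written: string[]; errors: RowError[] } {
  const written: string[] = [];
  const errors: RowError[] = [];
  let rowNumber = 0;
  for (const row of rows) {
    rowNumber++;
    const frontmatter = { id: row.control_id, title: row.title };
    const problem = validateFrontmatter(frontmatter);
    if (problem) {
      errors.push({ rowNumber, reason: problem });
      continue; // skip the write, move on to the next row
    }
    written.push(`${row.control_id}.md`);
  }
  return { written, errors };
}

const result = importRows([
  { control_id: "AC-1", title: "Policy" },
  { title: "Row with no id" },
  { control_id: "AC-2", title: "Account Management" },
]);
console.log(result.written, result.errors);
```

The bad second row produces one error entry with its row number, and both good rows still produce files.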
Why ChunkyCSV / JSONaut are the right partners for messy sources
The user’s existing tools fit this gap precisely. They specialize in input cleanup — turning a multi-GB messy XLSX into a streaming UTF-8 CSV with stable columns and proper RFC 4180 quoting. The bundled engine specializes in projection — turning structured rows into Tier 1 vault notes per a recipe. Each layer does what it’s good at; the boundary between them is “an iterable of records with stable columns.” That’s the entire contract — and it’s intentionally narrow so external tools don’t need Crosswalker-specific knowledge.
Transform engine depth — where the in-plugin work stops
A natural follow-on question once Mode 1’s input contract is clear: how much of the messy-source-cleanup problem does Crosswalker try to solve in-plugin? Decided 2026-05-05 (see transform-engine-depth and input-formats decision log + Ch 26):
| Phase | Wizard offers | Recipe author authors | External tool handles |
|---|---|---|---|
| v0.1 (shipped) | Column-role config | Closed 7-filter templates | Everything else |
| v0.2 | + Column rename, value trim, regex extract, simple split | Same templates | Conditional logic, joins, lookups, flat-to-tree, fuzzy matching |
| v0.3 | + JSONata expression cells for advanced users | + JSONata 2.x expressions inline ({(baseline = "HIGH") ? "high/" : "mod/"}{control_id}.md) | Multi-source joins, time-series, complex pipelines |
Why we don’t port JSONaut / ChunkyCSV: JSONata 2.x is a TS-native, well-maintained library covering ~80% of JSONaut’s declarative-transformation feature set. JSONata is the Ch 23 §6 commitment. ChunkyCSV’s CSV-streaming + multi-GB-cleanup features stay external — that’s literally what ChunkyCSV is for, and forcing it into the plugin would be redundant. Both stay first-class Mode 1 feeders.
Why we don’t build a transform IDE: an in-plugin transform IDE would try to bundle live preview, debugging, profiling, output diff, joins UI, conditional logic UI into one experience. Each piece has a simpler v0.2/v0.3/external alternative (wizard preview, JSONata playground, generation-engine error reports, git diff, JSONata lookup{} syntax). Building the IDE is huge scope drift — could exceed Crosswalker itself.
v0.1 → v0.3+ input format roster
The bundled engine’s input formats and the rationale per format (decided 2026-05-05):
| Format | Phase | Why | Stream? |
|---|---|---|---|
| CSV | v0.1 (shipped) | Universal; PapaParse handles encoding + delimiter auto-detect; ChunkyCSV’s natural output | ✅ via PapaParse step + v0.1.4.5 streaming refactor |
| JSONL (newline-delimited JSON) | v0.2 | Stream-friendly (line-by-line); native types + nesting; produced by BigQuery, Spark, Databricks, dbt, JSONaut natively | ✅ trivial — split on \n, JSON.parse per line |
| JSON with iterator path | v0.2 | OSCAL bundles, deeply-nested ontologies; recipe declares source.iterator: $.catalog.controls[*] (JSONata path) | ✅ via stream-json |
| XLSX | v0.3+ (already partial) | Excel-native sources (NIST 800-53 r5 catalog ships as XLSX) | ⚠ sheet-by-sheet via xlsx package |
| XML / RDF / OWL | v0.3+ if demand justifies | RDF/OWL ontologies, ISO XML feeds | ⚠ requires sax-style streaming; defer to v0.3+ |
JSONL is the v0.2 priority because it’s the genuine midpoint between “raw user CSV” and “fully-cleaned-input-ready-for-the-bundled-engine”:
| Format | Streamable | Schema fidelity | ETL ecosystem | Complexity to integrate |
|---|---|---|---|---|
| CSV | ✅ Trivial | ❌ Strings only | ✅ Universal | Done |
| JSONL | ✅ Trivial | ✅ Native types + nesting | ✅ Common (modern data warehouses) | Low (~50 LOC) |
| Plain JSON array | ❌ Whole-array parse | ✅ Native | ✅ Common | Medium |
| JSON with iterator | ✅ via stream-json | ✅ Native + multi-iterator | Limited | Medium |
| XLSX | ⚠ Sheet-by-sheet | ⚠ Type-mixed | ✅ Excel-native | Medium |
JSONL beats CSV on nested objects, native types, and nullable fields, and beats plain JSON on streaming because no whole-file parse is needed. It also aligns with the ETL ecosystem (modern data warehouses emit JSONL natively) and needs only ~50 LOC to implement, which makes it a big composability win.
For JSON with iterator path (single object containing an array, like OSCAL): support via recipe source.iterator: $.catalog.controls[*] (JSONata-shaped path expression). Engine uses stream-json to navigate lazily.
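To make the "split on \n, JSON.parse per line" claim concrete, here is a small sketch of a JSONL reader over text chunks. It is an illustration rather than the planned v0.2 implementation; `parseJsonl` is a hypothetical name, and a real reader would catch parse failures and report them per row instead of throwing.

```typescript
type JsonRecord = Record<string, unknown>;

// Turn arbitrary text chunks into parsed records, one JSON document per line.
// A chunk boundary may fall mid-line, so the unfinished tail is carried over.
function* parseJsonl(chunks: Iterable<string>): Generator<JsonRecord> {
  let buffer = "";
  for (const chunk of chunks) {
    buffer += chunk;
    const lines = buffer.split("\n");
    buffer = lines.pop() ?? ""; // last piece may be an incomplete line
    for (const line of lines) {
      if (line.trim() !== "") yield JSON.parse(line) as JsonRecord;
    }
  }
  // Final line without a trailing newline.
  if (buffer.trim() !== "") yield JSON.parse(buffer) as JsonRecord;
}

// The second record is deliberately split across two chunks.
const chunks = ['{"control_id":"AC-1","tags":["a"]}\n{"control_', 'id":"AC-2","baseline":null}\n'];
const records = [...parseJsonl(chunks)];
console.log(records.length, records[1].control_id);
```

Arrays and nulls arrive with their native types, which is the schema-fidelity advantage over CSV noted in the table.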
What ParsedData is (and isn’t)
The bundled engine’s TypeScript code uses a small interface called ParsedData for the wizard’s preview step:
| Concept | What it IS | What it ISN’T |
|---|---|---|
| ParsedData | An in-memory TypeScript interface used by the wizard for preview + by the engine as the “structured rows” input shape. May wrap a complete array (small data) OR an async iterator (large data). | A persisted intermediate file format. NOT a tier. NOT something external producers consume or emit. (External producers emit CSV/JSON/XML/XLSX — formats the engine knows how to parse — or Tier 1 Markdown directly.) |
| Tier 1 | The canonical Markdown vault format on disk, conforming to spec/tier1.schema.json. The load-bearing contract. | A serialization-only artifact — it’s the shared vocabulary every producer (bundled engine, external CLI, AI agent) must produce. |
ParsedData is just how the engine internally represents “the rows to iterate.” It’s not part of the architectural contract.
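One way such an interface could look is sketched below. The field names (`columns`, `rows`) and the `iterateRows` helper are guesses made for illustration; the plugin's real ParsedData definition may differ.

```typescript
type Row = Record<string, unknown>;

// Hypothetical shape: either a fully loaded array (small data) or a lazy async source (large data).
interface ParsedData {
  columns: string[];
  rows: Row[] | AsyncIterable<Row>;
}

// Normalize both variants to a single async iteration path.
async function* iterateRows(data: ParsedData): AsyncGenerator<Row> {
  if (Array.isArray(data.rows)) {
    for (const row of data.rows) yield row;
  } else {
    for await (const row of data.rows) yield row;
  }
}

// Small helper used here only to show both variants behave the same.
async function collect(data: ParsedData): Promise<Row[]> {
  const out: Row[] = [];
  for await (const row of iterateRows(data)) out.push(row);
  return out;
}

collect({ columns: ["id"], rows: [{ id: 1 }, { id: 2 }] }).then((rows) => console.log(rows.length));
```

Since the shape never leaves process memory, changing it has no effect on external producers.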
How Mode 1 works inside the plugin (v0.1 implementation)
The bundled engine’s per-row pipeline (after Ch 22 + Ch 23 + Ch 24 design phase, implemented in v0.1.2 + v0.1.3 + v0.1.4, with streaming wired in v0.1.4.5):
render() is the single coupling point between recipe and vault layout. Pass 1 is vault-independent (deterministic, hashable, replayable). Per Ch 22 synthesis, this purity is what makes canonical-state hashing work.
Why import is the hardest part
A bullet list of what import has to do correctly, in approximate order of how often it goes wrong:
- Format diversity — CSV, XLSX, JSON, YAML, OSCAL, RDF, MCP server, scraped HTML, OneNote, Notion export. Each has its own tabular-vs-tree shape and its own dirtiness.
- Tabular-to-tree depth crossing — most messy sources are flat tables that encode a tree (parent-id columns, dotted IDs, prefix conventions). Recovering the tree is non-trivial and source-specific.
- Identity stability — the same concept may appear in three sources with three IDs. Tier 1 needs a single canonical identity (CURIE + sha256 CID) per concept; recipe authors decide the canonicalization.
- Polysemic target choice — even after the source is parsed, the recipe author has to choose how it lands: folder hierarchy? heading hierarchy? tag hierarchy? wikilink-graph? Some composition? See hierarchy primitives.
- Depth control — the same source supports different vault shapes depending on how deep the user wants the hierarchy materialized. NIST CSF 2.0 imported “two layers deep” produces one vault; “all the way to subcategories” produces a different one. Both are valid.
- Provenance — Tier 1 needs to record where every concept came from, when, with what version, and (eventually) signed by whom.
- Re-import — version bumps must produce a diff against the prior import, not a silent overwrite or a duplicate vault.
- Round-trip — exporting Tier 1 back to source-shape (or to OSCAL, or to SSSOM) must work without information loss for the round-trip subset.
No off-the-shelf ETL tool does all of these. Tools come close on subsets — dbt nails reproducibility-and-tabular-transform; RML nails schema-mapping-to-graphs; MCP nails external-source-integration. None target Markdown-vault polyhierarchy with provenance and round-trip as a first-class destination.
This is why the import boundary needs first-principles thinking, not a “pick the popular ETL framework and hope” approach.
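The tabular-to-tree item above is the easiest one to show in code. The sketch below recovers a tree from dotted IDs only; parent-id columns and prefix conventions need their own variants. The IDs and the `dottedIdToTree` name are made up for the example.

```typescript
interface FlatRow { id: string; title: string }
interface TreeNode { id: string; title: string; children: TreeNode[] }

// Parent of "A.1.1" is "A.1"; top-level ids have no parent.
function parentId(id: string): string | null {
  const cut = id.lastIndexOf(".");
  return cut === -1 ? null : id.slice(0, cut);
}

// Recover the tree that dotted ids encode. Rows whose parent is absent become roots.
function dottedIdToTree(rows: FlatRow[]): TreeNode[] {
  const nodes = new Map<string, TreeNode>();
  for (const row of rows) nodes.set(row.id, { ...row, children: [] });

  const roots: TreeNode[] = [];
  for (const node of nodes.values()) {
    const parentKey = parentId(node.id);
    const parent = parentKey === null ? undefined : nodes.get(parentKey);
    if (parent) parent.children.push(node);
    else roots.push(node);
  }
  return roots;
}

const tree = dottedIdToTree([
  { id: "A", title: "Family A" },
  { id: "A.1", title: "Control A.1" },
  { id: "A.1.1", title: "Enhancement A.1.1" },
  { id: "B", title: "Family B" },
]);
console.log(tree.map((node) => node.id));
```

Note that this version holds all rows in memory to link parents and children, which is one reason tree recovery is harder to stream than flat projection.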
The four architectural pieces
Crosswalker’s import space breaks into four orthogonal pieces. They compose; users adopt as much or as little as they want.
1. The Tier 1 schema (machine-readable contract)
Single source of truth for what canonical Crosswalker output looks like. Spelled as JSON Schema (or equivalent — CUE, Dhall) so any external tool can validate against it without bespoke parsers. Contains:
- File-naming rules
- Frontmatter shape (required + optional fields, types, allowed values)
- Folder layout conventions (when applicable)
- Wikilink target shape and resolution rules
- Provenance fields (source ref, ingestion timestamp, recipe used, content hash)
- Identity rules (CURIE format, sha256 CID computation)
Anything that emits valid Tier 1 — produced however — is a Crosswalker-conformant import. The schema is the entire interface.
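As an illustration of the identity rules, the sketch below checks a CURIE-like string and computes a SHA-256 digest of normalized content using Node's built-in crypto module. Both the regular expression and the normalization steps (line endings, Unicode NFC) are assumptions; the actual CURIE format and CID computation are defined by the schema, and a plain hex digest is a simplification of whatever encoding the CID uses.

```typescript
import { createHash } from "node:crypto";

// Assumed CURIE shape: lowercase prefix, colon, local id without whitespace.
const CURIE_PATTERN = /^[a-z][a-z0-9_.-]*:\S+$/;

function isCurie(value: string): boolean {
  return CURIE_PATTERN.test(value);
}

// Hash normalized content so the same text yields the same digest on every platform.
function contentHash(text: string): string {
  const normalized = text.replace(/\r\n/g, "\n").normalize("NFC");
  return createHash("sha256").update(normalized, "utf8").digest("hex");
}

console.log(isCurie("nist80053:AC-2"), isCurie("AC-2"));
console.log(contentHash("line one\r\nline two") === contentHash("line one\nline two"));
```

Normalizing before hashing matters because a producer on Windows and one on Linux should agree on the identity of the same concept.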
2. The bundled ETL engine (convenience)
A lightweight recipe runtime built into the plugin (or accessible via CLI) that handles the common 80% of imports without leaving Obsidian:
- Common formats out of the box (CSV, XLSX, JSON, YAML)
- Closed primitive vocabulary for transforms
- Recipe authoring through the import wizard UI
- Schema validation on output before write
This is what most users will use most of the time. The engine is not the primitive — it’s just a producer of valid Tier 1, like any other producer. Power users who outgrow it can step into option 3 below without changing how anything downstream of Tier 1 works.
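The exact engine implementation (in-plugin TS vs external Python vs hybrid) was open research under Challenge 23 and was resolved on 2026-05-04: pure in-plugin TypeScript (Path A) for v0.1, with a hybrid opt-in external Python producer (Path C) for v0.5+.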
3. External producers (custom transforms)
When the bundled engine isn’t enough — messy source, unusual transform, domain-specific cleanup — users run their own toolchain (Python + Polars, dbt, Jupyter notebooks, custom scrapers, MCP servers, AI-agent-driven extraction) and emit Tier 1 directly. They use the schema as the contract.
This path is first-class, not a fallback. Crosswalker explicitly does not try to be the universal ETL engine. It tries to be the universal target.
The user’s own ChunkyCSV and JSONaut are examples of external producers — purpose-built ETL tools that could be configured to emit Tier 1 without being absorbed into Crosswalker’s codebase.
4. The community marketplace (transform once, share forever)
Once an ontology is transformed to Tier 1, it stays transformed. There is no reason every NIST 800-53 user should re-do the import. Crosswalker treats community-shared, pre-transformed Tier 1 bundles as a load-bearing piece of the architecture, not a nice-to-have:
- Someone (the maintainer of an ontology, or an early adopter) does the transformation work once.
- They publish the resulting Tier 1 directory as a .zip or git repo.
- Other users download it and have a working vault in seconds, with no recipe authoring at all.
- When the upstream ontology updates, the maintainer publishes a new bundle; downstream users pull a diff.
This pattern is the answer to the messy-source problem. The bundled engine handles tree-shaped sources well (JSON, YAML, OSCAL); the marketplace fills the gap for messy tabular sources (the original NIST 800-53 XLSX, MITRE ATT&CK matrices) by community-shared pre-transformation. A user with NIST CSF 2.0 doesn’t need to fight the spreadsheet — they download the pre-transformed bundle.
| Mechanism | Where it lives | Notes |
|---|---|---|
| In-repo registry | Subfolder of Crosswalker repo | Tight integration; CI validates every bundle on merge; PRs from community mix with engine PRs |
| Companion repo | Separate crosswalker-recipes (or crosswalker-bundles) repo | Cleaner separation; independent versioning; easier for non-engine contributors |
Either works; the choice is deferrable. Both are GitHub-based, both are copy-paste-or-clone usable.
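The "downstream users pull a diff" step can be pictured as a comparison of two bundle snapshots. This sketch assumes each bundle can be summarized as a map from concept identity to content hash; that representation, the `diffBundles` name, and the sample IDs and hashes are all hypothetical.

```typescript
// A bundle snapshot: concept identity -> content hash (assumed representation).
type BundleIndex = Map<string, string>;

interface BundleDiff { added: string[]; removed: string[]; changed: string[] }

function diffBundles(previous: BundleIndex, next: BundleIndex): BundleDiff {
  const diff: BundleDiff = { added: [], removed: [], changed: [] };
  for (const [id, hash] of next) {
    const before = previous.get(id);
    if (before === undefined) diff.added.push(id);
    else if (before !== hash) diff.changed.push(id);
  }
  for (const id of previous.keys()) {
    if (!next.has(id)) diff.removed.push(id);
  }
  return diff;
}

const v1: BundleIndex = new Map([["ex:AC-1", "h1"], ["ex:AC-2", "h2"], ["ex:AC-3", "h3"]]);
const v2: BundleIndex = new Map([["ex:AC-1", "h1"], ["ex:AC-2", "h2b"], ["ex:AC-4", "h4"]]);
const bundleDiff = diffBundles(v1, v2);
console.log(bundleDiff);
```

Keying the comparison on stable identity rather than file path is what lets a bundle be re-laid-out without every concept showing up as removed and re-added.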
The five-axis recipe selection
Even after a source is parsed, every recipe makes five orthogonal choices about how the source lands as Tier 1:
| Axis | Question | Examples |
|---|---|---|
| Depth | How many levels of the source hierarchy materialize as vault structure? | NIST CSF 2.0: “two layers deep” vs “all the way to subcategories” |
| Mechanism | Which of the four hierarchy primitives carry the structure? | Folder, heading, tag, wikilink-graph, or composition |
| Filter | Which subset of the source is imported? | ”Only AC family controls” / “only HIGH-baseline controls” / “all” |
| Granularity | One file per leaf concept, or one file per group with leaves as headings? | Per-control file vs per-family file |
| Projection | Which fields of each concept land as frontmatter, body, wikilink, or are dropped? | Control text → body; control ID → frontmatter; baseline → tag |
These five axes are why “the recipe” is a non-trivial declarative artifact, not a single transform expression. They’re also the reason a single source ontology can produce many legitimately-different vaults — and why Crosswalker can’t just “auto-import” a source without recipe author input.
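To show how two of the axes change the resulting vault, the sketch below models the five choices as one data object and plans file paths from it. The `RecipeAxes` shape is invented for this example and is not the recipe schema; only the filter and granularity axes are exercised by `planFiles`.

```typescript
interface ControlRow { control_id: string; family: string; baseline: string }

type Mechanism = "folder" | "heading" | "tag" | "wikilink";

// Hypothetical recipe fragment: one field per axis, all plain data.
interface RecipeAxes {
  depth: number;
  mechanism: Mechanism[];
  filter: { column: keyof ControlRow; equals: string } | null;
  granularity: "per-control" | "per-family";
  projection: { frontmatter: string[]; body: string; tags: string[] };
}

// Decide which files an import would create under the chosen filter and granularity.
function planFiles(rows: ControlRow[], axes: RecipeAxes): string[] {
  const rule = axes.filter;
  const kept = rule === null ? rows : rows.filter((row) => row[rule.column] === rule.equals);
  const paths = kept.map((row) =>
    axes.granularity === "per-control" ? `${row.family}/${row.control_id}.md` : `${row.family}.md`,
  );
  return [...new Set(paths)]; // per-family collapses many rows into one file
}

const rows: ControlRow[] = [
  { control_id: "AC-1", family: "AC", baseline: "LOW" },
  { control_id: "AC-2", family: "AC", baseline: "HIGH" },
  { control_id: "AU-1", family: "AU", baseline: "HIGH" },
];
const base: RecipeAxes = {
  depth: 2,
  mechanism: ["folder"],
  filter: null,
  granularity: "per-family",
  projection: { frontmatter: ["control_id"], body: "text", tags: ["baseline"] },
};
const highOnly: RecipeAxes = {
  ...base,
  filter: { column: "baseline", equals: "HIGH" },
  granularity: "per-control",
};
console.log(planFiles(rows, base), planFiles(rows, highOnly));
```

The same three rows produce two different, equally valid file plans, which is the point of the table above.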
The ~40-primitive transformation catalog
Underneath the recipe, the bundled engine implements a closed set of transformation primitives. Comprehensive but small. Drawn from RML/YARRRML, FNML, JSONata, dbt, and the user’s prior tooling. Grouped by category:
| Category | Primitives (illustrative) |
|---|---|
| Source iteration | iterate-rows, iterate-records, iterate-tree-nodes, iterate-paths, iterate-grouped |
| Identity / ID synthesis | curie-from-pattern, sha256-cid, uuid-v7, slugify, normalize-id |
| Field projection | project, rename, drop, default, coerce-type, parse-date |
| String transforms | trim, lowercase, uppercase, replace, split, join, regex-extract, regex-replace |
| Tree-from-flat | parent-id-to-tree, dotted-id-to-tree, prefix-to-tree, indent-to-tree |
| Joins / lookups | inner-join, left-join, lookup-table, fuzzy-match, alias-resolve |
| Address rendering | folder-path, heading-anchor, tag-path, wikilink-target |
| Validation / guard | require-field, allowed-values, type-check, schema-validate |
| Provenance | record-source-ref, record-version, record-timestamp, record-hash |
A recipe is a pipeline (or DAG) of these primitives. Composition is what makes the engine complete enough — Macro Tree Transducer theory (per Ch 20 deliverable A) tells us roughly 5–6 algebraic operators are sufficient; the ~40 above unfold those operators into named, recipe-author-friendly forms.
The catalog is not finalized. It is the active design surface.
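Composition of primitives can be sketched with a handful of the string transforms from the table. In a real recipe the pipeline would be declared as data and interpreted by the engine; here the primitives are ordinary functions so the composition itself is visible. The function bodies are illustrative implementations, not the engine's.

```typescript
type Primitive = (value: string) => string;

// A few primitives from the catalog, written as plain string functions.
const trim: Primitive = (v) => v.trim();
const lowercase: Primitive = (v) => v.toLowerCase();
const slugify: Primitive = (v) => v.replace(/[^a-z0-9]+/gi, "-").replace(/^-+|-+$/g, "");
const regexExtract = (pattern: RegExp): Primitive => (v) => {
  const match = pattern.exec(v);
  // Prefer the first capture group; fall back to the whole match; empty string if no match.
  return match ? match[1] ?? match[0] : "";
};

// Compose primitives left to right into one transform.
function pipeline(...steps: Primitive[]): Primitive {
  return (value) => steps.reduce((current, step) => step(current), value);
}

const toSlug = pipeline(trim, lowercase, slugify);
const familyOf = pipeline(trim, regexExtract(/^([A-Z]{2})-\d+/));

console.log(toSlug("  Account Management (AC-2) ")); // account-management-ac-2
console.log(familyOf(" AC-2 ")); // AC
```

Because every primitive has the same signature, adding one to the catalog does not change how pipelines are assembled.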
YARRRML, explained simply
Several of the Ch 20 deliverables recommended YARRRML as the surface DSL for recipes. This is a real choice the user has flagged for further evaluation. Here’s the plain-English version:
YARRRML is a YAML format for writing data-mapping recipes. You write a YAML file that says “this column in my CSV maps to this property in the output, with this transform.” That’s it.
The complicated-looking name comes from “YAML for RDF Mapping Language” — its original use was generating RDF triples. But the structure (sources / mappings / transforms / sinks) is widely useful even if you ignore the RDF heritage and retarget to Markdown notes.
A toy YARRRML-shaped recipe (not real syntax — illustrative):
Five things to notice:
- Sources — declare inputs once, by type and location.
- Mappings — for each row/record/node, declare the output target.
- Templates — {{...}} interpolates fields with optional filters (|lowercase).
- Sinks — file, frontmatter, body, tags. Closed vocabulary; same vocabulary covers any domain.
- No code — this is data, not Python. An AI agent can read, generate, modify, and validate it.
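The template and sink ideas in this list can be sketched in TypeScript as a mapping object interpreted against one row. This is not YARRRML syntax and not the engine's implementation: the `Mapping` shape, the three-entry filter table, and the sample row are assumptions chosen to mirror the four sinks named above.

```typescript
type Row = Record<string, string>;

// Closed filter vocabulary (only a small illustrative subset).
const FILTERS: Record<string, (value: string) => string> = {
  lowercase: (v) => v.toLowerCase(),
  uppercase: (v) => v.toUpperCase(),
  trim: (v) => v.trim(),
};

// Interpolate {{field}} and {{field|filter}} against one row.
function interpolate(template: string, row: Row): string {
  return template.replace(
    /\{\{\s*(\w+)\s*(?:\|\s*(\w+)\s*)?\}\}/g,
    (_m: string, field: string, filter: string | undefined) => {
      const value = row[field];
      if (value === undefined) throw new Error(`unknown field: ${field}`);
      if (filter === undefined) return value;
      const apply = FILTERS[filter];
      if (!apply) throw new Error(`unknown filter: ${filter}`);
      return apply(value);
    },
  );
}

// Hypothetical mapping: one template (or set of templates) per sink.
interface Mapping { file: string; frontmatter: Record<string, string>; body: string; tags: string[] }

const mapping: Mapping = {
  file: "controls/{{family|lowercase}}/{{control_id}}.md",
  frontmatter: { id: "{{control_id}}", title: "{{title}}" },
  body: "{{text}}",
  tags: ["baseline/{{baseline|lowercase}}"],
};

function applyMapping(m: Mapping, row: Row) {
  const frontmatter: Record<string, string> = {};
  for (const key of Object.keys(m.frontmatter)) frontmatter[key] = interpolate(m.frontmatter[key], row);
  return {
    file: interpolate(m.file, row),
    frontmatter,
    body: interpolate(m.body, row),
    tags: m.tags.map((tpl) => interpolate(tpl, row)),
  };
}

const out = applyMapping(mapping, {
  control_id: "AC-2", family: "AC", title: "Account Management", text: "Manage accounts.", baseline: "HIGH",
});
console.log(out);
```

The mapping itself is plain data, so it could equally be loaded from a YAML or JSON file and validated before any row is processed.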
Why it’s a strong candidate: the R2RML → RML → YARRRML lineage has 12+ years of refinement behind it; mature tooling (Matey editor, parsers); tree-shaped recipes are easier to validate, diff, version, and share than imperative scripts. Anyone — author, agent, validator — can reason about a YAML file mechanically.
Why it’s not a foregone conclusion: YARRRML’s RDF heritage may bring vocabulary baggage; retargeting to Markdown sinks needs a dialect; a custom Crosswalker-native DSL might fit better; or JSON Schema + plain JSONata might be enough without the YARRRML wrapper.
A dedicated explainer page lives at agent-context/agent-tooling/yarrrml-explained (TODO — coming as the Ch 23 research and recipe-DSL design firm up).
How this connects to the rest of Crosswalker
Edge-level concerns (STRM predicates, SSSOM rows, junction notes, ontology diff atoms) operate on concept identity and don’t care which import mechanism produced the vault. The single coupling point is the address-rendering function that turns a concept identity into a wikilink target string given the recipe’s hierarchy-primitive choices. See hierarchy primitives for that separation.
This means the import-side architecture (this page) is strictly orthogonal to the edge-side architecture. A user could import the same NIST 800-53 catalog with three completely different recipes and three completely different vault layouts; the resulting STRM/SSSOM/junction-note structure would be semantically identical.
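The separation can be sketched as one small function plus an edge type that never mentions paths. The two layouts, the `wikilinkTarget` name, and the example edge are hypothetical; the real address-rendering function takes the recipe's full hierarchy-primitive choices into account.

```typescript
interface ConceptIdentity { curie: string; family: string; localId: string }
type Layout = "folder-per-family" | "flat";

// The one place where a vault layout choice touches a concept identity.
function wikilinkTarget(concept: ConceptIdentity, layout: Layout): string {
  return layout === "folder-per-family" ? `${concept.family}/${concept.localId}` : concept.localId;
}

// Edges are stored against CURIEs, so they are identical under every layout.
interface Edge { subject: string; predicate: string; object: string }

const ac2: ConceptIdentity = { curie: "nist80053:AC-2", family: "AC", localId: "AC-2" };
const edge: Edge = { subject: ac2.curie, predicate: "example:relatedTo", object: "example:OTHER-1" };

console.log(`[[${wikilinkTarget(ac2, "folder-per-family")}]]`); // [[AC/AC-2]]
console.log(`[[${wikilinkTarget(ac2, "flat")}]]`); // [[AC-2]]
console.log(edge.subject); // nist80053:AC-2
```

Switching layouts changes only what `wikilinkTarget` returns; the edge record is untouched.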
Where this came from — broader portfolio context
Crosswalker’s stance on import — “schema as primitive, ETL as convenience, marketplace as community escape hatch” — emerged from a portfolio of related tools the user has built over time, all aimed at applied information science with Obsidian as the platform:
- ChunkyCSV — CSV-shaped tabular ETL with tree transducer flavor. Precedent for the “tabular sources need first-class tree-recovery” axis.
- JSONaut — declarative JSON transformation. Precedent for the “external producers should be able to emit Tier 1 without bespoke Crosswalker code” axis.
- SEACOW — meta-framework for knowledge organization in filesystem primitives. Precedent for the “primary folder + parallel tag hierarchy” composition pattern.
- folder-tag-sync — Obsidian plugin that bidirectionally syncs folder hierarchy with tag hierarchy. Precedent for live composition rules between two of the four hierarchy primitives.
Crosswalker is not these tools rebranded — it is the general ingestion target their patterns point at. The marketplace pattern in particular is the answer to the question that ChunkyCSV and JSONaut individually pose: “we keep doing this transform every time someone imports the same source — why?”
Open design questions
Items the architecture has named, with the current lean and status of each (some are resolved, others not yet committed):
| Question | Default lean | Status |
|---|---|---|
| Bundle engine implementation language | Pure TypeScript in-plugin (Path A) for v0.1; Hybrid (Path C — opt-in external Python producer) for v0.5+ | ✅ RESOLVED 2026-05-04 — see Ch 23 synthesis log. Mobile-Obsidian portability + small-OSS contributor pool forced the verdict |
| Surface DSL flavor | YARRRML-shaped (declarative); v0.1 may ship a simpler subset | Open |
| Marketplace mechanism | Either in-repo registry or companion repo; both viable | Deferrable |
| Hardcoded vs declarative dialect spec | Declarative (the dialect IS data even when only one ships) | Provisionally settled |
| target_schema as JSON Schema (machine-readable) vs prose | Machine-readable | Settled |
| Recipe schema runtime-agnostic | JSON Schema + AJV; JSONata 2.x as expression sub-language; engine implementations swappable without breaking user files | ✅ Settled 2026-05-04 per Ch 23 §4 — the single most important architectural commitment of the design phase |
| External-producer protocol surface (push-into-Crosswalker via MCP) | Defer until Tier 1 schema lands | Open |
See the 2026-05-04 import engine design log for the canonical state of these decisions.
Related
Concept pages:
- Hierarchy primitives — folder, file, heading, tag, wikilink and how recipes compose them via the 5-mechanism grammar
- The problem — hierarchy-vs-graph tension; why ingestion target matters
- What makes Crosswalker unique — Spec / Library / Integrations pillar that this page activates
- Embedded vs server substrates — Tier 2 sidecar projection from Tier 1
- Ontology evolution — re-import semantics + provenance
- Terminology — definitions: Tier 1, recipe, render(), CURIE, ParsedData (Path A only), provenance
Agent context:
- v0.1 schema spec — current Tier 1 contract; the load-bearing primitive this page describes
- Vision — schema-as-primitive is the central architectural commitment
- Tradeoffs — convenience-vs-canonical, bundled-engine-vs-external-producer
Implementation milestones (Path A):
- v0.1.1 — Type system + validation foundation — AJV wires the schema into runtime
- v0.1.2 — render() v1 — the single coupling point ships
- v0.1.3 — Generation engine integration — render() wired through the bundled engine
- v0.1.4 — Junction notes + crosswalk edges — kind discriminator + STRM enforcement at the validation gate
- v0.1.5 — Tier 2 sqlite-wasm sidecar projector — Tier 1 → Tier 2 projection (queries layer)
- v0.1.7 — Exporters — Tier 1 → STRM TSV / OSCAL JSON / SSSOM TSV (round-trip determinism)
Design decisions (synthesis logs):
- 2026-05-05 ETL pipeline clarification — what ParsedData is + isn’t; the three producer paths drawn explicitly
- 2026-05-04 import-engine design log — canonical state of import-engine commitments
- Ch 22 synthesis (target-structure expressivity) — closed 5-mechanism recipe grammar; render() signature
- Ch 23 synthesis (bundle/engine/language) — Path A for v0.1, Path C for v0.5+; runtime-agnostic recipe schema (the most important modularity commitment)
- Ch 24 synthesis (Tier 2 substrate) — sqlite-wasm + sqlite-vec
- Ch 20 synthesis log (formal transformation algebra) — the wargaming setup
- v0.1.3 delivery log
- v0.1.4 delivery log
Research challenges (resolved → archive):
- Ch 21 — Build vs buy ETL engine — meta-question above engine implementation
- Ch 23 — Bundle engine language (archived) — resolved 2026-05-04: Path A for v0.1, Path C for v0.5+
Research deliverables:
- Ch 20 deliverable a (T1TMA primitives)
- Ch 20 deliverable c (RML retargeted) — R2RML lineage of the {var|filter} template grammar
- Ch 22 deliverable (target-structure expressivity)
- Ch 23 deliverable (bundle/engine/language)
Spec files:
- spec/tier1.schema.json — the contract this page describes
- spec/recipe.schema.json — recipe shape Path A consumes