Import engine design — closing the design phase before concrete work begins

§1 The reframe — schema as the load-bearing primitive

The single most consequential architectural decision of this design phase:

Crosswalker’s load-bearing primitive is the Tier 1 target schema. ETL is optional convenience. External ETL is fine.

This dissolves several previously-open questions in one move:

| Open question | Resolved by reframe |
| --- | --- |
| Build vs buy ETL engine? | Neither is load-bearing. Specify the schema; ship a thin reference engine for the zero-config case; let users produce Tier 1 with whatever ETL they prefer (Python, dbt, dlt, ChunkyCSV, hand-written, GPT) |
| Custom DSL vs familiar one? | The schema is the contract. Producers use any expression language as long as the output conforms |
| Where do recipes live? | Only matters for the bundled reference engine. External users keep recipes wherever they like; the schema doesn’t care |
| In-plugin TS vs external Python? | Both. Tiny in-plugin engine for “open vault, click import CSV, done” UX. External tools for power users / batch / CI / agents |
| Protocol surface (Path D from Ch 21)? | Falls out naturally: the protocol IS the schema. Anyone speaking it is a producer; Crosswalker is the canonical receiver |

The architectural precedent is strong: HTML (browsers don’t care how you produce it), JSON Schema, OpenAPI, SBOM formats (SPDX/CycloneDX), Markdown itself. The spec is the artifact; the engines are commodity.

Four pieces, layered cleanly:

2.1 The Tier 1 target schema (load-bearing contract)

This is the Tier 1 representation that Crosswalker actually defines. It is composed of:

  • Markdown files with YAML frontmatter
  • Folder layout (composable with the other hierarchy mechanisms)
  • STRM predicate vocabulary for crosswalk edges
  • SSSOM envelope for crosswalk metadata
  • Junction notes (13-field schema) for evidence links
  • _crosswalker metadata block for provenance
  • The closed sink vocabulary from Ch 20 (path, frontmatter[k], body[region], wikilink[role])

This contract MUST be expressible as a machine-readable artifact (tier1.schema.json or LinkML) so external producers can validate output mechanically.
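
As a rough illustration of what mechanical validation could look like, the sketch below checks a parsed note against a tiny, hand-rolled subset of the contract. The `ParsedNote` shape, the `validateTier1Note` helper, and the `source` / `recipe` keys inside `_crosswalker` are all hypothetical; the actual rules would come from tier1.schema.json (or LinkML) and be enforced by a real schema validator.

```typescript
// Hypothetical, minimal stand-in for a Tier 1 conformance check.
// Key names inside `_crosswalker` are illustrative assumptions, not the real contract.
interface ParsedNote {
  path: string; // vault-relative path
  frontmatter: Record<string, unknown>; // YAML frontmatter, already parsed
  body: string;
}

function validateTier1Note(note: ParsedNote): string[] {
  const errors: string[] = [];
  if (!note.path.endsWith(".md")) {
    errors.push(`${note.path}: Tier 1 notes must be Markdown files`);
  }
  const meta = note.frontmatter["_crosswalker"];
  if (typeof meta !== "object" || meta === null) {
    errors.push(`${note.path}: missing _crosswalker metadata block`);
  } else {
    const record = meta as Record<string, unknown>;
    for (const key of ["source", "recipe"]) {
      if (typeof record[key] !== "string") {
        errors.push(`${note.path}: _crosswalker.${key} must be a string`);
      }
    }
  }
  return errors;
}

const conforming: ParsedNote = {
  path: "NIST-CSF/GV.md",
  frontmatter: { _crosswalker: { source: "nist:csf-2.0", recipe: "starter" } },
  body: "# Govern",
};
const broken: ParsedNote = { path: "NIST-CSF/GV.txt", frontmatter: {}, body: "" };

console.log(validateTier1Note(conforming)); // no errors
console.log(validateTier1Note(broken)); // two errors: file extension and metadata block
```

Returning a list of error strings rather than throwing lets an external producer see every violation in one pass.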

2.2 The bundled reference ingestion engine (convenience layer)

A thin in-plugin reference engine that handles the 80% case (“open vault, import CSV, done”) with no external tooling. Signature:

engine(
  source,           // parsed input (CSV, XLSX, JSON, OSCAL, RDF, etc.)
  target_schema,    // Tier 1 contract (machine-readable spec)
  recipe            // user's multi-axis selection
) → Tier 1 vault

Three inputs, one output. The engine has no hardcoded knowledge of any source ontology. All per-source decisions live in the recipe.
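
One way to picture that signature in TypeScript is sketched below. Every type here is a placeholder assumption: `TargetSchema` is reduced to a single hypothetical `noteExtension` field and `Recipe` to a single `emit` function, whereas the real recipe schema is declarative and specified elsewhere.

```typescript
// Illustrative types only; the real source, schema, and recipe shapes are still to be specified.
type Source = unknown; // parsed input (CSV rows, JSON tree, ...)
interface TargetSchema { noteExtension: string } // stand-in for the machine-readable Tier 1 contract
interface VaultFile { path: string; content: string }
type Tier1Vault = VaultFile[];

// Assumption: a recipe is reduced here to one function that maps the source to files.
interface Recipe { emit: (source: Source) => VaultFile[] }

type Engine = (source: Source, targetSchema: TargetSchema, recipe: Recipe) => Tier1Vault;

const engine: Engine = (source, targetSchema, recipe) => {
  // The engine knows nothing about the source ontology; the recipe decides everything.
  const files = recipe.emit(source);
  for (const file of files) {
    if (!file.path.endsWith(targetSchema.noteExtension)) {
      throw new Error(`${file.path} does not conform to the target schema`);
    }
  }
  return files;
};

// Hypothetical recipe: one note per record in a JSON-like source.
const oneNotePerRecord: Recipe = {
  emit: (source) =>
    (source as { id: string; title: string }[]).map((r) => ({
      path: `${r.id}.md`,
      content: `# ${r.title}\n`,
    })),
};

const vault = engine(
  [{ id: "GV", title: "Govern" }, { id: "ID", title: "Identify" }],
  { noteExtension: ".md" },
  oneNotePerRecord,
);
console.log(vault.map((f) => f.path));
```

The point of the sketch is the separation of concerns: swapping `oneNotePerRecord` for another recipe changes the output without touching `engine`.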

The bundled engine handles tree-shaped sources far more readily than tabular ones. It should lean into that asymmetry and state its limits plainly:

| Source shape | Bundled engine handles? |
| --- | --- |
| Tree (JSON, YAML, JSON-LD, OSCAL JSON, XML) | Yes: inherent structure means reshape primitives suffice |
| Tabular with hierarchy column (well-formed CSV) | Yes: basic interpretation rules cover most cases |
| Tabular without hierarchy (ad-hoc CSV, XLSX with merged cells) | Maybe: depends on recipe complexity; falls back to marketplace |
| Prose (PDF, HTML) | No: left to external tools |

2.3 External producers (anyone, any language)

External ETL tools target the schema directly. Python+Polars+DuckDB; dbt; dlt; ChunkyCSV; JSONaut; bash scripts; LLM-driven extractors; bespoke per-org code. Output conforms to the Tier 1 contract. Crosswalker validates and accepts.

This is not a downgrade. It is the platform-architecture pattern described in “What makes Crosswalker unique”, made operational: the bundled engine is one valid producer, and many others are equally valid.

2.4 Marketplace of pre-transformed ontology bundles (community pattern)

“Once someone transforms an ontology once, then it’s transformed.”

For ontologies with messy tabular sources that the bundled engine can’t handle, the community transforms each source once and shares the result. Each marketplace bundle ships:

  • The original source URL + version pin (nist:csf-2.0 + sha256)
  • The recipe used (so it’s re-runnable when upstream updates)
  • The pre-built Tier 1 vault content
  • A migration crosswalk (when the bundle updates)

crosswalker install nist-csf-2.0 becomes the user-facing flow. The user doesn’t write a recipe; they install someone else’s working one. This is the architectural answer to the tabular-source problem — the bundled engine doesn’t need to handle every weird XLSX because the community handles each one once.
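
The bundle contents listed above could be captured in a small manifest. The sketch below is only one possible shape: `BundleManifest`, its field names, the `needsRebuild` helper, the example.org URL, and the shortened hash are all hypothetical placeholders, not a committed format.

```typescript
// Hypothetical manifest shape for a marketplace bundle; every field name is an assumption.
interface BundleManifest {
  name: string;
  source: { url: string; pin: string; sha256: string }; // original source + version pin
  recipePath: string; // the recipe used, kept so the transform is re-runnable
  vaultPath: string; // the pre-built Tier 1 vault content
  migrationCrosswalk?: string; // present when the bundle supersedes an earlier version
}

// Decide whether the bundle should be rebuilt because the upstream file changed.
function needsRebuild(manifest: BundleManifest, upstreamSha256: string): boolean {
  return manifest.source.sha256.toLowerCase() !== upstreamSha256.toLowerCase();
}

const csfBundle: BundleManifest = {
  name: "nist-csf-2.0",
  // Placeholder URL and truncated placeholder hash, for illustration only.
  source: { url: "https://example.org/csf-2.0.xlsx", pin: "nist:csf-2.0", sha256: "ab12" },
  recipePath: "recipe.yaml",
  vaultPath: "vault/",
};

console.log(needsRebuild(csfBundle, "AB12")); // false: same hash, different case
console.log(needsRebuild(csfBundle, "ffff")); // true: upstream changed
```

Keeping the recipe path next to the hash pin is what makes the bundle re-runnable when upstream publishes a new version.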

The recipe carries the per-source flexibility. Five orthogonal axes the recipe author controls:

| # | Axis | What it controls | Example for NIST CSFv2 |
| --- | --- | --- | --- |
| 1 | Depth | How many levels of the source tree to materialize | User A: Functions + Categories only (2 deep). User B: + Subcategories (3 deep). User C: + Implementation Examples (4 deep). |
| 2 | Hierarchy mechanism | Which of folder / heading / tag / wikilink-graph carries hierarchy | User A: folders. User B: folders top-2, headings within. User C: tags only. |
| 3 | Inclusion filter | Which branches/subtrees to include or exclude | “Only the Govern and Identify Functions, not Protect/Detect/Respond/Recover” |
| 4 | Per-level granularity | What’s captured at each level (full body vs label-only vs collapsed) | “Categories get a stub note; Subcategories get full body + frontmatter” |
| 5 | Cross-cutting projection | Which source fields become frontmatter vs body vs tags vs wikilinks | “Implementation guidance → body section. Reference IDs → frontmatter. Functions → parallel tag.” |

These compose. The same NIST CSFv2 source produces dramatically different vault outputs depending on the recipe — the engine is uniform; recipes carry the decisions.
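
To make the composition concrete, the sketch below applies two of the five axes (depth and inclusion filter) to a tiny tree. The `SourceNode` and `RecipeAxes` types and the `materialize` function are hypothetical, and the sample tree is a hand-picked fragment of CSF identifiers rather than a full import.

```typescript
interface SourceNode { id: string; label: string; children: SourceNode[] }

// Only two of the five axes are modeled here; the field names are assumptions.
interface RecipeAxes {
  depth: number; // axis 1: how many levels to materialize
  include?: string[]; // axis 3: top-level ids to keep (all if omitted)
}

function materialize(nodes: SourceNode[], axes: RecipeAxes, level = 1): SourceNode[] {
  if (level > axes.depth) return [];
  return nodes
    .filter((n) => level !== 1 || !axes.include || axes.include.includes(n.id))
    .map((n) => ({ ...n, children: materialize(n.children, axes, level + 1) }));
}

// Hand-picked fragment; subcategory text is deliberately left as a placeholder label.
const csf: SourceNode[] = [
  {
    id: "GV", label: "Govern", children: [
      {
        id: "GV.OC", label: "Organizational Context", children: [
          { id: "GV.OC-01", label: "Subcategory GV.OC-01", children: [] },
        ],
      },
    ],
  },
  { id: "DE", label: "Detect", children: [{ id: "DE.CM", label: "Continuous Monitoring", children: [] }] },
];

// "User A": two levels deep, Govern only.
const userA = materialize(csf, { depth: 2, include: ["GV"] });
// "User B": three levels deep, everything.
const userB = materialize(csf, { depth: 3 });
console.log(JSON.stringify(userA));
```

Both calls run the same function over the same source; only the axes differ, which is the uniform-engine property described above.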

The recipe-DSL surface needs to express these five axes concisely; how it does so is still an open question (see §6).

Documented in concepts/hierarchy-primitives. Brief recap:

| # | Primitive | Cardinality | When right |
| --- | --- | --- | --- |
| 1 | Folder | Mono-hierarchical (one path per note) | Deep semi-stable hierarchies; matches GRC consultant filesystem habits |
| 2 | Markdown heading | Mono-hierarchical within file | Very deep ontologies (MITRE ATT&CK ~thousands of nodes); shallow ontologies sharing metadata |
| 3 | Tag | Polyhierarchical (multiple parents) | Cross-cutting concerns; parallel views over same flat file set |
| 4 | Wikilink-graph | Maximally graph-shaped | Genuinely graph-shaped ontologies; PKM-style workflows |

Real recipes compose these. The full target-structure expressivity question is the subject of Ch 22. For this design log: the four mechanisms are the primitive Obsidian affordances the engine has to work with; recipes choose which to use at which level.
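
A minimal sketch of "which mechanism at which level" follows, restricted to the folder and heading mechanisms. It is not the Ch 22 render function: `renderAddress`, the `Address` shape, and the rule that the last folder level names the file are assumptions made for illustration.

```typescript
type Mechanism = "folder" | "heading"; // tags and wikilinks are left out of this sketch
interface Address { filePath: string; headings: string[] }

// Map a concept's ancestor path to a vault address using a per-level mechanism list.
function renderAddress(conceptPath: string[], perLevel: Mechanism[]): Address {
  const folders: string[] = [];
  const headings: string[] = [];
  conceptPath.forEach((segment, i) => {
    const mechanism = perLevel[i] ?? "heading"; // unlisted levels default to headings
    // Once a heading level has been seen, deeper levels stay inside the file.
    if (mechanism === "folder" && headings.length === 0) folders.push(segment);
    else headings.push(segment);
  });
  if (folders.length === 0) {
    throw new Error("at least one folder level is needed to name the file");
  }
  return { filePath: folders.join("/") + ".md", headings };
}

const concept = ["Govern", "GV.OC", "GV.OC-01"];
// All folders: one note per concept.
const allFolders = renderAddress(concept, ["folder", "folder", "folder"]);
// Folders for the top two levels, headings within.
const mixed = renderAddress(concept, ["folder", "folder", "heading"]);
console.log(allFolders.filePath); // Govern/GV.OC/GV.OC-01.md
console.log(mixed.filePath, mixed.headings); // Govern/GV.OC.md [ 'GV.OC-01' ]
```

The same concept lands at two different addresses depending only on the per-level list, which is the decision a recipe would carry.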

The bundled engine’s transformation vocabulary: ~40 primitives in ten categories, drawing on the convergent answer from RML/YARRRML, JSONata, Bento/Bloblang, SSSOM/T, Cribl-style processors, the user’s own ChunkyCSV + JSONaut tools, and macro tree transducer theory:

| Category | Primitives | Notes |
| --- | --- | --- |
| Path | project, get, set, walk, descend, parent | MTT-derived structural ops |
| Reshape | restructure, rename, flatten, unflatten, merge, split | Tree-to-tree shape changes |
| String | regex extract / replace, split, join, trim, case-convert, slugify, template-interpolate | Cribl/Bento parity |
| Type | parse-date, parse-number, parse-bool, coerce, format-date, format-number | Standard coercions |
| Filter | predicate, when, reject, distinct | Predicate-based selection |
| Aggregate | group-by, count, sum, min, max, concat, sort | Reductions |
| Reference | lookup, expand-curie, contract-curie, resolve-wikilink, join-on-key | Cross-record resolution; address-rendering for Ch 22 lands here |
| Generate | template, hash, uuid, slug, increment | Synthesizing values |
| Validate | type-check, required, cardinality, schema-conform, regex-match | Constraint enforcement |
| Cross-structural | tabularize (tree→rows), treeify (rows→tree), edge-extract, pivot, unpivot | The format-boundary primitives (ChunkyCSV territory) |
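
Two of the cataloged primitives, treeify (rows→tree) and slugify, can be sketched in a few lines to show the intended granularity. The `Row` and `TreeNode` shapes, the `parent` column name, and the exact slug rules are assumptions; the real input/output contracts are part of the primitive spec still to be written.

```typescript
interface Row { id: string; parent: string | null; label: string }
interface TreeNode { id: string; label: string; children: TreeNode[] }

// treeify: turn flat rows with a parent column into a forest.
function treeify(rows: Row[]): TreeNode[] {
  const byId = new Map<string, TreeNode>();
  for (const r of rows) byId.set(r.id, { id: r.id, label: r.label, children: [] });
  const roots: TreeNode[] = [];
  for (const r of rows) {
    const node = byId.get(r.id);
    if (!node) continue;
    const parent = r.parent === null ? undefined : byId.get(r.parent);
    if (parent) parent.children.push(node);
    else roots.push(node); // rows with no parent (or an unknown one) become roots
  }
  return roots;
}

// slugify: lower-case, collapse non-alphanumerics to hyphens, trim the ends.
function slugify(text: string): string {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

const rows: Row[] = [
  { id: "GV", parent: null, label: "Govern" },
  { id: "GV.OC", parent: "GV", label: "Organizational Context" },
  { id: "ID", parent: null, label: "Identify" },
];
const forest = treeify(rows);
console.log(forest.map((n) => n.id)); // [ 'GV', 'ID' ]
console.log(slugify("Organizational Context (GV.OC)")); // organizational-context-gv-oc
```

Treating unknown parents as roots is a choice made for this sketch; a stricter primitive might report them as validation errors instead.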

Total bundle cost stays under 500 KB. JSONata as the expression sub-language gives the recipe author an open-ended escape hatch for anything not in the catalog.

The exact catalog (with input/output JSON Schemas per primitive) is the spec work that needs to happen before implementation.

§6 Architectural commitments — settled vs still-open

Settled (no further research needed; ready for spec work)

  • Schema is the load-bearing primitive; ETL is convenience (§1)
  • Engine signature: (source × target_schema × recipe) → Tier 1 (§2.2)
  • External producers welcome; the schema is the protocol (§2.3)
  • Marketplace as architectural answer to messy-source problem (§2.4)
  • Five recipe axes (depth, mechanism, filter, granularity, projection) (§3)
  • Four hierarchy mechanisms as the Obsidian primitives the recipe composes over (§4)
  • ~40-primitive transformation catalog by category (§5)
  • JSONata as expression sub-language within recipes
  • Tier 1 schema must be machine-readable (JSON Schema or LinkML), not just a docs page

Still open (decision needed before or during implementation)

User reactions to these open decisions (received 2026-05-04 after this log first landed) are recorded in the Status / user signal column.

| Open | Options | Default lean | Status / user signal |
| --- | --- | --- | --- |
| Bundled engine implementation language | TypeScript/Bun (in-plugin); Python (external CLI); Hybrid; Rust/Go-WASM; JVM | Hybrid (TS in-plugin + external Python) | RESOLVED 2026-05-04 by Ch 23 deliverable. Verdict: Path A (Pure TS in-plugin) for v0.1; Path C (Hybrid: optional external Python producer) reserved for v0.5+. 8 of 9 commitments adopted; Bun-stays disagreement recorded in synthesis log. Two irreversible constraints forced the answer: mobile-Obsidian portability + small-OSS contributor pool |
| Recipe DSL surface syntax | YARRRML-shaped YAML; Dhall-typed; JSON Schema with discriminated unions; hybrid | YARRRML-shaped YAML | User unfamiliar with YARRRML; ELI5 explainer landed inline in ETL and import § YARRRML, explained simply; dedicated explainer page deferred to agent-tooling section as the recipe DSL choice firms up |
| Recipe storage location | .crosswalker/recipes/ in vault; plugin user-data dir; user choice | .crosswalker/recipes/ in vault: git-versions with content; matches files-canonical principle | Settled |
| Marketplace mechanism | In-repo registry (subfolder of main repo); companion repo (e.g., crosswalker-recipes); built-in registry UI (v1.0+); Obsidian community plugins (v1.0+) | Either in-repo subfolder or companion repo; both viable | User-confirmed flexible: “built in registry in repo would be nice or I just make an additional repo where you can copy paste or download them”. Deferrable until implementation forces a pick |
| Target_schema as data vs hardcoded | Hardcoded (v0.1 simplicity); declarative dialect spec (longer-term) | Declarative-ready architecture, single dialect at v0.1. v0.1 ships ONE canonical dialect, but the dialect IS itself machine-readable JSON Schema (data, not code). Community-authored variants become a v1.0+ feature without engine rework | User-leaned declarative: “declarative likely more though from sound of it”. Adopting declarative-ready architecture; deferring user-authored variant UX |
| External-producer protocol surface (push-into-Crosswalker via MCP / external scrapers / agent-driven extraction) | Defer; design now; defer with stub | Defer until Tier 1 schema is machine-readable; the schema IS most of the protocol | Open; likely picks itself up once Tier 1 schema lands as JSON Schema |
| Tier 2 substrate | @sqlite.org/sqlite-wasm + sqlite-vec (canonical, foundation-governed) vs libSQL (Turso’s fork) vs Turso Cloud Tier 3 vs Limbo long-horizon | Stay on canonical SQLite + sqlite-vec | RESOLVED 2026-05-04 by Ch 24 deliverable. REJECT all three Qs (libSQL Tier 2 migration, Turso Cloud Tier 3 listing, Limbo near-term adoption). Vendor-trajectory signal: Turso publicly de-prioritized libSQL in favor of Limbo. Vector-layer-decoupled-from-substrate (sqlite-vec portable; libSQL native vector locked-in) elevated as load-bearing modularity commitment. Five explicit migration triggers locked. See synthesis log |
| Target-structure expressivity (recipe-author surface for choosing folder vs heading vs tag vs wikilink-graph layout) | Per-level mechanism map; render(Recipe, ConceptIdentity) → Address as single coupling point; content-addressing before render | Closed grammar of 5 mechanisms × ordered layout × also_emit × graph_edges; v0.1 wires folder+file+heading | RESOLVED 2026-05-04 by Ch 22 deliverable. The v0.1 recipe schema is fully specified. See target-structure synthesis log |
| Agent-tooling / progressive-disclosure space | Within KB (agent-context/agent-tooling/); separate repo for agent-consumption material | KB section to start; split out if volume justifies | Skeleton landed at agent-context/agent-tooling/ 2026-05-04; bodies fill as the underlying specs land |

These are concrete decisions for spec/implementation time, not blockers for moving forward.

The v0.1 stack pivot committed to:

  • TypeScript/Bun bundled Obsidian plugin (~1.2 MB)
  • Tier 1 + Tier 2 sqlite-wasm sidecar
  • Markdown + YAML frontmatter as canonical
  • STRM-shaped TSV / OSCAL JSON / SSSOM-flavored TSV exports

This design log adds to the v0.1 commitment; it does not override it. v0.1 ships:

  1. The Tier 1 target schema (tier1.schema.json machine-readable + design/import-engine human-readable pillar)
  2. The bundled reference ingestion engine (in-plugin TS, ~480 KB, handles tree-shaped sources cleanly + simple tabular)
  3. A starter recipe library for canonical sources (NIST 800-53, ISO 27002, MITRE ATT&CK basic)
  4. A schema validator (anyone’s Tier 1 output can be validated against the spec)

Not in v0.1 (deferred to v1.0+):

  • Marketplace registry mechanism (just a GitHub repo at v0.1)
  • User-authored target-schema dialect variants (v0.1 ships a single canonical dialect)
  • External-Python ETL CLI (users hand-author Python if they need it; bundled engine handles the easy 80%)
  • Full lens-style round-trip (forward-only at v0.1; INVERT primitive deferred)

§8 Concrete next steps — what closes the design phase

Ready for implementation only after these land:

| # | Artifact | Location | Status |
| --- | --- | --- | --- |
| 1 | Vault hierarchy primitives concept page | concepts/hierarchy-primitives | ✅ written 2026-05-04 |
| 2 | ETL-and-import concept pillar | concepts/etl-and-import (new) | ✅ written 2026-05-04; frames the schema-as-primitive + marketplace + tabular asymmetry; includes inline YARRRML ELI5 |
| 3 | Import engine design pillar | design/import-engine (new; supersedes stale design/transformation) | UNBLOCKED 2026-05-04 by Ch 23 resolution. Engine signature + recipe DSL + transformation catalog + reference implementation plan can now commit to TS + Bun + JSONata + AJV |
| 4 | Architecture pillar update | design/architecture (existing, stale) | TODO: Tier 1/2/3 + safety guarantee + the engine’s place in it |
| 5 | Import wizard feature page | features/import-wizard (existing, stale) | TODO: refocus as “the bundled reference engine UI” |
| 6 | Tier 1 schema spec (machine-readable) | spec/tier1.schema.json (or LinkML at spec/tier1.linkml.yaml) | TODO: the contract external producers target |
| 7 | Recipe DSL spec (machine-readable) | spec/recipe.schema.json | UNBLOCKED 2026-05-04 by Ch 23 resolution + fully informed by Ch 22 resolution. Runtime-agnostic JSON Schema; JSONata 2.x as expression sub-language; reserve producer field for v0.5+. Closed grammar of 5 mechanisms × ordered layout × also_emit × graph_edges per Ch 22 §10. First development artifact landing in this push. |
| 8 | Transformation primitive library spec | spec/primitives/ (one JSON Schema per primitive) | TODO: input/output contracts for each of ~40 primitives |
| 9 | Starter recipe library | recipes/starter/ (NIST 800-53, ISO 27002, MITRE ATT&CK) | TODO |
| 10 | Ch 23: Bundle engine language research | zz-challenges/archive/23-bundle-engine-language (archived) | ✅ deliverable landed 2026-05-04; synthesis log adopts 8 of 9 commitments; brief archived |
| 11 | Agent-tooling progressive-disclosure section (header + getting-started; bodies fill as #6–#9 land) | agent-context/agent-tooling/ | ✅ skeleton landed 2026-05-04 |
| 12 | Ch 23 synthesis log: committed v0.1 stack lock | zz-log/2026-05-04-bundle-engine-language-synthesis | ✅ landed 2026-05-04 |

When (1)–(9) are in place, the implementation work has a complete spec to build against. No further design conversations needed.

After (1)–(9):

  • Wire the bundled reference engine into the existing Obsidian plugin
  • Replace the current import wizard’s hardcoded behavior with recipe-driven behavior
  • Ship v0.1 with two or three working starter recipes
  • Iterate from real usage