
Ch 20 dialog: Is this just ETL? + Data has fundamental forms (user/agent exchange that crystallized two architectural insights)


Ch 20 dialog — “Is this just ETL?” and “Data has fundamental forms”


Captured during the Ch 20 fresh-agent research session, 2026-05-03. Two user questions and the agent’s responses crystallized two architectural insights worth preserving as research input. Verbatim from the session.


So what I’m essentially building is an ETL engine for importing frameworks and ontologies into my Crosswalker system. Was that accounted for here?

Yes — but your question is worth unpacking because it exposes a subtle distinction the deliverable handles implicitly that’s worth making explicit.

The deliverable does treat this as an ETL problem throughout. The six T1TMA primitives (ITERATE, REFERENCE, TEMPLATE, BIND, JOIN, INVERT) are directly shaped by the fact that your sources are framework catalogs (NIST 800-53, CIS, ISO 27001, SCF, etc.) and your target is a knowledge graph in Obsidian. The NIST 800-53 worked example, the closed slot vocabulary (id, label, body.section, frontmatter., links.<role>, folder), and the JOIN primitive for hierarchy all exist because of the framework-import use case.

But your question hints at something deeper: is there anything special about ETL for ontologies and frameworks versus generic ETL? The answer is yes, and it’s worth naming the three things that make your problem structurally different from, say, loading sales data into a warehouse:

First, your records aren’t flat rows — they’re nodes in a graph. A NIST 800-53 control isn’t just a row with fields. It’s an entity with typed relationships to other entities (parent family, child enhancements, related controls, mapped-to controls in other frameworks). Generic ETL engines (dbt, dlt, Singer) assume tabular-in, tabular-out. Your ETL must produce a graph — notes with wikilinks, folder hierarchies, and semantic edge metadata. The T1TMA handles this through the links.<role> slot and the JOIN primitive, which together let a recipe say “this column points at another entity, and the relationship has type parent / related / mapped_to.” That’s the piece generic ETL frameworks miss entirely.
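To make the graph-aware point concrete, here is a minimal sketch of a row becoming a note with typed links. The `Note` class, the `row_to_note` helper, the semicolon-separated cell convention, and the sample row are all assumptions for illustration, not the T1TMA recipe syntax.

```python
from dataclasses import dataclass, field


@dataclass
class Note:
    """Hypothetical Tier 1 note: an id, a label, and typed outgoing links."""
    id: str
    label: str
    links: dict[str, list[str]] = field(default_factory=dict)


def row_to_note(row: dict[str, str], link_columns: dict[str, str]) -> Note:
    """Bind scalar columns to id/label and relationship columns to links.<role>.

    link_columns maps a source column name to a relationship role
    such as "parent", "related", or "mapped_to".
    """
    note = Note(id=row["id"], label=row["label"])
    for column, role in link_columns.items():
        # Assumed convention: one cell may list several targets, separated by ";".
        targets = [t.strip() for t in row.get(column, "").split(";") if t.strip()]
        if targets:
            note.links[role] = [f"[[{t}]]" for t in targets]
    return note


# Illustrative row, not taken from a real export.
row = {"id": "AC-2", "label": "Account Management",
       "family": "AC", "related": "AC-3; AC-5"}
note = row_to_note(row, {"family": "parent", "related": "related"})
```

The role names travel with the links, which is the part a tabular-in, tabular-out pipeline would have no place to record.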

Second, your target schema is semantically constrained, not arbitrary. You’re not loading into a blank database. You’re materializing into a Tier 1 representation that downstream systems (STRM predicates, SSSOM envelope, ontology diff engine, Nemo derivation) consume with expectations. The closed slot vocabulary in T1TMA exists precisely for this reason — it forces every recipe to produce output that the rest of the Crosswalker stack can reason about. A generic ETL engine would let you write anything to any field, and the semantic contract would be implicit. Here it’s explicit.

Third, the same conceptual entity exists in many serialization formats. NIST 800-53 r5 ships as CSV, XLSX, OSCAL JSON, OSCAL XML, and PDF. CIS Controls ship as XLSX and PDF. SCF ships as XLSX. The same ontology arrives in radically different physical shapes. This is why the ITERATE primitive parameterizes on formulation — the recipe author shouldn’t have to rewrite the entire recipe just because the source format changed. The conceptual mapping (“column X is the control ID”) stays the same; only the physical access layer changes. That’s the RML insight the deliverable adopts.
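A minimal sketch of formulation-parameterized iteration follows. The `iterate` and `apply_mapping` functions, the `"controls"` key, and both sample sources are assumptions chosen for brevity; real OSCAL JSON is nested more deeply than this.

```python
import csv
import io
import json
from typing import Iterator


def iterate(source: str, formulation: str) -> Iterator[dict]:
    """Yield logical records from a source, hiding its physical format."""
    if formulation == "csv":
        yield from csv.DictReader(io.StringIO(source))
    elif formulation == "json":
        # Assumed layout: a top-level "controls" array of flat objects.
        yield from json.loads(source)["controls"]
    else:
        raise ValueError(f"unsupported formulation: {formulation}")


def apply_mapping(records, mapping: dict[str, str]) -> list[dict]:
    """Apply the conceptual mapping (target slot -> source field) to each record."""
    return [{slot: rec[name] for slot, name in mapping.items()} for rec in records]


CSV_SOURCE = "id,title\nAC-1,Policy and Procedures\nAC-2,Account Management\n"
JSON_SOURCE = json.dumps({"controls": [
    {"id": "AC-1", "title": "Policy and Procedures"},
    {"id": "AC-2", "title": "Account Management"},
]})

# The mapping is written once and reused across formulations.
MAPPING = {"id": "id", "label": "title"}

from_csv = apply_mapping(iterate(CSV_SOURCE, "csv"), MAPPING)
from_json = apply_mapping(iterate(JSON_SOURCE, "json"), MAPPING)
```

Only `iterate` knows about the physical format; `MAPPING` is identical for both sources, and both calls produce the same list of records.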

So to directly answer your question: the deliverable accounts for this being an ontology/framework ETL engine in three specific ways — graph-aware output (links + joins), semantically constrained target (closed slot vocabulary composing with STRM/SSSOM/diff), and format-diverse sources (formulation-parameterized iteration). The six primitives aren’t generic ETL primitives that happen to work for frameworks; they’re shaped by the framework-import problem and wouldn’t look the same if the target were a SQL warehouse or an RDF triplestore.

The one area where you might want to push further: the deliverable doesn’t deeply address the “ontology” part of “ontology ETL.” Specifically, when you import SCF’s master spreadsheet, you’re not just importing controls — you’re importing an ontology’s assertion about how controls relate across frameworks. That means some imported rows are entities (controls) and some are edges (mappings between controls). The T1TMA handles this through the metadata.<sssom-key> slot (which passes through to the SSSOM envelope), but it treats entity-import and edge-import as the same operation with different slot bindings. Whether that uniformity is a feature or a gap depends on whether you ever need recipes that reason about the difference between “I’m importing a node” and “I’m importing an edge” at recipe-definition time. Right now the deliverable says “that reasoning happens downstream in Nemo/STRM, not in the recipe.” That seems right for v0.2, but it’s worth flagging as a v1.0+ question.
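The uniformity described above can be sketched as one binding function driven by two different binding tables. The `bind` function, the column names, and the sample rows are hypothetical; `subject_id`, `predicate_id`, and `object_id` are standard SSSOM slots, but their placement under a `metadata.` prefix here only mirrors the slot naming in this dialog.

```python
def bind(row: dict[str, str], bindings: dict[str, str]) -> dict[str, str]:
    """Copy each bound source column into its target slot."""
    return {slot: row[column] for slot, column in bindings.items()}


# Same operation, different slot bindings.
ENTITY_BINDINGS = {"id": "control_id", "label": "control_name"}
EDGE_BINDINGS = {
    "id": "mapping_id",
    "metadata.subject_id": "source_control",
    "metadata.predicate_id": "relation",
    "metadata.object_id": "target_control",
}

# Illustrative rows with made-up identifiers.
entity_row = {"control_id": "EX-01", "control_name": "Example control"}
edge_row = {"mapping_id": "map-0001", "source_control": "EX-01",
            "relation": "skos:relatedMatch", "target_control": "AC-2"}

entity = bind(entity_row, ENTITY_BINDINGS)
edge = bind(edge_row, EDGE_BINDINGS)


def looks_like_edge(record: dict[str, str]) -> bool:
    """Downstream check: a record is an edge if it carries subject and object."""
    return {"metadata.subject_id", "metadata.object_id"}.issubset(record)
```

In this sketch `bind` never learns whether it produced a node or an edge; only the downstream `looks_like_edge` check does, which matches the division of labor the paragraph describes.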


Am I oversimplifying it when I’m thinking of ETL as changing the shape of the data? Data can philosophically and fundamentally be described and if we take that and then apply ETL as the solution then we should be able to define primitives and fundamental shapes/forms that we can turn into a schema and account for with either existing ETL logic or some that we build…does that make sense?

No, you’re not oversimplifying. You’re actually arriving at the same insight the formal literature arrives at, just from the practical direction instead of the theoretical one. Let me trace why your intuition is essentially correct and where the one wrinkle lives.

Your reasoning, restated precisely: if data has a finite number of fundamental structural forms, and ETL is the mapping between forms, then the set of primitive ETL operations should be derivable from the set of fundamental data forms. That’s not a simplification — that’s the core claim of the tree transducer literature, and it’s provably true.

Here’s why. When you look at every data format Crosswalker touches, they all reduce to a small number of structural forms:

  • Tables (CSV, TSV, XLSX) — a sequence of rows, each row a fixed-width tuple of scalar values. The essential shape is [{key: value, key: value, ...}, ...].
  • Trees (JSON, YAML, XML, OSCAL) — nested labeled nodes where each node has a tag, optional attributes, and ordered children. The essential shape is node(label, {attrs}, [children]).
  • Graphs (RDF, your Tier 1 vault with wikilinks) — nodes plus edges, where edges have labels. The essential shape is (node, edge-label, node) triples, or equivalently, a tree-of-adjacency-lists.
  • Flat text with conventions (Markdown + frontmatter) — which is really a tree if you parse it: YAML header (a tree) plus section hierarchy (a tree) plus inline links (edges).

That’s it. Every source and every target Crosswalker will ever encounter is one of: table, tree, graph, or a composite of these. And here’s the key realization — tables and graphs are both special cases of trees. A table is a tree of depth 2 (root → rows → cells). A graph is a tree if you pick a traversal order (which serialization always does). So the universal structural form is the labeled tree, and the universal ETL operation is the tree-to-tree transformation.
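As a rough illustration of the reduction, the sketch below encodes a table and a small graph as the same labeled-tree type. `Node`, `table_to_tree`, `graph_to_tree`, and the `"ref"` leaf convention are assumptions made for this example. Note that the graph only fits into a tree because revisited nodes are written as reference leaves.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Labeled tree node: a label, attributes, and ordered children."""
    label: str
    attrs: dict = field(default_factory=dict)
    children: list["Node"] = field(default_factory=list)


def table_to_tree(rows: list[dict]) -> Node:
    """Encode a table as root -> rows -> cells."""
    root = Node("table")
    for row in rows:
        cells = [Node("cell", {"key": k, "value": v}) for k, v in row.items()]
        root.children.append(Node("row", children=cells))
    return root


def graph_to_tree(edges: list[tuple[str, str, str]], start: str) -> Node:
    """Unfold a graph depth-first from start; revisited nodes become 'ref' leaves."""
    seen: set[str] = set()

    def visit(name: str) -> Node:
        seen.add(name)
        node = Node("node", {"id": name})
        for src, edge_label, dst in edges:
            if src != name:
                continue
            child = Node("ref", {"id": dst}) if dst in seen else visit(dst)
            node.children.append(Node(edge_label, children=[child]))
        return node

    return visit(start)


def depth(node: Node) -> int:
    """Number of edges on the longest path from node down to a leaf."""
    return 1 + max(depth(c) for c in node.children) if node.children else 0


table = table_to_tree([{"id": "AC-1", "title": "Policy and Procedures"}])
cycle = graph_to_tree([("AC-2", "related", "AC-3"), ("AC-3", "related", "AC-2")], "AC-2")
```

Here `depth(table)` is 2, matching the root, rows, cells description, and the cycle between the two controls terminates in a `ref` leaf rather than recursing forever.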

That’s exactly what Engelfriet and Vogler proved in the 1980s with macro tree transducers. They asked your exact question from the math side: “if the fundamental data shape is a labeled tree, what are the irreducible operations for transforming one labeled tree into another?” And they got a small, closed answer — five operations (construct a node, descend into a child, copy a subtree, consult accumulated context, pattern-match to discriminate). Every tree-to-tree function that terminates can be decomposed into compositions of those five.
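The five operations named above can be shown together in one small recursive function. This is a loose illustration in the spirit of a tree transducer, not a formal macro tree transducer; the tuple encoding `(label, children)`, the label prefixes, and the `transduce` function are all assumptions.

```python
def transduce(tree: tuple, path: tuple = ()) -> tuple:
    """Rewrite a (label, children) tree into a vault-shaped tree."""
    label, children = tree
    # Pattern-match on the label to pick a rule.
    if label == "catalog":
        # Construct a new node and descend into each child.
        return ("vault", [transduce(c, path) for c in children])
    if label.startswith("family:"):
        name = label.split(":", 1)[1]
        # Descend with the accumulated context extended by the family name.
        return ("folder:" + name, [transduce(c, path + (name,)) for c in children])
    if label.startswith("control:"):
        control_id = label.split(":", 1)[1]
        # Consult the accumulated context to build the note path,
        # and copy the control's subtree through unchanged.
        return ("note:" + "/".join(path + (control_id,)), [("source", children)])
    # Unknown labels are copied as they are.
    return tree


catalog = ("catalog", [
    ("family:AC", [
        ("control:AC-1", [("prose", [])]),
    ]),
])
vault = transduce(catalog)
```

The input is a catalog with one family and one control; the output is a vault with a folder and a note whose path was assembled from context gathered on the way down.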

So your intuition — “describe the shapes, then derive the transformations” — is the right mental model. The T1TMA’s six primitives (ITERATE, REFERENCE, TEMPLATE, BIND, JOIN, INVERT) are exactly those five tree-transducer operations repackaged for a per-record execution model, plus the lens inverse.

Now, the one wrinkle where “changing shape” is slightly too simple: ETL doesn’t just change shape. It also does three other things that are structurally distinct from reshaping.

It loses information. When you import OSCAL JSON into Tier 1, you discard UUIDs, parameter constraints, and structural metadata that don’t fit the target shape. That loss is irreversible unless you explicitly stash the residue somewhere. This is why the INVERT primitive exists and why round-tripping is opt-in rather than automatic — not every shape change is invertible, and the recipe must declare which parts are.
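One way to picture opt-in round-tripping is a forward function that returns the residue alongside the view, and a backward function that needs that residue to rebuild the source. The function names, `KEPT_FIELDS`, and the sample record (including its placeholder UUID) are assumptions for illustration, not how INVERT is specified.

```python
# Assumed: the only source fields that fit the target shape.
KEPT_FIELDS = ("id", "title")


def forward(source: dict) -> tuple[dict, dict]:
    """Split a source record into the target view and the residue it drops."""
    view = {k: source[k] for k in KEPT_FIELDS if k in source}
    residue = {k: v for k, v in source.items() if k not in KEPT_FIELDS}
    return view, residue


def backward(view: dict, residue: dict) -> dict:
    """Rebuild a source record from the view plus whatever residue was stashed."""
    return {**residue, **view}


source = {
    "id": "ac-1",
    "title": "Policy and Procedures",
    "uuid": "00000000-0000-0000-0000-000000000000",  # placeholder value
    "class": "SP800-53",
}
view, residue = forward(source)

with_residue = backward(view, residue)   # round-trips
without_residue = backward(view, {})     # the dropped fields are gone
```

With the residue stashed, `with_residue` equals `source`; without it, the UUID and class cannot be recovered, which is the irreversibility the paragraph describes.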

It resolves references. When a CSV row says parent_id: AC-1 and another row has id: AC-1, the ETL must join those rows to produce a wikilink. That’s not reshaping a single record — it’s resolving a cross-record pointer. This is why JOIN is a separate primitive rather than being absorbed into TEMPLATE. The shape of a single record doesn’t contain enough information; you need the shape of the collection.
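A minimal sketch of why this needs the whole collection: the set of known ids has to be built before any single row's pointer can be resolved. The `join_parents` function, the note dictionaries, and the sample rows are assumptions for illustration.

```python
def join_parents(rows: list[dict[str, str]]) -> list[dict]:
    """Resolve each row's parent_id against the ids of the whole collection."""
    # Pass 1: collect every id, so row order does not matter.
    known_ids = {row["id"] for row in rows}

    # Pass 2: turn each resolvable pointer into a wikilink.
    notes = []
    for row in rows:
        note = {"id": row["id"], "links": {}}
        parent = row.get("parent_id", "").strip()
        if parent:
            if parent not in known_ids:
                raise KeyError(f"{row['id']}: parent_id {parent!r} matches no row")
            note["links"]["parent"] = f"[[{parent}]]"
        notes.append(note)
    return notes


# Illustrative rows; the child deliberately appears before its parent.
rows = [
    {"id": "AC-1.a", "parent_id": "AC-1"},
    {"id": "AC-1", "parent_id": ""},
]
notes = join_parents(rows)
```

Because the child row comes first, a single-pass, per-record transform could not have known whether `AC-1` exists; the two-pass structure is what distinguishes this from per-record templating.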

It asserts the target contract. The output isn’t just “data in a new shape” — it’s data that must satisfy the Tier 1 contract (every note has an id, frontmatter is valid YAML, wikilinks point at real notes, folder paths are legal). This is why the closed slot vocabulary exists. A pure shape-transformation would let you produce anything; the slot vocabulary constrains the output to shapes the rest of the stack can consume.
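Three of the contract checks listed above can be sketched as a post-materialization validator. The `check_contract` function, the note dictionary shape, and the set of illegal path characters are assumptions; YAML validity is left out because checking it would need a YAML parser.

```python
import re

# Captures the target of [[target]], [[target|alias]], or [[target#heading]].
WIKILINK = re.compile(r"\[\[([^\]|#]+)")

# Assumed set of characters not allowed in a folder path.
ILLEGAL_PATH_CHARS = set('*"\\<>:|?')


def check_contract(notes: list[dict]) -> list[str]:
    """Return one message per violation; an empty list means the notes pass."""
    errors = []
    ids = {n["id"] for n in notes if n.get("id")}
    for index, note in enumerate(notes):
        name = note.get("id") or f"<note #{index}>"
        if not note.get("id"):
            errors.append(f"{name}: missing id")
        if ILLEGAL_PATH_CHARS & set(note.get("folder", "")):
            errors.append(f"{name}: illegal character in folder path")
        for target in WIKILINK.findall(note.get("body", "")):
            if target.strip() not in ids:
                errors.append(f"{name}: dangling wikilink to {target.strip()}")
    return errors


# Illustrative notes with three deliberate violations.
notes = [
    {"id": "AC-1", "folder": "NIST/AC", "body": "See [[AC-2]]."},
    {"id": "AC-2", "folder": "NIST/AC?", "body": "Parent: [[AC]]"},
    {"folder": "NIST/AC", "body": ""},
]
errors = check_contract(notes)
```

A closed slot vocabulary aims to make most of these failures unrepresentable at recipe-definition time; a validator like this only catches them after the fact.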

But these three wrinkles don’t undermine your core insight. They’re refinements of it. Loss is what happens when the target shape has fewer dimensions than the source shape. Reference resolution is what happens when the source shape has implicit pointers that the target shape makes explicit. Contract assertion is what happens when the target shape has invariants the source shape doesn’t.

So to answer your question directly: you’re not oversimplifying — you’re seeing the essential structure. The deliverable’s six primitives are exactly what falls out when you take your intuition (“data has fundamental forms; ETL is the mapping between forms”) and work it through rigorously. The forms are trees. The mappings are tree transducers. The primitives are the irreducible operations of tree transducers, projected onto a practical per-record execution model. ChunkyCSV, your earlier tool, was you discovering the table↔tree isomorphism empirically — nested JSON is a tree, tabular CSV is a depth-2 tree, and ChunkyCSV is a specific tree transducer that navigates between those two depths. T1TMA generalizes that to arbitrary source and target tree shapes.


Why preserve this dialog as a research artifact


Two architectural insights crystallized in this exchange that are not stated as cleanly anywhere else in the Ch 20 deliverables:

  1. Crosswalker is graph-aware, semantically constrained, format-diverse ETL, distinct from generic tabular ETL (dbt, dlt, Singer). Those three differences (graph-aware output, semantically constrained target, format-diverse sources) shape the primitive set.
  2. Data has fundamental structural forms; ETL is the mapping between forms; the set of primitive ETL operations is derivable from the set of fundamental forms. The forms are labeled trees. Tables and graphs are special cases. The universal operation is tree-to-tree transformation. This is the Engelfriet/Vogler MTT result reached from the practical direction.

The user arrived at insight #2 from first-principles intuition during the dialog. The agent confirmed and sharpened it against the formal literature. Both directions converge on the same primitive set — which is the strongest evidence available that the count and shape are right.

These insights inform the synthesis log more than any individual deliverable, because they tell us why we should trust the converging recommendations from Runs 1 and 3, and why ChunkyCSV (the user’s earlier tool) is the right practical-precedent touchstone.