Challenge 21: Should Crosswalker build its own import/ETL engine, adopt an existing one, or compose them? — long-term build-vs-buy with critical and adversarial thinking
Why this exists
Challenge 20 settled the question of what shape Crosswalker’s import primitive should take: three deliverables converged on a graph-aware, semantically-constrained, format-diverse ETL with ~5–6 primitives over a closed Tier-1 sink vocabulary. The synthesis log is the canonical foundation.
But Ch 20 left a much bigger meta-question implicit. The user surfaced it directly:
“Technically this is ETL and my JSONaut or ChunkyCSV tools were ways to do that. I think I may have to port that logic into this codebase and/or do more research on that.”
“Let’s maybe create a challenge for what we’re talking about here. Needs to have critical and long-term thinking, should we be building an ETL engine essentially, should we use an external existing one (doesn’t seem tenable), etc.”
This challenge addresses that meta-question. It is strictly upstream of Ch 20: not “what primitives” but “where do those primitives come from — our codebase or someone else’s?”
The framing
The user’s parenthetical (“doesn’t seem tenable”) about external engines is not a foregone conclusion. It is a hypothesis that needs adversarial pressure. There are at least four serious answers to “should Crosswalker build its own ETL engine”:
| Path | What it looks like |
|---|---|
| A. Build (full bespoke) | Crosswalker authors and maintains its own ETL engine from scratch. Ch 20 deliverables sketch this (~480 KB pure-TS). Maximum control, maximum maintenance burden. |
| B. Buy (adopt wholesale) | Crosswalker wraps an existing ETL engine — dbt, dlt, Singer/Meltano, Airbyte, Apache Hop, Kettle/PDI, RMLMapper-JS, Morph-KGC, etc. — and writes only the thin Tier-1 destination/projection layer. Minimum maintenance, maximum dependency surface. |
| C. Compose (thin layer over existing) | Crosswalker uses an existing engine for the heavy lifting (file parsing, expression evaluation, joins) and writes only a Tier-1-aware composition layer. Most concrete proposals in Ch 20 deliverables (RML+JSONata+CSVW reuse) already lean this way. |
| D. Define a protocol; let multiple engines speak it | Crosswalker defines the import-protocol surface (Layer A from Ch 20: ref/resolve/bind/seal) as an open spec. Any ETL engine — bespoke or third-party — can implement it. The user’s “external connections, not just Obsidian-internal logic” insight points at this. |
These four are not mutually exclusive at every layer. Crosswalker might Build the schema (Path A), Compose the runtime (Path C, reusing JSONata + CSVW + papaparse), and Define the protocol (Path D). The challenge is to figure out which mix is durable for 5–10 years.
This is fundamentally a build-vs-buy decision with first-principles, governance, and sustainability concerns, made harder by four facts:
- Crosswalker is a small open-source project with no business model funding maintenance (yet). Anything we build falls to the contributor pool to maintain.
- Open-source infrastructure has well-known failure modes that ETL engines share: vendor pivots (AGE/Bitnine), abandonment (CozoDB), license changes (HashiCorp BSL), and single-vendor governance with no foundation backstop.
- The user’s prior tools (ChunkyCSV, JSONaut) are concrete precedents — they exist, they work, they’re already authored. Whether to port logic from them into Crosswalker, embed them as dependencies, or treat them as separate concerns is itself part of the question.
- The Ch 20 deliverables implicitly recommend Path C (thin layer over JSONata + CSVW + papaparse + a small ~150 KB recipe runtime) but don’t justify the build-side of that mix against full Path B alternatives.
What to investigate
1. Honest taxonomy of existing ETL / data-transformation engines
For each candidate, score on dimensions that matter for Crosswalker’s durable deployment.
Note on dimension #9 (target-structure expressivity): most off-the-shelf ETL engines bake in a single target-structure assumption (tabular rows, RDF triples, key-value pairs). Crosswalker’s Ch 22 (target-structure expressivity) surfaces that real recipes need to compose folder + heading + tag + wikilink hierarchies in arbitrary mixes, with parallel polyhierarchies via tags. No off-the-shelf engine handles this. This single dimension is likely the strongest argument against Path B (buy wholesale) — and conversely, the case for Path A or Path C is that Ch 22’s expressivity must come from somewhere, and only an in-house composition layer can deliver it.
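To make the target-structure point concrete, here is a minimal sketch of what a single imported record's placement might have to carry. The `Placement` type, the `renderNote` function, and the sample values are all hypothetical illustrations, not the Ch 22 model; they only show four hierarchy axes being set independently for one record.

```typescript
// Hypothetical shape: one record placed along four independent axes.
// None of these names come from Ch 22; they only illustrate the mix.
interface Placement {
  folder: string[];    // primary hierarchy: vault folders
  headings: string[];  // hierarchy inside the note
  tags: string[];      // parallel polyhierarchies, e.g. "family/example"
  wikilinks: string[]; // graph edges to other notes
}

// Render a note path and body from a placement. A row-oriented sink has no
// natural slot for the headings, tags, or wikilinks axes.
function renderNote(title: string, p: Placement): { path: string; body: string } {
  const path = [...p.folder, `${title}.md`].join("/");
  const frontmatter = ["---", "tags:", ...p.tags.map((t) => `  - ${t}`), "---"];
  const headings = p.headings.map((h, i) => `${"#".repeat(i + 1)} ${h}`);
  const links = p.wikilinks.map((l) => `- [[${l}]]`);
  return { path, body: [...frontmatter, ...headings, ...links].join("\n") };
}

// Placeholder values only.
const note = renderNote("CTRL-1", {
  folder: ["Frameworks", "Example"],
  headings: ["Family", "Control"],
  tags: ["family/example", "baseline/moderate"],
  wikilinks: ["OTHER-7"],
});
console.log(note.path);
console.log(note.body);
```

An engine whose only output shape is a table row would need all four axes flattened into columns and then re-expanded by a custom destination, which is roughly the composition layer this challenge is asking about.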
Tabular ETL frameworks (dbt, dlt, Singer/Meltano, CloudQuery, Fivetran-OSS, Airbyte, Apache Hop, Kettle / Pentaho Data Integration):
- Can they target a file-based knowledge graph (Markdown + frontmatter + folders + wikilinks), or are they fundamentally tabular-target?
- License / governance / sustainability (foundation-backed? single-vendor? recent license changes?)
- TypeScript/JS-native or JVM/Python? (Crosswalker is Bun/TS)
- Bundle size if embedded in plugin
- Active maintenance signal (commits, releases, contributor diversity)
- Maturity of “destination-markdown” or “destination-file” or custom-destination story
Schema-mapping / KG-construction engines (RMLMapper-JS, RMLMapper-Java, Morph-KGC, OBDA tools like Ontop, R2RML processors, YARRRML preprocessors, Mastro):
- Most are RDF-targeted; can they retarget? At what cost?
- Bundle size in JS environments
- Active maintenance vs research-project status
- Standards alignment (W3C RML draft, R2RML W3C Rec)
Stream/batch processing engines (Apache Beam, Apache Spark, Apache Flink, Bun-native streams, RxJS):
- Almost certainly overkill for Crosswalker’s data volumes (Ch 18 ceiling: 100K mappings)
- Worth confirming the overkill verdict empirically, not by gut feel
Declarative JSON/data DSLs (JSONata, JQ, Jolt, JSONiq, XSLT 3.0):
- Already covered in Ch 20; confirm that the recommendation to use JSONata as the expression sub-language doesn’t bring in undue maintenance cost (the JSONata reference implementation is actively maintained, but largely by a single maintainer, Andrew Coleman; what happens if it stalls?)
Pandas-style dataframe libraries (Pandas, Polars, Arquero, Danfo.js):
- Imperative-not-declarative style
- Bundle size + JS maturity
- Are they appropriate for the recipe layer or only the implementation layer?
The user’s own tools (ChunkyCSV, JSONaut):
- What’s actually in them? (Read the code; don’t take the descriptions on faith.)
- What’s the right disposition: port logic into Crosswalker, embed as dependencies, leave separate, retire?
- Crucially: are they generic ETL primitives or specific tools that happen to share an ETL flavor?
Embedded ETL libraries (Sequelize-style ORMs as anti-pattern; lightweight TS libs like csv-parse, papaparse, exceljs, xpath, xml2js, js-yaml — i.e. file-format parsers used as building blocks):
- These are the components one would compose under Path C
- Each individually is small and well-maintained; the question is whether the composition is itself an “ETL engine” or just plumbing
2. Long-term sustainability — the AGE/Bitnine question applied to ETL
The user has watched single-vendor governance failure modes play out across the project’s research:
- Apache AGE → Bitnine pivoted to AI advertising → AGE governance reduced to maintenance mode
- Kuzu → Apple acquired and archived October 2025
- CozoDB → no release since v0.7 in 2023; maintainer attention shifted
- TerminusDB → DFRNT stewardship since 2025; small commercial sponsor; key-person risk
- HashiCorp Terraform → BSL license change → community forked OpenTofu
- MongoDB → SSPL → community moved to Postgres + alternatives
The same failure modes apply to ETL engines. Singer was created by Stitch, which Talend later acquired; Meltano spun out of GitLab. dlt is maintained by dltHub, an early-stage venture-backed company. Airbyte has raised significant VC funding. Apache Hop is a true Apache top-level project (TLP) but has a narrow contributor base. dbt-core is open source, but dbt Labs is a $4B+ company with strong commercial interests.
For each candidate engine, the analysis must answer:
- Who maintains it? Foundation, company, individual?
- What’s their incentive structure? Commercial cloud product? VC-funded growth? Volunteer?
- What’s the precedent for license/governance changes in this space?
- What’s the bus factor? How many committers in the last 12 months?
- If they pivot/license-change, what’s our exit path?
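The five questions above can be captured as one record per candidate so that answers are comparable across engines. The sketch below is an assumption about how that record might look: the `GovernanceProfile` fields, the flag wording, and the threshold of three committers are illustrative choices, and `example-engine` is a placeholder rather than a real project.

```typescript
// Hypothetical record of the five governance questions for one candidate.
type Maintainer = "foundation" | "company" | "individual";
type Incentive = "commercial-cloud" | "vc-growth" | "volunteer";

interface GovernanceProfile {
  name: string;
  maintainer: Maintainer;
  incentive: Incentive;
  committersLast12Months: number;  // bus-factor proxy
  licenseChangePrecedent: boolean; // has this vendor or niche relicensed before?
  exitPath: string | null;         // e.g. "fork"; null means none identified
}

// Collect risk flags. The thresholds are illustrative assumptions.
function riskFlags(p: GovernanceProfile): string[] {
  const flags: string[] = [];
  if (p.maintainer !== "foundation") flags.push("single-steward governance");
  if (p.committersLast12Months < 3) flags.push("low bus factor");
  if (p.incentive === "vc-growth") flags.push("pivot or relicense pressure");
  if (p.licenseChangePrecedent) flags.push("license-change precedent");
  if (p.exitPath === null) flags.push("no exit path");
  return flags;
}

// Placeholder values, not an assessment of any real engine.
const example: GovernanceProfile = {
  name: "example-engine",
  maintainer: "company",
  incentive: "vc-growth",
  committersLast12Months: 2,
  licenseChangePrecedent: false,
  exitPath: "fork",
};
console.log(riskFlags(example));
```

Keeping the output as a list of named flags rather than a single number leaves the judgment visible, which suits the conditional verdicts this challenge asks for.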
Apache Software Foundation governance (Apache Hop, Apache Beam, Apache Spark) is the structurally robust answer here, mirroring the Ch 16 conclusion that Apache Jena Fuseki is the safest 5–10 year bet for Tier 3.
3. Opportunity cost analysis
The “build” path costs developer-hours that aren’t spent on the parts of Crosswalker that are actually unique:
- STRM + SSSOM crosswalk edge semantics (this is novel; nobody else does it)
- Junction notes and the 13-field evidence-link model (novel)
- StewardshipProfile and the meta-schema lifecycle commitment (novel)
- Ontology diff primitives (novel)
- The protocol surface insight (novel)
Versus parts that are commodities in 2026:
- CSV parsing (papaparse, csv-parse, fast-csv — all mature)
- JSON path expressions (JSONata, JQ, jsonpath-plus — all mature)
- XML/XPath (xpath, @xmldom/xmldom — mature but heavy)
- XLSX parsing (exceljs, sheetjs — mature)
- Format conversion (lots of options)
- Recipe-as-data DSL evaluation (multiple options)
The strongest case for “build minimal, buy components” is precisely this opportunity-cost framing. Every hour spent reinventing CSV parsing is an hour not spent on the actually-novel STRM+SSSOM+junction-note+StewardshipProfile work. The Ch 20 deliverables already implicitly chose this framing (papaparse + exceljs + JSONata + CSVW), but didn’t justify it explicitly against full-Path-A or full-Path-B alternatives.
4. The “we already have an ETL stdlib in 2026” reality check
In 2010, building a custom ETL was the only option. In 2026, the JS/TS ecosystem ships:
- Streaming CSV/TSV at every level (papaparse, csv-parse, fast-csv)
- XLSX with merged cells and multi-sheet (exceljs, sheetjs)
- JSONata, JQ, Jolt — all available
- CSVW reference implementation in TS
- File-format parsers for OSCAL JSON, OSCAL XML
- Yjs for CRDT
- isomorphic-git for git operations
- AJV for JSON Schema validation
- Sigstore-js for attestations
- Multiple SHA-256 / Merkle libs
Composing these into a Tier-1-aware recipe runtime is the Path C scope. What’s the actual novel code required at this composition layer, and what’s its maintenance cost?
Ch 20a estimated ~150 KB of “recipe runtime” beyond the embedded peer deps. This is a manageable surface area for a small project. But the analysis needs to verify by sketching the modules:
- Recipe parser/validator (~50 KB)
- LogicalSourceReader dispatcher (~30 KB)
- TermMap/Bind evaluator (~30 KB)
- Join executor (~20 KB)
- Lens auto-inverter for round-trippable subset (~20 KB)
- ≈ 150 KB total
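The module budget above can be kept as data so the total is checked rather than asserted. This is a small sketch under assumptions: the sizes are the Ch 20a estimates as listed, while the identifier names and the `overBudget` helper are hypothetical.

```typescript
// Module budget in KB, copied from the list above. Identifier names are
// hypothetical; only the numbers come from the Ch 20a estimate.
const budgetKB = {
  recipeParserValidator: 50,
  logicalSourceReaderDispatcher: 30,
  termMapBindEvaluator: 30,
  joinExecutor: 20,
  lensAutoInverter: 20,
} as const;

type ModuleName = keyof typeof budgetKB;

// Sum of all module budgets.
const totalKB = Object.values(budgetKB).reduce<number>((sum, kb) => sum + kb, 0);

// Given measured sizes for some modules, list the ones exceeding budget.
function overBudget(measuredKB: Partial<Record<ModuleName, number>>): ModuleName[] {
  return (Object.keys(budgetKB) as ModuleName[]).filter(
    (m) => (measuredKB[m] ?? 0) > budgetKB[m],
  );
}

console.log(totalKB); // 50 + 30 + 30 + 20 + 20
console.log(overBudget({ joinExecutor: 25, lensAutoInverter: 12 }));
```

A check like this could run against real bundle measurements once modules exist, turning the 150 KB figure from an estimate into a tracked constraint.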
Is this itself an “ETL engine” or just a Tier-1-shaped composition layer? The terminology distinction matters because it changes the build-vs-buy framing:
- If we’re “building an ETL engine,” we’re competing with dbt/dlt/Singer
- If we’re “writing a Tier-1 composition layer over standard JS components,” we’re nobody’s competitor — we’re a vault-shaped data pipeline that happens to use the right primitives
The deliverables lean toward the second framing but don’t make it explicit.
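One way to make the second framing explicit is to draw the line in code: bought components sit behind small interfaces, and only the Tier-1-aware glue is ours. The sketch below assumes hypothetical `SourceReader`, `Evaluate`, and `Recipe` shapes; the naive CSV reader and field lookup are stand-ins so the example runs without dependencies, not proposals for the real parser or expression engine.

```typescript
// "Bought" surface: any parser or expression evaluator matching these shapes
// can be plugged in. Both interfaces are hypothetical.
type Row = Record<string, string>;
interface SourceReader {
  read(text: string): Row[];
}
type Evaluate = (expr: string, row: Row) => string;

// "Built" surface: the glue that picks a reader and binds each row to a
// note path. This is the part no upstream project maintains for us.
interface Recipe {
  format: string;
  titleExpr: string;
  folder: string;
}

function runRecipe(
  recipe: Recipe,
  text: string,
  readers: Record<string, SourceReader>,
  evaluate: Evaluate,
): string[] {
  const reader = readers[recipe.format];
  if (!reader) throw new Error(`no reader registered for ${recipe.format}`);
  return reader
    .read(text)
    .map((row) => `${recipe.folder}/${evaluate(recipe.titleExpr, row)}.md`);
}

// Stand-ins only: no quoting or escaping support.
const naiveCsv: SourceReader = {
  read(text) {
    const lines = text.trim().split("\n").map((line) => line.split(","));
    const header = lines[0] ?? [];
    return lines.slice(1).map((cells) => {
      const row: Row = {};
      header.forEach((name, i) => {
        row[name] = cells[i] ?? "";
      });
      return row;
    });
  },
};
const fieldLookup: Evaluate = (expr, row) => row[expr] ?? "";

const paths = runRecipe(
  { format: "csv", titleExpr: "id", folder: "Controls" },
  "id,name\nA-1,First\nA-2,Second",
  { csv: naiveCsv },
  fieldLookup,
);
console.log(paths);
```

In this shape, swapping a stalled upstream parser means replacing one `SourceReader`, which is the exit-path property the sustainability section asks about.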
5. The protocol-surface opportunity (Path D)
The user’s “external connections, doesn’t need to be logic that lives in Obsidian” framing opens a fourth path that’s not really build-vs-buy:
If Crosswalker defines the import-protocol surface clearly enough, multiple ETL engines can implement it. The protocol becomes the durable thing; the engines become commodity implementations.
This is exactly how:
- The HTTP protocol survives changes in web servers
- The MCP protocol survives changes in agent runtimes
- The PostgreSQL wire protocol survives engine swaps (CockroachDB, YugabyteDB)
- The S3 protocol has a dozen compatible implementations
If Crosswalker writes a v0.1 reference implementation in pure TS and publishes the protocol spec, and an external system can target the same protocol with a different engine (Python+dlt, Java+RMLMapper, etc.), then the build-vs-buy question dissolves. Crosswalker isn’t building an ETL engine; it’s defining a protocol and shipping one (small) reference implementation.
Investigate: how realistic is this? What does the protocol look like? What are the equivalents (in the data-engineering world) — is anyone else doing this?
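As a starting point for the "what does the protocol look like" question, here is one possible skeleton. Only the four verb names (ref, resolve, bind, seal) come from Ch 20; every signature, type, and the in-memory engine below are assumptions made for illustration, and Ch 20b remains the authority on the real semantics. The digest is a placeholder, not a real hash.

```typescript
// Hypothetical reading of the four Layer A verbs. Signatures are assumptions.
interface SourceRef { uri: string; }
interface Resolved { ref: SourceRef; records: Record<string, unknown>[]; }
interface Bound { path: string; content: string; }
interface Sealed { notes: Bound[]; digest: string; }

interface ImportProtocol {
  ref(uri: string): SourceRef;        // name a source without reading it
  resolve(ref: SourceRef): Resolved;  // fetch and parse the source
  bind(resolved: Resolved): Bound[];  // map records onto Tier-1 notes
  seal(notes: Bound[]): Sealed;       // freeze the result with a digest
}

// The driver depends only on the protocol, never on a concrete engine.
function runImport(engine: ImportProtocol, uri: string): Sealed {
  return engine.seal(engine.bind(engine.resolve(engine.ref(uri))));
}

// Toy engine with canned records, standing in for a reference implementation.
const inMemoryEngine: ImportProtocol = {
  ref: (uri) => ({ uri }),
  resolve: (ref) => ({ ref, records: [{ id: "A-1" }, { id: "A-2" }] }),
  bind: (resolved) =>
    resolved.records.map((r) => {
      const id = String(r["id"]);
      return { path: `${id}.md`, content: `# ${id}` };
    }),
  // Placeholder digest: total content length. A real seal would use a hash.
  seal: (notes) => ({
    notes,
    digest: String(notes.reduce((n, b) => n + b.content.length, 0)),
  }),
};

const sealed = runImport(inMemoryEngine, "memory://example");
console.log(sealed.notes.map((n) => n.path), sealed.digest);
```

If a second engine (for example, one that shells out to an external tool) can satisfy `ImportProtocol` without changes to `runImport`, that is some evidence Path D is realistic; if it cannot, the mismatches are the findings.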
6. Adversarial: what if Path A (full build) is correct after all?
The deliverables and the synthesis log all lean Path C, but be honest about the case for Path A:
- Path A gives the smallest dependency surface (most things stable forever)
- Path A means Crosswalker controls every behavior (no surprise version bumps from upstream)
- Path A gives the cleanest Tier-1-specific optimizations (no impedance mismatch with general-purpose engine assumptions)
- Path A’s “we authored ~480 KB of TS” is comparable to many active OSS projects
What’s the strongest argument for Path A winning over Path C? Run that argument adversarially. Don’t dismiss it.
7. Adversarial: what if Path B (full buy) is correct?
Equally honestly: what’s the strongest case for adopting an existing ETL framework wholesale (say, dlt) and writing only the destination layer that emits Tier-1 markdown?
- dlt has a maintained Python ecosystem, schema inference, schema evolution, incremental loads, sources for hundreds of formats
- dlt can be invoked as a CLI from any language (TypeScript / Bun could shell out to dlt for the heavy lifting)
- The destination layer (Python writes JSON; TS converts JSON to Markdown) is small
- dlt’s audit / observability story is mature
What’s the case for this, and what specifically makes it “not tenable” (the user’s framing)? The “not tenable” claim deserves explicit defense.
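To judge how small the destination layer really is, it helps to sketch the TypeScript half of it. The example below assumes the upstream engine has already written newline-delimited JSON with flat scalar fields; that handoff format, the `toMarkdown` and `convertJsonLines` names, and the frontmatter layout are all assumptions, and nothing here invokes or depends on dlt itself.

```typescript
// Assumed handoff: flat records with scalar values, one JSON object per line.
type JsonRecord = Record<string, string | number | boolean>;

// Emit one Markdown note: every field becomes frontmatter, one field the title.
// JSON.stringify is used for values because JSON scalars are valid YAML.
function toMarkdown(record: JsonRecord, titleField: string): string {
  const frontmatter = Object.entries(record).map(
    ([key, value]) => `${key}: ${JSON.stringify(value)}`,
  );
  const title = String(record[titleField] ?? "Untitled");
  return ["---", ...frontmatter, "---", "", `# ${title}`, ""].join("\n");
}

// Convert a newline-delimited JSON string into Markdown note bodies.
function convertJsonLines(jsonl: string, titleField: string): string[] {
  return jsonl
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => toMarkdown(JSON.parse(line) as JsonRecord, titleField));
}

const notes = convertJsonLines('{"id":"A-1","weight":2}\n{"id":"A-2","weight":3}\n', "id");
console.log(notes[0]);
```

This half is indeed small. What it leaves out is where the "not tenable" argument has to be made: folder placement, wikilinks, tags, and the cost of requiring a Python runtime alongside an Obsidian plugin.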
8. The maintenance triangle
Every architectural choice on this question puts Crosswalker somewhere on a maintenance triangle:
Each corner has different failure modes:
- Ours-to-fix: every bug is ours; no upstream lifeline; but no surprise breakage either
- Theirs-to-fix: dependency on upstream maintainer goodwill; their bug is our bug
- Ours-to-glue: small bespoke surface, large composed dep tree; fragility at integration points
- Their-protocol-but-anyone-implements: hardest to bootstrap (need the protocol AND a reference impl); most durable long-term
For each path, identify the realistic 3-year and 10-year failure modes. Crosswalker has stated a “DURABLE / built from the ground up to solve problems” goal; “durable” means the architecture survives upstream pivots.
Success criteria for the deliverable
The deliverable should produce:
- Honest taxonomy with scored matrix. Every candidate engine evaluated on: license / governance / bundle size / TypeScript-native / target-format compatibility / maintenance signal / opportunity cost / 5-year-survival probability.
- Explicit verdict on each of the four paths. Not “Path A wins” but “Path A wins for X under conditions Y and Z; Path B wins for X under different conditions; etc.” Conditional recommendations are honest; unconditional ones are usually wrong.
- Recommended path with justification at the maintenance-triangle level. Where on the triangle is Crosswalker, and why is that the right corner for this project’s resource and timeline constraints?
- Concrete migration / start path. If the recommendation is Path C, which components are bought (papaparse, JSONata, …) and which are built? Sketch the buy-vs-build decisions per module.
- Disposition for ChunkyCSV + JSONaut. Port logic in? Embed as deps? Leave separate? Retire? Justify against the build-vs-buy frame.
- The protocol-surface answer. Is Path D realistic for v0.1 / v1.0 / v2.0? If realistic, what’s the protocol skeleton?
- Adversarial honesty section. What’s the strongest case for the opposite of the recommended path? Address it directly.
- 5-year survival projection. Under what specific upstream-failure scenarios does the recommendation become non-viable? What’s the exit path in each scenario?
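The scored matrix in the first criterion could be represented along these lines. The eight dimension names follow the list above; the 0 to 5 scale, the weighting scheme, and all sample numbers are assumptions for illustration and do not score any real engine.

```typescript
// The eight dimensions named in the success criteria.
const dimensions = [
  "license",
  "governance",
  "bundleSize",
  "typescriptNative",
  "targetFormat",
  "maintenanceSignal",
  "opportunityCost",
  "fiveYearSurvival",
] as const;
type Dimension = (typeof dimensions)[number];

// Scores on an assumed 0 (worst) to 5 (best) scale; weights are relative.
type PerDimension = Record<Dimension, number>;

// Weighted mean across all dimensions; returns 0 if all weights are zero.
function weightedScore(scores: PerDimension, weights: PerDimension): number {
  let total = 0;
  let weightSum = 0;
  for (const d of dimensions) {
    total += scores[d] * weights[d];
    weightSum += weights[d];
  }
  return weightSum === 0 ? 0 : total / weightSum;
}

// Helper: the same value for every dimension.
function uniform(value: number): PerDimension {
  const out = {} as PerDimension;
  for (const d of dimensions) out[d] = value;
  return out;
}

// Placeholder candidate: middling everywhere, no target-format fit.
const placeholderScores: PerDimension = { ...uniform(2), targetFormat: 0 };
// Weighting target-format fit three times as heavily as the rest.
const weights: PerDimension = { ...uniform(1), targetFormat: 3 };
console.log(weightedScore(placeholderScores, weights));
```

Publishing the weights next to the scores keeps the verdicts conditional: a reader who disagrees with the emphasis on target-format fit can rerun the matrix with different weights instead of disputing the conclusion wholesale.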
What this does NOT need to answer
- Implementation specifics (data structures, performance, library choices): those are downstream of the path decision.
- Re-evaluation of the Ch 20 transformation-algebra primitives (Source/Term/Map/Join/Function over the closed sink vocabulary) — those are settled at the shape level; this challenge is about the implementation strategy.
- Path A vs Path B for v0.1 schema authoring (that’s the deferred wargaming question from the Ch 20 synthesis log) — this challenge is at a different layer.
Out of scope
- The boundary semantics layer (ref/resolve/bind/seal from Ch 20b) is its own architectural concern and may need a separate brief later. This challenge focuses on the transformation-algebra layer’s build-vs-buy.
- Tier 2 / Tier 3 storage engines: those were settled by Ch 11/14/16.
- The Datalog tier (Nemo) — settled by Ch 12.
Relationship to prior challenges
- Strictly upstream of Ch 20. Ch 20 picked the primitives; this picks who provides them.
- Compositionally analogous to Ch 11 / Ch 14 / Ch 16 (engine surveys for Tier 2/3). Same governance/sustainability/build-vs-buy lens, applied to a different layer. The lessons learned (foundation governance > single-vendor; “small enough to fork” property) should apply directly.
- Validates / pressures Ch 19’s over-engineering frame. If Path A or full-Path-C is genuinely warranted, Ch 19’s “radically simplify” verdict has a limit; if Path B or Path D is warranted, Ch 19’s logic extends naturally to the import side.
Related
- Ch 20 synthesis log: the primitive set this challenge picks an implementation for
- Ch 20a deliverable: T1TMA — implicit Path C recommendation; verify its reasoning
- Ch 20b deliverable: Boundary semantics — Path D protocol-surface insight
- Ch 20c deliverable: 5+4 primitive set — convergent on Path C
- Ch 19 deliverable: Over-engineering stress test — adversarial framing this challenge inherits
- v0.1 stack-pivot log — the “DURABLE / built from the ground up” directive comes from here
- What makes Crosswalker unique — opportunity-cost framing depends on what’s actually unique
- User’s prior tools: ChunkyCSV + JSONaut — practical-precedent touchstones
- Ch 11 / 14 / 16 archives — governance-and-sustainability lens precedents