Challenge 26: Transform engine depth + GUI line + input format roster

During v0.1.5 planning (right after the Ch 25 two-mode architecture decision was settled), three interlocking questions surfaced that the two-mode architecture didn’t resolve:

  1. How deep does the in-plugin transform engine go? The bundled engine ships with a closed 7-filter set (lower/upper/title/slug/tagsafe/fs-safe/truncate). What about everything else — column rename, regex extract, conditional logic, joins, aggregations, lookup tables, fuzzy matching, flat-to-tree recovery? In-plugin or external?

  2. Should we port JSONaut and/or ChunkyCSV? The user authored both tools specifically to handle CSV/JSON ETL at scale. Porting their useful parts into the plugin gives in-plugin power. Or do existing TS-based libraries cover this?

  3. Is there a “midway” input format? Between “raw user CSV” and “fully cleaned input ready for the bundled engine” — could JSONL or JSON-with-iterator-path be that midway, making external ETL composition easier?

The user surfaced #3 explicitly: “Is there not like a mid-way — like if you put it in JSON and specify entry points or whether it’s JSONL or whatever, then MODE 1 can process easier? Idk. Might be a dumb idea and not make sense.”

The instinct was right. JSONL is the sweet spot.

| Option | What | Effort | Risk |
|---|---|---|---|
| A: Port JSONaut/ChunkyCSV core into TS | Reimplement the useful parts as TS modules in the plugin | LARGE (rewriting two non-trivial tools) | Maintenance debt; their feature surface evolves outside Crosswalker; can become bigger than Crosswalker |
| B: Use existing TS libraries | JSONata for JSON transforms (already committed in Ch 23 §6); PapaParse + custom column ops for CSV | MEDIUM | No full coverage of “messy → clean” automation; user still escapes to external tools sometimes |
| C: Keep Mode 1 narrow | Bundled engine accepts already-clean input; in-plugin transforms stay closed (7 filters); users use ChunkyCSV/JSONaut/dbt/Polars when they need real transforms | SMALL (already shipped) | Users without external tooling have a worse experience for messy sources |
| D: Hybrid (Ch 23 §6 commit) | Closed primitives + JSONata 2.x as expression sub-language + import wizard provides GUI for common transforms; users escape to external tools for messy sources | MEDIUM (wires JSONata, designs UI for common cases) | Right-size; matches Ch 23 commitment |
| E: Full transform IDE | UI for transforms, debugging, profiling, output preview, joins UI, conditional logic UI | HUGE | Could exceed Crosswalker itself in scope; drift into a different product |

| Format | Already? | Stream? | Recommended phase |
|---|---|---|---|
| CSV | ✅ v0.1 | | Already shipped |
| JSONL | | ✅ Trivial | v0.2: the midway the user identified |
| JSON with iterator path | | ✅ via stream-json | v0.2: for OSCAL bundles + nested ontologies |
| Plain JSON array | | | v0.2 (degenerate iterator path) |
| XLSX | ⚠ Partial | ⚠ Sheet-by-sheet | v0.3+ completion |
| XML / RDF / OWL | | ⚠ sax | v0.3+ if user demand justifies |
| YAML / TOML / Parquet / Arrow | | varies | Not committed |

| Phase | What the wizard offers | What stays in recipe / external |
|---|---|---|
| v0.1 (shipped) | Column-role config; closed 7-filter templates | Everything else |
| v0.2 | + Column rename; + value trim; + regex extract; + simple split | Conditional logic, joins, lookups, flat-to-tree, fuzzy matching → JSONata or external tools |
| v0.3 | + JSONata expression cells in wizard for advanced users | Multi-source, time-series, complex pipelines → external tools |
| v1.0+ (likely never) | Full transform IDE | |
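
The v0.2 wizard scope listed above (column rename, value trim, regex extract, simple split) could be modeled as a small closed set of row operations. The sketch below is illustrative only: the `WizardOp` type, its field names, `applyOp`, and `applyOps` are assumptions made for this example, not Crosswalker's actual recipe schema.

```typescript
// Hypothetical model of the v0.2 wizard transforms. None of these names come
// from the Crosswalker codebase; they only illustrate how small the scope is.
type Row = Record<string, string>;

type WizardOp =
    | { kind: 'rename'; from: string; to: string }
    | { kind: 'trim'; column: string }
    | { kind: 'regexExtract'; column: string; pattern: RegExp; into: string }
    | { kind: 'split'; column: string; separator: string; into: string[] };

function applyOp(row: Row, op: WizardOp): Row {
    const out: Row = { ...row }; // never mutate the input row
    switch (op.kind) {
        case 'rename':
            if (op.from !== op.to && op.from in out) {
                out[op.to] = out[op.from];
                delete out[op.from];
            }
            return out;
        case 'trim':
            if (op.column in out) out[op.column] = out[op.column].trim();
            return out;
        case 'regexExtract': {
            // Assumes a pattern without the `g` flag. Uses the first capture
            // group if there is one, otherwise the whole match.
            const m = (out[op.column] ?? '').match(op.pattern);
            out[op.into] = m ? (m[1] ?? m[0]) : '';
            return out;
        }
        case 'split': {
            // Extra parts are dropped; missing parts become empty strings.
            const parts = (out[op.column] ?? '').split(op.separator);
            op.into.forEach((name, i) => { out[name] = (parts[i] ?? '').trim(); });
            return out;
        }
    }
}

// Ops run in order, each one seeing the output of the previous one.
function applyOps(row: Row, ops: WizardOp[]): Row {
    return ops.reduce(applyOp, row);
}
```

Anything that does not fit this shape (conditionals, joins, lookups) is what the table routes to JSONata or external tools.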

What to investigate (preserved for posterity)

1. JSONata’s coverage of JSONaut features

Comparing JSONata 2.x’s expression vocabulary to what JSONaut typically does:

| JSONaut pattern | JSONata equivalent | Coverage |
|---|---|---|
| Field projection (`{a, b, c}`) | `{ "a": $.a, "b": $.b, "c": $.c }` | ✅ Native |
| Conditional projection | `(condition) ? value1 : value2` | ✅ Native |
| String manipulation | `$uppercase($)`, `$substring($, 0, 10)`, `$replace(...)` | ✅ Native |
| Array operations | `$map($, fn)`, `$filter($, fn)`, `$reduce(...)` | ✅ Native |
| Aggregations | `$sum($)`, `$count($)`, `$average($)` | ✅ Native |
| Joins (lookup table) | `$lookup(table, key)`, with the table built via the `{key: value}` grouping syntax | ✅ Native |
| Regex | `$match(str, /pattern/)`, `$replace(str, /pattern/, replacement)` | ✅ Native |
| Date arithmetic | `$now()`, `$millis()`, `$fromMillis()` | ✅ Native |
| Custom functions | Function definitions inline | ✅ Native |

Coverage estimate: ~80% of typical JSONaut workflows. The 20% gap (proprietary JSONaut patterns, specific UI workflows) is recoverable via either: (a) authoring JSONata that does the same job, or (b) using JSONaut externally and producing JSONL for the bundled engine.

ChunkyCSV’s value is largely in streaming + cleanup at scale, not declarative transformation. JSONata doesn’t help with this. The right answer for ChunkyCSV’s use cases:

| ChunkyCSV feature | In-plugin alternative | Verdict |
|---|---|---|
| Streaming CSV read of multi-GB files | PapaParse streaming (v0.1.4.5) | ✅ Covered for the read; not the cleanup |
| Column rename / select / drop | v0.2 wizard | ⚠ Partial (column rename + select; drop covered by skip-column UI) |
| Flat-to-tree recovery (parent_id columns → tree) | JSONata can do it for moderate cases; multi-step requires recipe author skill | ⚠ Manageable but not GUI-friendly |
| Joins with lookup tables | JSONata supports lookups for moderate cases; multi-table joins → external | ⚠ Manageable for one lookup |
| Regex column synthesis | v0.2 wizard regex extract; JSONata regex | ✅ Covered |
| Fuzzy matching | None (out of scope) | ❌ Stays external |

Verdict: ChunkyCSV’s value is streaming + multi-GB cleanup + complex joins + fuzzy matching. The first one is in v0.1.4.5; the rest stay external. Users with ChunkyCSV workflows produce a cleaned CSV/JSONL and feed Mode 1.
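
For readers unfamiliar with the term, flat-to-tree recovery means rebuilding a hierarchy from rows that only carry a parent reference. Below is a minimal in-memory sketch of the idea. It assumes hypothetical `id` and `parent_id` column names, unique ids, and a table small enough to hold in memory, which is exactly the assumption that breaks at multi-GB scale. It is not how ChunkyCSV or Crosswalker implement it.

```typescript
// Hypothetical row and node shapes, chosen for illustration only.
interface FlatRow { id: string; parent_id: string | null; title: string }
interface TreeNode { id: string; title: string; children: TreeNode[] }

function flatToTree(rows: FlatRow[]): TreeNode[] {
    // Pass 1: create one node per row, indexed by id (ids assumed unique).
    const nodes = new Map<string, TreeNode>();
    for (const r of rows) {
        nodes.set(r.id, { id: r.id, title: r.title, children: [] });
    }
    // Pass 2: attach each node to its parent. Two passes mean that child rows
    // may appear before their parent rows in the input.
    const roots: TreeNode[] = [];
    for (const r of rows) {
        const node = nodes.get(r.id)!;
        const parent = r.parent_id ? nodes.get(r.parent_id) : undefined;
        if (parent) parent.children.push(node);
        else roots.push(node); // no parent, or the parent is missing from the data
    }
    // Note: rows that form a parent cycle never reach `roots`; a real
    // implementation would need to detect and report them.
    return roots;
}
```

The two-pass structure is the reason this is awkward in a purely row-at-a-time streaming engine: the second pass needs every node from the first.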

What makes JSONL the right v0.2 midway:

  • One JSON object per line — newline-delimited
  • Streamable trivially — read line-by-line, JSON.parse each
  • Native types — numbers, booleans, nulls, nested objects all work
  • Self-describing — keys repeat per record (cost) but no separate header (benefit)
  • ETL ecosystem alignment — BigQuery, Spark, Databricks, dbt, JSONaut all emit JSONL natively
  • Trivial parser implementation — ~50 LOC for the streaming parser

Sketch:

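// Streams a JSONL file and yields one parsed record per non-empty line.
async function* parseJSONLStream(file: File): AsyncIterable<Record<string, any>> {
    const reader = file.stream().getReader();
    const decoder = new TextDecoder('utf-8');
    let buffer = '';
    try {
        while (true) {
            const { done, value } = await reader.read();
            if (done) break;
            // stream: true keeps multi-byte characters that straddle chunks intact
            buffer += decoder.decode(value, { stream: true });
            let newlineIdx: number;
            while ((newlineIdx = buffer.indexOf('\n')) !== -1) {
                const line = buffer.slice(0, newlineIdx).trim(); // trim() also drops '\r'
                buffer = buffer.slice(newlineIdx + 1);
                if (!line) continue; // skip blank lines
                yield JSON.parse(line);
            }
        }
        buffer += decoder.decode(); // flush bytes still held by the decoder
        // The final record may not end with a newline.
        if (buffer.trim()) yield JSON.parse(buffer.trim());
    } finally {
        reader.releaseLock();
    }
}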

Engine doesn’t change. Just consume the AsyncIterable.
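
The claim that the engine only needs an `AsyncIterable` can be sketched as follows. `Row`, `jsonlRows`, and `collectIds` are hypothetical names for illustration. This variant takes already-decoded text chunks so it runs without a `File`, and it adds line numbers to parse errors, which is an addition for the example rather than part of the design above.

```typescript
type Row = Record<string, unknown>;

// Splits already-decoded text chunks into JSONL records. Chunk boundaries
// may fall anywhere, including in the middle of a line.
async function* jsonlRows(
    chunks: AsyncIterable<string> | Iterable<string>,
): AsyncGenerator<Row> {
    let buffer = '';
    let lineNo = 0;
    const parse = (line: string): Row => {
        try {
            return JSON.parse(line) as Row;
        } catch (err) {
            throw new Error(`Invalid JSON on line ${lineNo}: ${(err as Error).message}`);
        }
    };
    for await (const chunk of chunks) {
        buffer += chunk;
        let idx: number;
        while ((idx = buffer.indexOf('\n')) !== -1) {
            const line = buffer.slice(0, idx).trim();
            buffer = buffer.slice(idx + 1);
            lineNo++;
            if (line) yield parse(line); // blank lines are skipped
        }
    }
    lineNo++;
    if (buffer.trim()) yield parse(buffer.trim()); // last line without '\n'
}

// Stand-in for the engine: it depends only on AsyncIterable<Row>, so a CSV
// source and a JSONL source are interchangeable from its point of view.
async function collectIds(rows: AsyncIterable<Row>): Promise<string[]> {
    const ids: string[] = [];
    for await (const row of rows) ids.push(String(row.id));
    return ids;
}
```

A caller would write `await collectIds(jsonlRows(chunks))`; swapping in a different parser changes only the argument, not the consumer.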

For OSCAL bundles and deeply-nested ontologies:

{
  "catalog": {
    "uuid": "...",
    "metadata": {...},
    "groups": [
      {
        "id": "AC",
        "controls": [
          { "id": "AC-1", "title": "...", "params": [...] },
          { "id": "AC-2", "title": "...", "params": [...] }
        ]
      }
    ]
  }
}

Recipe declares: source.iterator: $.catalog.groups[*].controls[*]. Engine uses stream-json to navigate the path lazily and yield each control as a row.
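
To make the iterator semantics concrete, here is a minimal in-memory sketch of resolving a path such as `$.catalog.groups[*].controls[*]`. It is not the streaming implementation: it assumes the document is already parsed, it supports only `$`, dotted keys, and `[*]` (a deliberately tiny subset chosen for this example), and it uses no stream-json API. `iterate` and `walk` are hypothetical names.

```typescript
// Yields every value matched by a restricted path: `$`, `.key`, and `[*]` only.
function* iterate(doc: unknown, path: string): Generator<unknown> {
    if (!path.startsWith('$')) throw new Error(`Path must start with "$": ${path}`);
    // "$.catalog.groups[*]" tokenizes to [".catalog", ".groups", "[*]"].
    const rest = path.slice(1);
    const tokens: string[] = rest.match(/\.[^.\[\]]+|\[\*\]/g) ?? [];
    // If the tokens do not reassemble to the input, the path used syntax
    // outside the supported subset (filters, indexes, recursive descent, ...).
    if (tokens.join('') !== rest) throw new Error(`Unsupported path syntax: ${path}`);
    yield* walk(doc, tokens, 0);
}

function* walk(value: unknown, tokens: string[], i: number): Generator<unknown> {
    if (i === tokens.length) {
        yield value; // path fully consumed: this value is one row
        return;
    }
    const token = tokens[i];
    if (token === '[*]') {
        if (Array.isArray(value)) {
            for (const item of value) yield* walk(item, tokens, i + 1);
        }
        return; // a non-array under [*] matches nothing
    }
    const key = token.slice(1); // drop the leading "."
    if (value !== null && typeof value === 'object' && key in value) {
        yield* walk((value as Record<string, unknown>)[key], tokens, i + 1);
    }
}
```

Under this reading, a plain JSON array is the degenerate case `$[*]`, and rows come out in document order, group by group.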

Multiple iterators in one recipe — open question: do we support iterator: { catalogs: $.catalogs[*], controls: $.catalog.groups[*].controls[*] } (named iterators)? Or is one recipe = one iterator? Defer to v0.2 implementation.

5. The GUI depth line — non-coder vs power-user

Users break into roughly two cohorts:

Non-coder GRC team (the v0.2 wizard target):

  • Has a CSV from NIST or a vendor
  • Wants column rename, value trim, maybe regex extract
  • Doesn’t want to learn JSONata

Recipe-authoring power user (the v0.3 JSONata target):

  • Has complex transforms
  • Comfortable editing recipe JSON
  • Will learn JSONata (which is simpler than jq, simpler than dbt SQL)

A v1.0 transform IDE would try to serve both via a sophisticated UI. That’s the wrong call — power users prefer text-editing the recipe; non-coders are served by the v0.2 wizard. Building a UI that bridges both well is huge scope.

The transform IDE features that would be useful:

| Feature | v0.2 / v0.3 alternative |
|---|---|
| Live preview of transform output | Wizard’s existing Step 3 preview (shows first row’s rendered Tier 1 output) |
| Multi-step pipeline editor | Recipe is itself a multi-step declarative pipeline; edit JSON directly |
| Debugging | JSONata has a public playground for testing expressions |
| Profiling | Generation engine reports per-row errors with context |
| Joins UI | JSONata `$lookup(table, key)` function; for complex joins, external tools |
| Conditional logic UI | JSONata ternary `(cond) ? x : y` |
| Output diff | Git diff on Tier 1 vault before/after recipe edit |

Each of these has a v0.2 / v0.3 / external alternative that’s adequate. The transform IDE bundles them into one UI — convenient, but redundant given the alternatives.

The user articulated the question clearly + had the right instinct (“midway via JSONL is real; full transform IDE is too far”). The architectural pieces (Ch 23 §6 JSONata commit, two-mode architecture, schema-as-primitive) already constrained the answer. JSONL is a quick win that fits the schema-as-primitive philosophy (one more parser; engine doesn’t change). The “stop at v0.3” decision aligns with all prior commitments.

Resolution captured in the synthesis log + operationalized via:

  • v0.2 input-format milestone (to be filed) — JSONL + JSON-with-iterator-path
  • v0.2 wizard transform milestone (to be filed) — basic transforms in GUI
  • v0.3 JSONata wiring milestone (to be filed) — expression sub-language