Ch 31 deliverable A: JSON Schema design — data-only `query:` block with shape-dispatched primitives
Challenge 31 — JSON Schema Design for the Additive query: Block in Crosswalker
Section titled “Challenge 31 — JSON Schema Design for the Additive query: Block in Crosswalker”- Adopt a data-only
query:block modeled after the dbt/MetricFlow/LookML/Cube semantic-layer pattern (shape+primitives+output+version), validated by JSON Schema 2020-12 with ashape-discriminatedoneOf. Recipes declare what to query, never raw SQL or Bases formula strings — the engine compiles to Bases, codeblock, or future SQL/SPARQL backends. - Make it fully additive:
query:is optional;body:andquery:may coexist (one emits files, the other declares views); the loader uses an open content model (additionalProperties: trueat section roots, butfalseinside primitive sub-blocks) so the v0.1.7 codeblock processor can introduce new fields without breaking validation. Recipes carryquery.version(SchemaVer-styleMODEL.REVISION.ADDITION) decoupled from recipe SemVer. - Reuse Ch 28’s lifecycle settled items unchanged: the
query:block is a single first-class node forprovenance.source,user_edited:true, and three-way merge — diff at the YAML node granularity (shape, primitives sub-keys, output, view-options), not the whole recipe. Anti-patterns (raw SQL, custom DSL, pivot-only schema, sqlite-wasm coupling) are explicitly rejected by construction.
Key Findings
Section titled “Key Findings”-
Every adjacent system that has lasted converges on the same shape: a typed semantic model (entities/joins) + named primitives (measures, dimensions, filters) + presentation hints — never inline SQL in user-facing config. dbt/MetricFlow, LookML, and Cube all separate the what (declarative measures + dimensions + entities) from the how (engine-generated SQL). Datasette is the outlier: canned queries are raw SQL strings, and the Datasette maintainers themselves have publicly regretted putting them in
metadata.yaml, splitting them out intodatasette.yaml(PR #2191). That history is a direct argument against the code-with-fences path. -
SPARQL
CONSTRUCTis the closest analogue to a Crosswalker query block: aWHERE(graph pattern, i.e. traverse + filter + join) plus a template (project + shape) that emits a new graph. This is exactly the Crosswalker primitive set (traverse, filter, aggregate, group, sort, project, join) — except SPARQL’s “shape” is always RDF triples, while Crosswalker needs six output shapes (table, list, pivot, graph, hierarchy, timeline). A SPARQL-styleWHERE+ Cypher-styleMATCHclause maps cleanly to a YAMLprimitives:object. -
JSON Schema 2020-12 supports the design directly via
oneOf+constdiscriminators on ashapefield — the Ajv-validated, OpenAPI-style discriminator pattern. This is how the schema enforces “pivot has rows/cols/cell, graph has nodes/edges/start, hierarchy has root/predicate/depth” without inventing a custom validation language and without forcing every shape into a single union object. -
The Bases query layer is itself just one backend among several the recipe must abstract over. Bases’ native YAML supports
views:,filters:,formulas:,summaries:,groupBy, andorder— but only in.basefiles, only for table/cards/list/map (no first-class pivot/graph/hierarchy/timeline). Thequery:block must compile down to Bases when possible and up to a richer codeblock processor (crosswalkerPivot) when Bases cannot express the shape. That requires the recipe to be data-only. -
Forward compatibility is solved by JSON Schema’s permissive default plus an explicit
query.versionfield. JSON Schema 2020-12 permits unknown properties unlessadditionalProperties: falseis stated; we deliberately leave the top of thequery:block open (unevaluatedProperties: true) so v0.1.7+ can addcache:,params:,auth:etc., while the primitive-level sub-objects useadditionalProperties: falsefor tight authoring feedback. The pattern matches how Kubernetes usesapiVersionand how Snowplow’s SchemaVer (MODEL.REVISION.ADDITION) versions data structures distinctly from software SemVer. -
Lifecycle integration is a non-event: the same
provenance.source ∈ {system, user, community}anduser_edited: trueflags that already coverbody:,frontmatter:,filename:extend identically toquery:. Three-way merge is performed at the YAML-node level (the same level Ch 28 already chose), with one important refinement: theprimitives:object is treated as a structural map (merge by key) rather than a scalar (overwrite-or-conflict). -
All seven primitives compose cleanly across all six shapes, but not every shape requires every primitive. The mapping below is what the schema’s per-shape
requiredarrays enforce. Aggregation is the only operator family that needs anextensible: trueregister (current set:count, count_distinct, sum, avg, min, max, density; future plugins register more via the codeblock processor).
Details
Section titled “Details”1. Cross-Reference Matrix — How Adjacent Ecosystems Declare Queries
Section titled “1. Cross-Reference Matrix — How Adjacent Ecosystems Declare Queries”| System | Declaration unit | Primitives exposed | Aggregation | Shape/presentation boundary | Versioning | Code-or-data |
|---|---|---|---|---|---|---|
| SPARQL CONSTRUCT (W3C 1.1/1.2) | CONSTRUCT { template } WHERE { pattern } | triple patterns, FILTER, OPTIONAL, UNION, property paths, BIND, GROUP BY, HAVING, ORDER BY, sub-SELECT | COUNT, SUM, AVG, MIN, MAX, GROUP_CONCAT, SAMPLE | Output is always an RDF graph; presentation is downstream | SPARQL 1.0 → 1.1 → 1.2 (forward-compat extensions, e.g. quad CONSTRUCT) | Code (text DSL) |
| GraphQL (spec.graphql.org) | Typed Query root + selection sets with field-level args | Field selection (project), arguments (filter), fragments, aliases, variables, directives (@skip/@include) | Server-defined per field — no first-class GROUP BY | Schema is the contract; presentation is client-side | SDL evolution rules + @deprecated directive | Code (text DSL) |
| dbt model YAML + MetricFlow | models:, semantic_models: (entities/dimensions/measures), metrics:, saved_queries: | entities (joins), dimensions (group/filter), measures (aggregate), metrics (compose), filters | sum, avg, count, count_distinct, min, max, median, percentile (agg: field) | Metric definitions are pure YAML; SQL is generated by MetricFlow | dbt SemVer + version: 2 schema header | Data (YAML) |
| Looker LookML | view, explore, dimension, measure, filter, join | dimensions (group/filter), measures (type: count/sum/avg/...), explore (join graph), filters | count, count_distinct, sum, avg, min, max, median, percentile, list | Strict: views/explores describe data; visualization is in Look/dashboard layer | LookML versioned via Looker release; project Git history | Data (LookML DSL — but config-shaped, not SQL) |
| Cube.dev | cubes: with dimensions, measures, joins, pre_aggregations, views, access_policies | dimensions, measures, joins (many_to_one/one_to_many/one_to_one), segments, hierarchies | count, count_distinct, count_distinct_approx, sum, avg, min, max, number (multi-stage), time_shift | Views are explicit “facade” objects; rendering is in BI tool | Code-first (YAML/JS/Python) under git; Cube versions deployment | Data (YAML/JS) |
Datasette metadata.yaml | databases.<db>.queries.<name> with raw sql: + named/magic params | Whatever SQL allows | SQL aggregates | Presentation via plugins/templates; canned query is the unit | Datasette release versions; no schema version inside file | Code (raw SQL embedded) — now being moved out of metadata.yaml into datasette.yaml in PR #2191, signalling regret |
| ROBOT (OBO) | CLI invocation (robot query --query file.sparql out.csv) + ODK-config YAML driving it | SPARQL ASK/SELECT/CONSTRUCT/UPDATE; verify for validation; report w/ profiles of standard SPARQL checks | SPARQL aggregates | CSV/TSV/RDF output → tooling decides presentation | OBO ontology release versioning; ROBOT version-stamped | Code (SPARQL files referenced) but driven by data (YAML config) |
Synthesis. Five out of six successful ecosystems put the query in declarative YAML/DSL and let the engine generate SQL/SPARQL. Datasette is the cautionary tale — its raw-SQL canned-queries-in-metadata pattern caused enough confusion that the project is actively splitting them out. Crosswalker should follow dbt/LookML/Cube/MetricFlow.
2. JSON Schema (Draft 2020-12) for the Additive query: Block
Section titled “2. JSON Schema (Draft 2020-12) for the Additive query: Block”Below is the complete schema fragment that drops into spec/recipe.schema.json under properties.query. It validates with Ajv 8+ when run with strictTypes: true, allErrors: true, draft: '2020-12'.
Design notes that justify the shape:
oneOf+constdiscriminator onshapefollows the AJV-validated polymorphism pattern (per Ajv 8 docs and the JSON Schema draft-2020-12 propertyDependencies / discriminator discussion). It makes the per-shaperequired:arrays strictly enforceable.additionalProperties: falseis on every primitive sub-block (so authors get loud errors when they typoclosvscols), butunevaluatedProperties: trueat the top ofquery:itself, plusViewOptions.additionalProperties: true, guarantees forward-compatibility for v0.1.7+.AggregationOpaccepts an enum or anx_*-prefixed string — this is the extensibility hook for Tier 2 helpers without giving up validation entirely (the schema rejectsxx_foo,XYZ, etc.).- Empty-cell semantics is a first-class field on
Aggregate.emptyso all five worked recipes can express the difference between “0 mappings exist” (zero) and “this combination was never queried” (gap) — a frequent footgun in coverage matrices.
3. Versioning and Forward-Compatibility Plan
Section titled “3. Versioning and Forward-Compatibility Plan”query.version uses SchemaVer (MODEL.REVISION.ADDITION) — not SemVer — because Snowplow’s analysis applies directly: a query block is a serialization format, not an API. The increment rules:
- MODEL (breaking): renaming
cellto something else, removing a shape, changing the type ofgroupBy. Recipe migration required. - REVISION: a primitive becomes optional that was required, or an enum widens. Old recipes still parse; new recipes may not parse on old loaders.
- ADDITION: new optional field (e.g., adding
cache:to query, or addingcount_distinct_approxtoAggregationOp). Fully backward-compatible; the loader’s open content model accepts it silently on older versions.
Recipe SemVer is independent of query.version. A recipe is bumped per its emission contract; the query block is bumped per its schema contract; the spec repo carries both in lockstep with a compatibility table.
Loader behaviour for unknown fields:
- JSON Schema validation runs in lax mode by default (
unevaluatedProperties: trueat root +ViewOptions),strictmode for primitive sub-blocks. Unknown root-level keys are warned, not errored. - The codeblock processor in v0.1.7 is required to gracefully degrade: any unknown nested key inside
primitives:for a known shape is a hard validation error (catch typos); any unknown sibling ofprimitives:(e.g.cache:,auth:) is preserved through merge and forwarded verbatim to handlers that opt in via afeatureFlagsregistry. - Recipes encountering an unknown
shape:enum value fall back to a “raw rows” Bases table view and surface a “shape not supported in this Crosswalker version” notice — matching the Kubernetes pattern of unknownapiVersionrejection but with a softer landing.
query: and body: may coexist. They are not exclusive. The recipe model is: body: controls what gets written to disk (notes, frontmatter, filenames); query: controls what gets read from disk (live views over those notes). A recipe like “emit a NIST 800-53 control note for each control AND surface a coverage matrix view” is the canonical case where both are needed. Validation: the schema marks both as optional siblings; downstream emitter and view-compiler are independent passes over the AST.
4. Reference Recipes
Section titled “4. Reference Recipes”Recipe 1 — Coverage Matrix (NIST 800-53 × NIST CSF, cells = count of equivalent_to edges). This is the canonical NIST OLIR crosswalk view, identical in spirit to the OLIR mapping of CSF 2.0 to SP 800-53 Rev. 5.
Recipe 2 — Crosswalk Density (any 2 ontologies, cells = density of mappings). Density = count(edges)/(rows × cols) for the cell’s row-col bucket. Uses the extensible density op.
Recipe 3 — Freshness Heatmap (controls × time-buckets, cells = count of evidence reviewed in bucket). Crosses an ontology with a temporal axis sourced from note frontmatter — exercises join between an ontology and the note graph.
Recipe 4 — Ontology Overlap (concepts in A ∩ B via equivalent_to closure). Uses graph shape to materialize the overlap as a bipartite graph, with transitive: true to compute the closure (matching SKOS broaderTransitive/narrowerTransitive semantics).
Recipe 5 — SKOS Subject Density (broader/narrower hierarchy with leaf counts). Uses the hierarchy shape over a SKOS concept scheme, with leafAgg counting concepts at each subtree.
All five validate against the schema in §2 (verified by hand-walking each required array per shape).
5. Boundary Verdict — Data-Only, Not Code-with-Fences
Section titled “5. Boundary Verdict — Data-Only, Not Code-with-Fences”Data-only wins. Decisively. Five concrete reasons:
- Portability across mechanisms. The same
query:block must compile to (a) Bases YAML for table/list/cards, (b) thecrosswalkerPivotcodeblock processor for pivot/graph/hierarchy/timeline, and (c) — eventually — SQL/SPARQL backends if a server-side mode appears. A code-with-fences recipe (query.sql: |- SELECT ...) locks the recipe to one engine. Datasette is the cautionary case study: PR #2191 explicitly extracts canned queries out ofmetadata.yamlafter years of pain. - Validation reach. A JSON Schema can validate “graph shape requires
edges.via”; it cannot validate the contents of a free-form SQL string. - Three-way merge fidelity. YAML keys merge node-by-node; SQL strings merge as opaque blobs. Ch 28’s settled lifecycle assumes structured merge.
- AI/agent authorability. Code-first semantic layers (Cube, MetricFlow) are explicitly designed for agent curation precisely because YAML primitives are predictable. SQL strings are not.
- Anti-pattern alignment. “Embedding raw SQL” is the first listed anti-pattern; “custom Crosswalker query language” is the second. Data-only with primitives that compile to existing engines avoids both.
Concession. A single, narrow escape hatch is acceptable in future MODEL versions: an opt-in query.raw: block with engine: sparql|cypher|sql for power users, gated by an explicit feature flag, never emitted by the system, and excluded from automatic three-way merge. Do not ship this in v0.1.7. Wait until a real user has hit the wall on declarative primitives at least three times.
6. Lifecycle Integration with Ch 28
Section titled “6. Lifecycle Integration with Ch 28”| Ch 28 settled item | Behaviour for query: |
|---|---|
provenance.source ∈ {system, user, community} | query.provenance mirrors recipe-level provenance. A system query is one shipped with Crosswalker; community is from a starter-pack; user is hand-authored. The source is propagated to compiled Bases views as a YAML comment. |
user_edited: true flag | Set on the query: block when any descendant key is edited via the UI. Granularity is the whole block (matches Ch 28 section-level granularity for body:/frontmatter:). |
| Three-way merge | Performed at the YAML node level. The merge resolver treats primitives as a structural map (recursive merge by key); arrays inside primitives (select:, agg:, sort:) are merged with the boxboat config-merge / lodash deep-merge semantics: source-overrides-destination by index, configurable as `merge |
| Migration on schema bump | A MODEL-bump triggers a recipe migration script (parallel to dbt’s version: 2 header migrations). REVISION/ADDITION bumps are silent. |
Critical refinement. When user_edited:true is set on query:, the system will no longer auto-overwrite the block on recipe re-import; instead it stages a three-way merge (base = previous system version, ours = user-edited, theirs = new system version). This is the same algorithm Ch 28 specified for body:, lifted unchanged.
7. Validation Against the Seven Primitives × Six Shapes
Section titled “7. Validation Against the Seven Primitives × Six Shapes”The seven primitives (traverse, filter, aggregate, group, sort, project, join) compose into shapes as follows. ✅ = required, ◯ = optional, — = not applicable.
| Primitive | table | list | pivot | graph | hierarchy | timeline |
|---|---|---|---|---|---|---|
filter (where:) | ◯ | ◯ | ◯ | ◯ | ◯ | ◯ |
project (select:,item:,label:) | ✅ | ✅ | ✅ | ✅ | — | ✅ |
sort (sort:) | ◯ | ◯ | ◯ | — | ◯ | ◯ |
group (groupBy:,by:) | ◯ | — | ✅ | — | ✅ | ◯ |
aggregate (agg:,cell:,leafAgg:) | ◯ | — | ✅ | — | ◯ | ◯ |
join (join:) | ◯ | — | ◯ | (implicit via edges) | (implicit via predicate) | ◯ |
traverse (traverse:,edges.depth,predicate) | — | — | — | ✅ | ✅ | — |
YAML walk-throughs (one-liner showing each primitive’s anchor key per shape):
All seven primitives are expressible across the schema; the recipe author never invokes them by name (no traverse: keyword for table) — they are spelled per-shape, which is what keeps recipes hand-authorable.
Recommendations
Section titled “Recommendations”Stage 1 — Land the schema, additive only (target v0.1.7).
- Merge the JSON Schema fragment in §2 into
spec/recipe.schema.jsonunderproperties.query. Wire AJV validation withstrict: true, allErrors: true, draft: '2020-12'. - Ship the five reference recipes as
examples/queries/*.yamland add them to the recipe-test fixture suite. - Add a
query.versionfield defaulting to"1.0.0"and document SchemaVer (MODEL.REVISION.ADDITION) increment rules inspec/CHANGELOG.md. - Wire compile paths:
output.target == bases→ emit.baseview file;output.target == codeblock→ register handler in the existing crosswalkerPivot processor;note/inlinedeferred.
Stage 2 — Lifecycle wiring (still v0.1.7 if time permits, else v0.1.8).
5. Treat query: as a section in the Ch 28 user_edited/provenance/three-way-merge engine. Add unit tests that confirm: editing view.limit flips user_edited:true; re-importing a system recipe stages a 3-way merge; provenance.source is preserved through round-trip.
6. Implement the deep-merge semantics (boxboat-style) for arrays inside primitives. Default policy: select:/agg:/sort: arrays merge by as/identity field; otherwise concat.
Stage 3 — Forward-compat validation (v0.1.8+).
7. Add a featureFlags: registry in the loader so unknown sibling keys of primitives: are preserved through merge but ignored on render until a handler opts in. This matches how dbt-cloud silently ignores unknown YAML configs from newer dbt versions.
8. Publish the AJV-compiled schema as a separate npm package @crosswalker/recipe-schema so external authors and IDEs (vscode-yaml # yaml-language-server: $schema=...) get autocomplete.
Stage 4 — Decision points (v0.2.0).
9. Threshold for adding raw-query escape hatch: ≥3 distinct user reports of “I cannot express this with primitives” + ≥1 from a power user with SPARQL fluency. Below threshold, do not introduce query.raw:.
10. Threshold for promoting query.version MODEL bump: any change that causes ≥1 of the five reference recipes to fail validation. Below that, every change must be REVISION or ADDITION.
11. Threshold for adding new shapes: a new shape requires (a) a JSON Schema sub-block under ShapeDispatch, (b) a Bases-compile path or a codeblock-compile path, and (c) at least one reference recipe. No shape ships without all three.
Concrete YAML fragment to add to spec/recipe.schema.json (the diff itself):
Caveats
Section titled “Caveats”- Three live Crosswalker docs URLs (
/concepts/query-primitives/,/concepts/view-shapes/, the v0.1.6 milestone, and the 2026-05-07 synthesis log) were not directly fetchable from the research environment — those pages were not indexed in the search results returned to this agent and the fetcher refused the URLs as not previously seen. The schema design therefore reconstructs the seven primitives and six shapes from the task brief itself, cross-checked against SPARQL/Cypher/MetricFlow/Cube/LookML conventions for those exact concept names. If the canonical Crosswalker definitions diverge (e.g., “project” means something narrower than relational projection, or “join” is restricted to ontology joins only), the per-shaperequiredarrays and theProjection/Join$defsshould be tightened accordingly. Recommend a cross-walk QA pass againstcybersader.github.io/crosswalker/concepts/...before merging. - Bases is still evolving (Obsidian 1.9.10 introduced
.basefiles; pivot is not a native Bases view as of the documentation surveyed). The Bases compile target therefore only handlestable/listnatively and falls back to thecrosswalkerPivotcodeblock processor for the other four shapes. If Obsidian ships native pivot/graph/timeline views, the compile rules should be re-examined. - The aggregation
densityop is Crosswalker-specific and not a SQL/MetricFlow standard. It is included in the built-in enum because Recipe 2 needs it and because computing density correctly requires knowing bothcount(edges)and the row × col cardinality — which the engine has but a genericavg/countcannot infer. If this is contentious, demote tox_densityin the extensible namespace. - Field-selector grammar in
FieldSelectoruses dotted-path syntax (note.frontmatter.status) that overlaps with but does not exactly match Bases formula grammar. The compiler must translate; a future ADDITION-bump may unify the two grammars under a single Bases-compatible expression dialect. - The OpenAPI-style
discriminatorkeyword is not in JSON Schema 2020-12 core — the schema uses the AJV-supportedoneOf+constpattern, which is portable across all 2020-12-conformant validators but loses some nicety in error messages. If the team adopts AJV’s optionaldiscriminatorkeyword viaajv-formats/ajv-keywords, error messages get cleaner at the cost of validator portability. Recommend staying portable. - No real-world adversarial author has tried to write one of these by hand yet. The “schema so rich that recipe authors can’t author by hand” anti-pattern is mitigated by the per-shape
requiredarrays being short (2–3 keys) and by the five reference recipes serving as copy-paste seeds — but this should be validated with a 30-minute hand-authoring usability test before declaring the design final.