Datalog

Datalog is a declarative query and rule language: you write rules about what’s true, and the engine works out everything that follows from them. It has been around since the 1970s, is syntactically a subset of Prolog, and is closely tied to relational database theory: a non-recursive Datalog rule corresponds to a select-project-join query.

In Crosswalker it’s used for one specific job — deriving new SSSOM mappings from existing ones. If NIST CSF Identify maps to ISO 27002 Risk Assessment and ISO 27002 Risk Assessment maps to MITRE ATT&CK Reconnaissance, Crosswalker can derive NIST CSF Identify → MITRE ATT&CK Reconnaissance automatically with a confidence score that’s the minimum of the two source confidences.

You write that rule once:

derived(a, c, min(c1, c2)) :- mapping(a, b, c1), mapping(b, c, c2).

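The Datalog engine handles the rest. Add a recursive variant of the rule whose first body atom is derived(a, b, c1) instead of mapping(a, b, c1), and the engine follows chains of any length without you having to write a loop.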
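To make concrete what "handles the rest" means, here is a minimal Python sketch of semi-naive bottom-up evaluation, the general strategy Datalog engines use for recursive rules. It is not Nemo's implementation and not Crosswalker code: the function name `derive_mappings` and the `(subject, object, confidence)` triple layout are assumptions made for illustration. It applies the chain rule recursively, feeding newly derived facts back in until a round produces nothing new.

```python
def derive_mappings(mappings):
    """Derive chained mappings from (subject, object, confidence) triples.

    Hypothetical helper for illustration only. A derived fact carries the
    minimum confidence found along its chain.
    """
    base = set(mappings)
    derived = set()
    delta = set(base)  # facts discovered in the previous round
    while delta:
        new = set()
        for (a, b, c1) in delta:
            for (b2, c, c2) in base:
                if b == b2:
                    fact = (a, c, min(c1, c2))
                    if fact not in derived:
                        new.add(fact)
        derived |= new
        delta = new  # only new facts can produce further new facts
    return derived


example = [
    ("NIST CSF Identify", "ISO 27002 Risk Assessment", 0.9),
    ("ISO 27002 Risk Assessment", "MITRE ATT&CK Reconnaissance", 0.7),
]
result = derive_mappings(example)
```

With the two example mappings, `result` holds a single derived triple from NIST CSF Identify to MITRE ATT&CK Reconnaissance with confidence 0.7. The loop terminates even on cyclic data because `min` can only return confidences that already occur in the input, so the set of possible facts is finite.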

The alternative is plain SQL with WITH RECURSIVE (a “recursive CTE”). Both can express transitive closure. Datalog wins on three things that matter to Crosswalker:

  1. Stratified negation: Datalog can say “find mappings A→C that are NOT derivable through any intermediate B” directly. SQL recursive CTEs can approximate this with NOT EXISTS or a LEFT JOIN anti-join, but the queries get ugly fast.
  2. Aggregation inside recursion: Datalog rules can compute min(confidence_a, confidence_b) as part of the rule. A recursive CTE can take a scalar minimum at each step, but most SQL engines restrict aggregates inside the recursive term, so keeping one best confidence per pair needs an extra aggregation pass or window function around the CTE.
  3. Magic-set rewriting: modern Datalog engines (Nemo, CozoDB) automatically optimize chain queries by working backward from what you asked for. Recursive CTEs don’t do this; they materialize the full transitive frontier.
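The stratified negation point can be sketched in Python as two passes. This is an illustration under assumptions, not engine code: `closure_pairs` and `direct_only` are hypothetical names, and confidences are dropped to keep the focus on the negation. The key idea of stratification is that the negated relation is computed to completion first (stratum 1) and only then consulted with "not" (stratum 2).

```python
def closure_pairs(pairs):
    """Stratum 1: pairs (a, c) linked by a chain of two or more mappings."""
    pairs = set(pairs)
    derived = set()
    delta = set(pairs)
    while delta:
        new = {
            (a, c)
            for (a, b) in delta
            for (b2, c) in pairs
            if b == b2
        } - derived
        derived |= new
        delta = new
    return derived


def direct_only(pairs):
    """Stratum 2: direct mappings that are NOT derivable through a chain."""
    return set(pairs) - closure_pairs(pairs)


pairs = {("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")}
irreducible = direct_only(pairs)
```

Here the direct mapping a→c drops out because it is also reachable through b, while a→b, b→c, and c→d remain.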

For Crosswalker at scale (10⁶+ mappings with branching factor over 5), the difference is real. At small scale (≤10⁵ mappings) recursive CTE is fine — which is why Tier 2-Lite drops Datalog and uses recursive CTE instead.
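For comparison, the recursive CTE route can be sketched with Python's built-in `sqlite3` module standing in for sqlite-wasm. The table name `mapping`, its columns `subject_id`, `object_id`, and `confidence`, and the helper `derive_with_cte` are assumptions for illustration and do not reflect Crosswalker's actual schema. The query relies on SQLite's multi-argument scalar `min()` and on `UNION` discarding rows it has already produced, which is what lets the recursion stop on cyclic data.

```python
import sqlite3

CHAIN_QUERY = """
WITH RECURSIVE derived(subject_id, object_id, confidence) AS (
    -- Seed: two-hop chains built from direct mappings.
    SELECT m1.subject_id, m2.object_id, min(m1.confidence, m2.confidence)
    FROM mapping AS m1
    JOIN mapping AS m2 ON m1.object_id = m2.subject_id
  UNION
    -- Step: extend an already derived chain by one more mapping.
    SELECT d.subject_id, m.object_id, min(d.confidence, m.confidence)
    FROM derived AS d
    JOIN mapping AS m ON d.object_id = m.subject_id
)
SELECT subject_id, object_id, confidence FROM derived
"""


def derive_with_cte(mappings):
    """Run the chain rule as a recursive CTE over an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE mapping (subject_id TEXT, object_id TEXT, confidence REAL)"
    )
    conn.executemany("INSERT INTO mapping VALUES (?, ?, ?)", list(mappings))
    rows = conn.execute(CHAIN_QUERY).fetchall()
    conn.close()
    return set(rows)


cte_result = derive_with_cte([("a", "b", 0.9), ("b", "c", 0.7), ("c", "d", 0.8)])
```

The query returns every derived row. Collapsing to one best confidence per pair would take an additional `GROUP BY` around the CTE, which is the extra pass mentioned in the comparison above.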

Per the Ch 12 Datalog vs SQL deliverable, Crosswalker uses Nemo — a Rust Datalog engine from TU Dresden with native WASM support, stratified negation, existential rules, and W3C-tested OBDA semantics. It’s the Datalog engine in the Tier 2 layered stack.

Sister Datalog engines on the watchlist:

  • CozoScript (CozoDB): a superset of Datalog with built-in graph algorithms (PageRank, Dijkstra, Yen-K) and HNSW vector queries integrated as first-class joins. Rejected for adoption because its maintenance signal is weakening; see Ch 14 §2.3.
  • WOQL (TerminusDB): Datalog with path(), dot(), slice() operators. Relevant only if TerminusDB is adopted.
  • Datalevin (Clojure/JVM): best-in-class query optimizer; not WASM, so out of scope for Tier 2.
  • Minigraf (bi-temporal Datalog): pre-1.0, single-maintainer; on the Ch 14 trigger B watchlist for SSSOM bi-temporal history queries.

How it relates to other Crosswalker query layers

Crosswalker query stack (where each language lives)

Tier 2 (browser, full stack)
├── DuckDB-WASM        →  SQL (analytics, joins, Parquet)
├── Oxigraph-WASM      →  SPARQL (RDF, federated queries)
└── Nemo-WASM          →  DATALOG (SSSOM chain-rule derivation)  ← us

Tier 2-Lite (low-end / Obsidian Mobile)
└── sqlite-wasm        →  SQL with recursive CTEs (Datalog substitute)

Tier 3 (server)
├── Apache Jena Fuseki →  SPARQL
├── oxigraph-server    →  SPARQL (same engine as Tier 2 Oxigraph)
├── DuckDB-on-server   →  SQL + SQL/PGQ
└── (optional) TerminusDB →  WOQL Datalog

In words: Datalog fills a gap that SQL and SPARQL each cover only imperfectly. SQL is great at joins but awkward at recursion. SPARQL is great at graph patterns but limited at rule-level aggregation. Datalog handles both. That does not make it the default everywhere; several jobs stay with the other layers:

  • Tier 2-Lite: bundle size matters more than expressivity. Use recursive CTE.
  • Local user-facing queries: SPARQL is easier to write by hand and Oxigraph is fast enough. Use SPARQL.
  • Analytical rollups (coverage matrices, predicate distributions): use DuckDB SQL.
  • Federation across multiple endpoints: Comunica + SPARQL SERVICE is the right abstraction.

Datalog is reserved for derivation rules — the place where its expressivity premium pays off.