Skip to content
🚧 Early alpha — building the foundation. See the roadmap →

Challenge 12: Datalog vs SQL for SSSOM chain-rule derivation

Created Updated

The Crosswalker pairwise + pivot crosswalk commitment requires a derivation engine — the component that composes pairwise mappings through a pivot/spine using the SSSOM chain rules (22 distinct rules per the SSSOM chain-rules document). Two viable approaches:

  1. Datalog — Native fit. The OxO2 paper (EMBL-EBI, 2025; arXiv 2506.04286) implements SSSOM chain rules in Nemo (Datalog). RCE2 (the rule OxO2 most-uses) is a textbook Datalog rule. Cleaner provenance via Datalog’s native rule-firing trace
  2. Recursive CTE — Broader engine support (every SQL DB has it). Challenge 10’s deliverable picked this path. Less expressive than Datalog for variable-depth traversal with provenance

This is a direct fork in the road. The choice ripples into the Tier 2/3 engine selection (Challenge 11) and into the audit-trail story (a clean derivation_path field per derived edge requires the engine to expose its rule-firing trace).

Express the SSSOM chain rules in both forms:

  • Datalog: write RCE2 (and at least 3 other representative rules) as Datalog clauses with provenance. Show how derived_from (SSSOM PR #537) populates from rule firings
  • Recursive CTE: write the same rules as recursive SQL CTEs targeting DuckDB or Postgres syntax. Show how derivation_path populates as a UDF or a list_append accumulator

Compare:

  • Lines of code per rule
  • Readability for a fresh contributor
  • Edge-case handling (cycles, ambiguity, non-termination)
  • Provenance fidelity — does the rule firing show which rule fired, in what order, on what input?

Theoretical (no benchmarks required, but characterize the cost models):

  • Depth-2 to depth-5 chain composition on representative dataset sizes (10K, 100K, 1M mappings)
  • Cycle detection cost (Datalog’s bottom-up fixed-point vs CTE’s visited array)
  • Recursion-depth limits (Datalog has none; SQL engines vary — Postgres unlimited, DuckDB unlimited, sql.js capped)
  • Memory footprint at fixed-point convergence

Constrain by what engines run in the deployment shapes Challenge 11 will pick:

  • Browser-side (Tier 2): which Datalog engines run in WASM? Nemo has WASM. Soufflé’s WASM story?
  • Server-side (Tier 3): every Datalog engine is available; SQL is universal
  • Cross-tier consistency: if Tier 2 uses Datalog and Tier 3 uses Postgres+SQL, what’s the cost of maintaining two derivation paths?

4. Provenance and audit-trail implications

Section titled “4. Provenance and audit-trail implications”

Challenge 08 (and the eventual Challenge 13 follow-on) require every derived edge to carry an auditable derivation_path:

  • Which input edges contributed?
  • Which chain rule fired?
  • What was the inherited confidence / staleness?

Datalog engines often natively expose this (OxO2 records “derived under rule X” per derivation). SQL recursive CTEs require explicit accumulator construction. Compare:

  • Cleanliness of the resulting derivation_path field
  • Cost of injecting provenance into recursive CTE accumulators
  • Compatibility with SSSOM’s pending derived_from slot
  • Which is easier for a fresh contributor to read, debug, and extend?
  • Which has better tooling (debuggers, profilers, query plans)?
  • Which is more failure-mode-stable under bad input data?
  • Datalog produces materialized derived rows; SQL queries them: Datalog runs at materialization time, results live in a SQL table, downstream queries are pure SQL. Best of both?
  • SQL produces base mappings; Datalog computes closures on demand: opposite layering
  • Both are wrong; use a different paradigm entirely (e.g., Apache Arrow’s stream-graph computation)
  1. Worked example: at least 4 SSSOM chain rules expressed in both Datalog and recursive CTE
  2. Provenance comparison: concrete derivation_path field examples from both
  3. Performance characterization: theoretical cost analysis with O() complexity per chain depth and dataset size
  4. Engine-availability matrix: which Datalog engines / SQL engines support what’s needed, in WASM and server contexts
  5. Recommendation: Datalog primary / SQL primary / hybrid — with explicit rationale on which axis breaks the tie
  6. Migration path if Challenge 10’s recursive-CTE direction is overturned
  • Specific Datalog engine selection (Nemo vs Soufflé vs Datomic) — covered by Challenge 11
  • Tier 1 materialized-folder generator design (uses whichever engine wins; Tier 1 doesn’t run derivation queries directly)
  • Real-data benchmarks (Challenge 02 territory)
  • Narrows from Challenge 10: Ch 10 picked recursive CTE without considering Datalog seriously
  • Coordinates with Challenge 11: Ch 11 picks the engine; Ch 12 picks the paradigm the engine must support
  • Independent of Challenge 13: different layer