Challenge 12: Datalog vs SQL for SSSOM chain-rule derivation
Why this exists
Section titled “Why this exists”The Crosswalker pairwise + pivot crosswalk commitment requires a derivation engine — the component that composes pairwise mappings through a pivot/spine using the SSSOM chain rules (22 distinct rules per the SSSOM chain-rules document). Two viable approaches:
- Datalog — Native fit. The OxO2 paper (EMBL-EBI, 2025; arXiv 2506.04286) implements SSSOM chain rules in Nemo (Datalog). RCE2 (the rule OxO2 most-uses) is a textbook Datalog rule. Cleaner provenance via Datalog’s native rule-firing trace
- Recursive CTE — Broader engine support (every SQL DB has it). Challenge 10’s deliverable picked this path. Less expressive than Datalog for variable-depth traversal with provenance
This is a direct fork in the road. The choice ripples into the Tier 2/3 engine selection (Challenge 11) and into the audit-trail story (a clean derivation_path field per derived edge requires the engine to expose its rule-firing trace).
What to investigate
Section titled “What to investigate”1. Formal comparison of expressiveness
Section titled “1. Formal comparison of expressiveness”Express the SSSOM chain rules in both forms:
- Datalog: write RCE2 (and at least 3 other representative rules) as Datalog clauses with provenance. Show how
derived_from(SSSOM PR #537) populates from rule firings - Recursive CTE: write the same rules as recursive SQL CTEs targeting DuckDB or Postgres syntax. Show how
derivation_pathpopulates as a UDF or alist_appendaccumulator
Compare:
- Lines of code per rule
- Readability for a fresh contributor
- Edge-case handling (cycles, ambiguity, non-termination)
- Provenance fidelity — does the rule firing show which rule fired, in what order, on what input?
2. Performance characteristics
Section titled “2. Performance characteristics”Theoretical (no benchmarks required, but characterize the cost models):
- Depth-2 to depth-5 chain composition on representative dataset sizes (10K, 100K, 1M mappings)
- Cycle detection cost (Datalog’s bottom-up fixed-point vs CTE’s
visitedarray) - Recursion-depth limits (Datalog has none; SQL engines vary — Postgres unlimited, DuckDB unlimited, sql.js capped)
- Memory footprint at fixed-point convergence
3. Engine availability
Section titled “3. Engine availability”Constrain by what engines run in the deployment shapes Challenge 11 will pick:
- Browser-side (Tier 2): which Datalog engines run in WASM? Nemo has WASM. Soufflé’s WASM story?
- Server-side (Tier 3): every Datalog engine is available; SQL is universal
- Cross-tier consistency: if Tier 2 uses Datalog and Tier 3 uses Postgres+SQL, what’s the cost of maintaining two derivation paths?
4. Provenance and audit-trail implications
Section titled “4. Provenance and audit-trail implications”Challenge 08 (and the eventual Challenge 13 follow-on) require every derived edge to carry an auditable derivation_path:
- Which input edges contributed?
- Which chain rule fired?
- What was the inherited confidence / staleness?
Datalog engines often natively expose this (OxO2 records “derived under rule X” per derivation). SQL recursive CTEs require explicit accumulator construction. Compare:
- Cleanliness of the resulting
derivation_pathfield - Cost of injecting provenance into recursive CTE accumulators
- Compatibility with SSSOM’s pending
derived_fromslot
5. Maintainability
Section titled “5. Maintainability”- Which is easier for a fresh contributor to read, debug, and extend?
- Which has better tooling (debuggers, profilers, query plans)?
- Which is more failure-mode-stable under bad input data?
6. Hybrid possibilities
Section titled “6. Hybrid possibilities”- Datalog produces materialized derived rows; SQL queries them: Datalog runs at materialization time, results live in a SQL table, downstream queries are pure SQL. Best of both?
- SQL produces base mappings; Datalog computes closures on demand: opposite layering
- Both are wrong; use a different paradigm entirely (e.g., Apache Arrow’s stream-graph computation)
Success criteria for the deliverable
Section titled “Success criteria for the deliverable”- Worked example: at least 4 SSSOM chain rules expressed in both Datalog and recursive CTE
- Provenance comparison: concrete
derivation_pathfield examples from both - Performance characterization: theoretical cost analysis with O() complexity per chain depth and dataset size
- Engine-availability matrix: which Datalog engines / SQL engines support what’s needed, in WASM and server contexts
- Recommendation: Datalog primary / SQL primary / hybrid — with explicit rationale on which axis breaks the tie
- Migration path if Challenge 10’s recursive-CTE direction is overturned
Out of scope
Section titled “Out of scope”- Specific Datalog engine selection (Nemo vs Soufflé vs Datomic) — covered by Challenge 11
- Tier 1 materialized-folder generator design (uses whichever engine wins; Tier 1 doesn’t run derivation queries directly)
- Real-data benchmarks (Challenge 02 territory)
Relationship to prior challenges
Section titled “Relationship to prior challenges”- Narrows from Challenge 10: Ch 10 picked recursive CTE without considering Datalog seriously
- Coordinates with Challenge 11: Ch 11 picks the engine; Ch 12 picks the paradigm the engine must support
- Independent of Challenge 13: different layer
Related
Section titled “Related”- Challenge 10: Graph→tabular bridging engine — predecessor; recursive-CTE-only treatment
- Challenge 11: Tier 2/3 engine deep survey — coordinates
- 05-01 §2.1 pairwise + optional inheritable pivot — the commitment that requires a derivation engine
- SSSOM chain-rules document and the OxO2 paper (arXiv 2506.04286) — primary external references
- Roadmap: Foundation