SNOMED CT Integration

This document describes the technical implementation of SNOMED CT terminology in CheckTick. For dataset loading paths and the survey runtime, see Dataset Loading Architecture. For user-facing guidance, see Datasets and Dropdowns.


Core Design Principle

SNOMED CT options are served live from a local SQLite database (snomed.db) generated by the sct Rust binary. The actual terminology is never copied into Postgres.

CheckTick holds one lightweight descriptor record per exposed refset in Postgres. When a dropdown or typeahead is rendered, Django queries snomed.db directly via the SnomedResolver service. This means:

  • SNOMED terms are always current: update snomed.db and every dropdown updates immediately
  • Postgres stays stable across SNOMED release cycles (no data migrations when terminology changes)
  • Adding a new refset requires only a single descriptor row insert, with no data migration
  • Self-hosters who have not set up SNOMED get graceful degradation, not a 500 error

What Stays in Postgres

Each exposed SNOMED list has a DataSet record in Postgres:

| Field | Type | Purpose |
|---|---|---|
| key | CharField | Stable identifier used in URLs and survey builder |
| name | CharField | Display name shown in the dataset browser |
| category | "snomed" | Routes all option fetching through SnomedResolver |
| source_type | "snomed_db" | Signals that the source is a local SQLite file, not an API or manual entry |
| snomed_refset_id | CharField | The SNOMED refset SCTID or root concept SCTID for hierarchy queries |
| snomed_query_type | CharField | One of refset, descendants, or ecl; controls how snomed.db is queried |
| snomed_ecl | TextField | ECL expression when snomed_query_type = "ecl" |
| snomed_release_date | DateField | Release date of the snomed.db used when this descriptor was seeded |
| snomed_member_count | IntegerField | Concept count recorded at seed time; used to select the appropriate UI widget |
| is_featured | BooleanField | True = shown prominently; False = hidden by default but searchable |
| options | JSONField | Always [] for SNOMED datasets; never written to |

The DataSet record controls discoverability and access. It does not hold terminology data.


What Lives in snomed.db

The SQLite file is generated by sct trud download --edition uk_monolith --pipeline and stored on a persistent volume (never committed to git). It contains concepts, descriptions, relationships, refsets, and FTS5 full-text indexes for the full UK SNOMED CT Monolith edition.

Default path: /app/data/snomed.db (override with SNOMED_DB_PATH env var).

The file is approximately 500 MB–1 GB at rest. Intermediate build artefacts peak at around 3.5 GB on disk during an sct rebuild; see Self-Hosting Volumes.


SnomedResolver Service

checktick_app/surveys/snomed_resolver.py is the single point of contact between Django and snomed.db.

Public API

from surveys.snomed_resolver import get_options, search, SnomedUnavailableError

# Returns list of "SCTID | preferred term" strings for a given DataSet descriptor
options = get_options(dataset)          # raises SnomedUnavailableError if snomed.db absent

# FTS5 full-text search: returns [{id: sctid, text: preferred_term}, ...]
results = search(query="epilepsy", limit=20)

SnomedUnavailableError is raised when snomed.db is not found at the configured path. It must always be caught at the view layer and never allowed to propagate as a 500.

Connection Strategy

snomed.db is opened read-only (?mode=ro) using Python's built-in sqlite3 module. Connections are held in threading.local() storage so that each Django worker thread reuses its own connection without locking. Django's ORM is not involved.
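
A minimal sketch of this connection strategy, using only the standard library. The function name get_connection and the module-level _local variable are assumptions for illustration; the sketch also assumes a single database path per process, since the cached connection is keyed by thread only.

```python
import sqlite3
import threading
from pathlib import Path

# One connection slot per thread; nothing is shared across threads.
_local = threading.local()


def get_connection(db_path: str) -> sqlite3.Connection:
    """Return this thread's read-only connection, opening it on first use."""
    conn = getattr(_local, "conn", None)
    if conn is None:
        # as_uri() yields file:///abs/path; mode=ro makes SQLite refuse writes
        # and fail at connect time if the file does not exist.
        uri = Path(db_path).resolve().as_uri() + "?mode=ro"
        conn = sqlite3.connect(uri, uri=True)
        _local.conn = conn
    return conn
```

Because mode=ro never creates a missing file, a failed connect is a natural place to raise SnomedUnavailableError.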

Query Types

| snomed_query_type | SQL strategy |
|---|---|
| refset | SELECT conceptId, term FROM refset_members WHERE refsetId = ? |
| descendants | Recursive CTE walking the is_a relationship from the root SCTID |
| ecl | ECL expression evaluated via sct's embedded parser (subset of full ECL) |
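
The descendants strategy can be sketched as a recursive CTE. The relationships and concepts table layouts below are assumptions made for illustration; the real snomed.db schema produced by sct may differ. The demo data is synthetic, and only the "Is a" type SCTID (116680003) is a real SNOMED CT identifier.

```python
import sqlite3

IS_A = "116680003"  # SCTID of the SNOMED CT "Is a" relationship type

# UNION (not UNION ALL) de-duplicates concepts reachable via several parents.
DESCENDANTS_SQL = """
WITH RECURSIVE descendants(id) AS (
    SELECT sourceId FROM relationships
    WHERE destinationId = :root AND typeId = :is_a
    UNION
    SELECT r.sourceId FROM relationships AS r
    JOIN descendants AS d ON r.destinationId = d.id
    WHERE r.typeId = :is_a
)
SELECT c.id, c.preferred_term
FROM concepts AS c JOIN descendants AS d ON d.id = c.id
ORDER BY c.preferred_term
"""


def fetch_descendants(conn, root_sctid):
    """Return 'SCTID | preferred term' strings for all descendants of root."""
    rows = conn.execute(DESCENDANTS_SQL, {"root": root_sctid, "is_a": IS_A})
    return [f"{sctid} | {term}" for sctid, term in rows]


def build_demo_db():
    """Create a tiny in-memory hierarchy with one multi-parent concept."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE concepts (id TEXT PRIMARY KEY, preferred_term TEXT);
        CREATE TABLE relationships (sourceId TEXT, destinationId TEXT, typeId TEXT);
    """)
    conn.executemany("INSERT INTO concepts VALUES (?, ?)", [
        ("1", "Root"), ("2", "Child A"), ("3", "Child B"), ("4", "Grandchild"),
    ])
    conn.executemany("INSERT INTO relationships VALUES (?, ?, ?)", [
        ("2", "1", IS_A), ("3", "1", IS_A),
        ("4", "2", IS_A), ("4", "3", IS_A),  # Grandchild has two parents
    ])
    return conn
```

The root concept itself is excluded from the result, which matches the usual meaning of "descendants" rather than "descendants or self".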

Curated Refsets

The dataset browser defaults to showing only is_featured = True SNOMED datasets. The ?show_all_snomed=1 query parameter reveals the full set.

| is_featured | Meaning | Default visibility |
|---|---|---|
| True | Clinically meaningful, curated for NHS use | Shown prominently in the dataset browser |
| False | Administrative, technical, or bulk content | Hidden by default; surfaced via search |

The seed_snomed_datasets management command seeds the following featured datasets. Sizes are recorded at seed time as snomed_member_count.

| Dataset | Query type | Typical size |
|---|---|---|
| Antiepileptic drug list (QOF) | refset | ~40 |
| Diabetes drug list (QOF) | refset | ~60 |
| Atrial fibrillation drug list (QOF) | refset | ~20 |
| COPD/Asthma drug list (QOF) | refset | ~50 |
| UK Ethnic Category (NHS) | refset | ~30 |
| UK Allergy Substances | refset | ~300 |
| Paediatric epilepsy syndromes | refset | ~50 |
| Paediatric endocrine disorders | refset | ~80 |
| Paediatric cardiac conditions | refset | ~100 |
| Paediatric respiratory conditions | refset | ~80 |
| Rare chromosomal conditions | refset | ~60 |
| Paediatric neuromuscular disorders | refset | ~70 |
| Paediatric epilepsy genes | refset | ~40 |
| Paediatric renal conditions | refset | ~60 |
| Paediatric GI conditions | refset | ~80 |

dm+d VTM (~1,000 drug substances) and VMP (~20,000 medicinal products) are seeded as is_featured = False. They are present in snomed.db and available via search or admin promotion, but are not surfaced as survey dropdown options because they are too large and too broad for a constrained question.

Widget Selection by Member Count

snomed_member_count (stored at seed time) determines the interaction widget at survey build and render time:

| Member count | Widget |
|---|---|
| < 500 | Standard <select> dropdown |
| 500–2,000 | Searchable select (combobox) |
| > 2,000 | Typeahead search required |

The survey builder shows an informative hint when a SNOMED dataset is selected. The snomed_search endpoint (GET /surveys/datasets/snomed/search/?q=) powers the typeahead.
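
The thresholds above reduce to a small pure function. The function name choose_widget and the returned widget labels are hypothetical; only the boundaries come from the table.

```python
def choose_widget(member_count: int) -> str:
    """Map a seeded snomed_member_count to a widget label.

    Boundaries follow the table: under 500, 500 to 2,000 inclusive,
    and over 2,000.
    """
    if member_count < 500:
        return "select"
    if member_count <= 2000:
        return "combobox"
    return "typeahead"
```

Because the count is recorded at seed time, the chosen widget only changes when the descriptor is re-seeded, not on every snomed.db update.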


Dataset Loading Paths

For full detail on how SNOMED options flow from snomed.db through to the respondent view, the survey builder, and the REST API, see Dataset Loading Architecture.

Key points:

  • _inject_dataset_options(questions) in surveys/views.py is the render-time bridge โ€” it materialises SNOMED options onto q.options for the respondent template
  • When snomed.db is absent, q.snomed_unavailable = True is set so the template renders an informative alert rather than an empty <select>
  • The REST API (DataSetSerializer.to_representation()) resolves live SNOMED options as a {sctid: preferred_term} dict, returning snomed_unavailable: true on error
  • CSV export resolves stored SCTIDs to preferred terms using a pre-built per-question lookup table before streaming rows

Response storage: SNOMED answers are stored as SCTIDs, not display terms. SCTIDs are stable identifiers that survive SNOMED release cycles. Preferred terms are resolved at display or export time.
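
As an illustration of resolving stored SCTIDs at export time, the sketch below writes CSV rows from a pre-built lookup table and falls back to the raw SCTID when no term is available. The function name export_rows, the column layout, and the shape of answers are assumptions, not the actual export code.

```python
import csv
import io


def export_rows(answers, term_lookup):
    """Render (respondent, sctid) answers as CSV text.

    term_lookup is a {sctid: preferred_term} dict built once per question.
    Unknown SCTIDs are exported unchanged rather than dropped.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["respondent", "sctid", "term"])
    for respondent, sctid in answers:
        writer.writerow([respondent, sctid, term_lookup.get(sctid, sctid)])
    return buffer.getvalue()
```

Passing an empty lookup reproduces the degraded behaviour in which SCTIDs are exported as stored.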


Management Commands

seed_snomed_datasets

Creates or updates DataSet descriptor records for all curated refsets. Does not modify snomed.db. Safe to re-run (idempotent). Use --force to overwrite existing records.

python manage.py seed_snomed_datasets
python manage.py seed_snomed_datasets --force
python manage.py seed_snomed_datasets --dry-run

update_snomed_db

Checks TRUD for a newer SNOMED CT UK Monolith release. If one is found, downloads and rebuilds snomed.db via sct trud download --pipeline, then updates snomed_release_date on all SNOMED descriptor records.

Use --prune when running updates on constrained volumes. It removes stale tmp/, downloaded zip files, and old versioned uk_sct2*.db artefacts before the update check/build starts.

python manage.py update_snomed_db
python manage.py update_snomed_db --force    # re-download even if current
python manage.py update_snomed_db --dry-run  # check for new release only
python manage.py update_snomed_db --force --prune  # free space before rebuild

SNOMED CT UK is published infrequently. For self-hosted deployments, prefer a manual in-container update during a planned maintenance window (for example monthly or only when you want to refresh terminology), rather than a scheduled cron job.

See Scheduled Tasks for Northflank setup.


Environment Variables

# Path to snomed.db (default: /app/data/snomed.db)
SNOMED_DB_PATH=/app/data/snomed.db

# TRUD API key: required by update_snomed_db; not needed at request time
TRUD_API_KEY=your-trud-api-key

Setup and Self-Hosting

Prerequisites

  1. TRUD account with a subscription to SNOMED CT UK Monolith Edition (item 1799)
  2. sct binary in the Docker image (see Dockerfile)
  3. TRUD_API_KEY and SNOMED_DB_PATH set in the environment
  4. A persistent volume mounted at /app/data/ (or the configured path)

Building snomed.db

# Inside the web container: direct all output to the volume
docker compose exec web bash -c \
  "TRUD_API_KEY=\$TRUD_API_KEY sct trud download \
     --edition uk_monolith \
     --download-dir /app/data \
     --pipeline \
     --output /app/data/snomed.db"

# Then seed Postgres descriptors
docker compose exec web python manage.py seed_snomed_datasets

Self-Hosting Volumes

sct trud download --pipeline produces intermediate files before the final snomed.db:

| Step | Artefact | Approximate size |
|---|---|---|
| Download RF2 zip | Temp download | ~1.5 GB |
| sct ndjson | NDJSON intermediate | ~1 GB+ |
| sct sqlite | Final snomed.db | ~500 MB–1 GB |

Peak disk usage during a build can exceed 6 GB when temporary artefacts and the existing snomed.db overlap. All intermediate and output files must be written to the mounted persistent volume (not container ephemeral storage). A 10 GB volume is the practical minimum; 20 GB is recommended for safer headroom.
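
A pre-flight check along these lines could guard a rebuild against running out of space. This is a sketch under stated assumptions: the function name has_build_headroom is hypothetical, and the 6 GB default simply mirrors the peak figure quoted above.

```python
import shutil

GB = 1000 ** 3  # decimal gigabytes, matching the figures in this document


def has_build_headroom(volume_path: str, required_gb: float = 6.0) -> bool:
    """Return True when the volume has at least required_gb of free space."""
    free_bytes = shutil.disk_usage(volume_path).free
    return free_bytes >= required_gb * GB
```

A caller might run this against /app/data before starting a forced rebuild and suggest --prune when it returns False.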

Run updates from inside the existing web container shell, and use --prune before forced rebuilds to reduce burst usage.

The vault-data volume is dedicated to HashiCorp Vault's Raft storage and must not be shared.

Health Check

The /healthz endpoint reports SNOMED status:

{
  "snomed": {
    "status": "ok",
    "db_path": "/app/data/snomed.db",
    "release_date": "2024-10-01"
  }
}

If snomed.db is absent or SNOMED_DB_PATH is unset, status is "unavailable" rather than an error.
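
The status logic can be sketched as a small function. The name snomed_health and the exact shape of the "unavailable" payload are assumptions; only the "ok" payload is shown in this document.

```python
import os


def snomed_health(db_path, release_date=None):
    """Build the SNOMED section of a health payload.

    A missing path or missing file is reported as a normal state,
    never raised as an exception.
    """
    if not db_path or not os.path.isfile(db_path):
        return {"snomed": {"status": "unavailable"}}
    return {
        "snomed": {
            "status": "ok",
            "db_path": db_path,
            "release_date": release_date,
        }
    }
```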


Graceful Degradation

The application functions without snomed.db. When SnomedUnavailableError is raised:

| Context | Behaviour |
|---|---|
| Dataset list | SNOMED datasets shown with an "Unavailable" badge |
| Dataset detail | Options column shows "SNOMED CT unavailable" |
| Survey respondent | Alert in place of the dropdown: "SNOMED CT terminology is currently unavailable" |
| Survey builder | Warning banner in the dataset picker; SNOMED datasets cannot be loaded |
| CSV export | SCTID stored as-is in the export; no resolution attempted |
| REST API | snomed_unavailable: true in the serialiser response |

No 500 errors are raised. SnomedUnavailableError is caught at every view and serialiser call site.


Future Work

User SNOMED Codelists

Survey creators sometimes need a hand-picked subset of SNOMED concepts specific to their service, for example "the 15 diagnosis codes used in our paediatric diabetes MDT". This is distinct from both CheckTick-curated refsets (top-down, platform-maintained) and plain custom datasets (not SNOMED-backed).

A user codelist would be stored as a DataSet with category="snomed" and source_type="user_codelist". Options would be snapshotted as {sctid: preferred_term} in Postgres at creation time (not live from snomed.db), since a frozen copy is appropriate for user-authored lists.

Prerequisites before implementation:

  • An FTS5 typeahead UI for searching the full 831k-concept vocabulary (the open concept search endpoint already exists)
  • A policy decision on clinical safety guardrails (e.g. active-only concepts, hierarchy constraints)
  • Governance model for sharing codelists within or across organisations

The "Create Custom Version" snapshot flow on the dataset detail page provides a partial equivalent for the majority of use cases.

ECL Expression Builder

The snomed_query_type = "ecl" path is implemented in SnomedResolver and can be used via the Django admin today. A guided ECL builder UI for technically confident users, with live preview and validation against snomed.db, would unlock more precise, user-authored constraint expressions without requiring SNOMED expertise.

Other Terminology Systems

The architecture is designed to extend. If sct gains support for additional terminologies, or equivalent SQLite files become available, CheckTick can add new resolvers following the same pattern (LoincResolver, etc.) without changes to the DataSet model or the UI layer. A consistent SQLite schema across systems (concepts table with id, preferred_term, FTS5) would allow a single generic TerminologyResolver.

ICD-11 is available via a free WHO REST API (id.who.int) and could be implemented as an external_api dataset source. DSM-5 is proprietary and out of scope unless a freely licensable structured form is identified.



Roadmap: Unified Dataset Creation Assistant

This roadmap describes a user-friendly flow where users ask for a list in plain language, the system checks existing datasets first, and only creates a new SNOMED-driven list when needed. The UI should hide ECL by default.

Product goals

  • Single entry point for dataset creation to reduce confusion between NHS DD and SNOMED
  • Deterministic source routing: NHS DD first, SNOMED second, never mixed in one dataset
  • Fast path for users with known concept IDs
  • LLM used as a constrained planner, not as the final source of truth
  • Full provenance and audit trail for generated lists

Scope in two features

  1. Create lists from SNOMED CT concept IDs
  2. Create lists from natural language via LLM with ECL under the hood, integrated with duplicate checks against NHS DD first

Phase plan

Phase 1: Source-aware assistant shell

Steps:

  • Add one "Describe the list you need" entry point in dataset creation UI
  • Implement retrieval-first routing service:
      • Search NHS DD datasets first
      • If no strong match, search SNOMED curated/hosted datasets
      • If no suitable match, offer "Create new draft"
  • Return source decision and confidence explanation in plain language

Suggested snippet (routing result contract):

from dataclasses import dataclass
from typing import Literal


@dataclass
class DatasetSuggestion:
    source: Literal["nhs_dd", "snomed", "new_snomed_draft"]
    reason: str
    dataset_key: str | None
    confidence: float
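
One way the routing priority could work is sketched below. The route function, its (dataset_key, confidence) input pairs, and the reuse of a single threshold for both sources are assumptions; the 0.75 value is only the suggested default from the Decision Log, which is still open.

```python
from dataclasses import dataclass
from typing import Literal, Optional


@dataclass
class DatasetSuggestion:
    source: Literal["nhs_dd", "snomed", "new_snomed_draft"]
    reason: str
    dataset_key: Optional[str]
    confidence: float


THRESHOLD = 0.75  # suggested default only; see Decision Log (Open)


def route(nhs_dd_matches, snomed_matches, threshold=THRESHOLD):
    """Pick a suggestion with fixed source priority: NHS DD, then SNOMED.

    Each *_matches argument is a list of (dataset_key, confidence) pairs
    produced by an upstream retrieval step.
    """
    for source, matches in (("nhs_dd", nhs_dd_matches), ("snomed", snomed_matches)):
        strong = [m for m in matches if m[1] >= threshold]
        if strong:
            key, score = max(strong, key=lambda m: m[1])
            return DatasetSuggestion(source, f"Existing {source} dataset matched", key, score)
    # Nothing strong enough in either source: offer a new draft.
    return DatasetSuggestion("new_snomed_draft", "No strong existing match", None, 0.0)
```

Note that a strong NHS DD match wins even when a SNOMED match scores higher, which is what makes the routing deterministic with respect to source.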

Phase 2: SNOMED concept ID fast path

Steps:

  • Add "Paste concept IDs" mode in the same assistant
  • Validate IDs against snomed.db and resolve preferred terms
  • Save as user-owned draft dataset with frozen options ({sctid: term})
  • Record SNOMED release metadata at creation time

Suggested snippet (ID validation service):

def resolve_snomed_ids(ids: list[str]) -> dict[str, str]:
    """Return validated {sctid: preferred_term}; raise on invalid IDs."""
    # Query concepts table for active concepts only.
    # Reject unknown/inactive IDs with actionable error details.
    ...
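
A fuller sketch of the stub above, for illustration only. It takes an explicit sqlite3 connection (unlike the stub), and it assumes a concepts table with id, preferred_term, and active columns, which may not match the real snomed.db schema. The InvalidSnomedIds exception is hypothetical.

```python
import sqlite3


class InvalidSnomedIds(ValueError):
    """Raised with the offending IDs so the UI can show actionable details."""

    def __init__(self, unknown, inactive):
        super().__init__(f"unknown={unknown} inactive={inactive}")
        self.unknown = unknown
        self.inactive = inactive


def resolve_snomed_ids(conn, ids):
    """Return {sctid: preferred_term} for active concepts; raise otherwise."""
    if not ids:
        return {}
    # One placeholder per ID; very long lists would need batching to stay
    # under SQLite's bound-parameter limit.
    placeholders = ",".join("?" for _ in ids)
    rows = conn.execute(
        f"SELECT id, preferred_term, active FROM concepts WHERE id IN ({placeholders})",
        list(ids),
    ).fetchall()
    found = {sctid: (term, active) for sctid, term, active in rows}
    unknown = [i for i in ids if i not in found]
    inactive = [i for i in ids if i in found and not found[i][1]]
    if unknown or inactive:
        raise InvalidSnomedIds(unknown, inactive)
    # Preserve the caller's ordering in the frozen mapping.
    return {i: found[i][0] for i in ids}
```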

Phase 3: Natural language to SNOMED draft (ECL hidden)

Steps:

  • Add LLM planning endpoint that outputs structured proposal JSON
  • Proposal includes:
      • candidate ECL (internal)
      • plain-language inclusion/exclusion rationale
      • assumptions and ambiguity notes
  • Execute proposal deterministically against snomed.db
  • Show preview (count, sample concepts, source badge) before save
  • Save as draft only; require explicit user confirmation to publish/share

Suggested snippet (LLM proposal schema):

from typing import TypedDict


class SnomedDraftProposal(TypedDict):
    user_intent: str
    candidate_ecl: str
    include_notes: list[str]
    exclude_notes: list[str]
    assumptions: list[str]
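
Since TypedDict performs no runtime checks, planner output would need explicit validation before any query runs. The sketch below is one possible shape: the function name validate_proposal and the error wording are assumptions, and it checks structure only, not whether the ECL itself parses.

```python
REQUIRED_FIELDS = {
    "user_intent": str,
    "candidate_ecl": str,
    "include_notes": list,
    "exclude_notes": list,
    "assumptions": list,
}


def validate_proposal(raw):
    """Return a list of problems; an empty list means the shape is usable."""
    if not isinstance(raw, dict):
        return ["proposal must be a JSON object"]
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in raw:
            problems.append(f"missing field: {name}")
        elif not isinstance(raw[name], expected):
            problems.append(f"{name} must be a {expected.__name__}")
    # Reject anything the planner invented beyond the agreed schema.
    extra = sorted(set(raw) - set(REQUIRED_FIELDS))
    if extra:
        problems.append(f"unexpected fields: {extra}")
    ecl = raw.get("candidate_ecl")
    if isinstance(ecl, str) and not ecl.strip():
        problems.append("candidate_ecl is empty")
    return problems
```

Returning a list rather than raising on the first problem lets the caller give the user all the feedback at once.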

Phase 4: Governance and drift management

Steps:

  • Add review statuses: draft, review, published
  • Enforce reviewer confirmation before published status
  • Add "revalidate against current SNOMED release" operation
  • Add drift report for inactive concepts and term changes
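
A drift report could be computed by comparing the frozen mapping saved with a draft against the current release. The function name drift_report and the report keys are hypothetical; the sketch assumes the caller supplies a {sctid: term} mapping of currently active concepts.

```python
def drift_report(frozen, current):
    """Compare a frozen {sctid: term} snapshot with the current release.

    frozen:  mapping saved when the list was created.
    current: mapping of active concepts in the release being checked.
    """
    inactive_or_missing = sorted(s for s in frozen if s not in current)
    term_changed = {
        s: (frozen[s], current[s])
        for s in frozen
        if s in current and current[s] != frozen[s]
    }
    return {"inactive_or_missing": inactive_or_missing, "term_changed": term_changed}
```

An empty report on both keys means the list can be revalidated against the new release without reviewer attention.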

Guardrails and non-negotiables

  • Do not mix NHS DD and SNOMED options within one dataset
  • Retrieval-first before any LLM generation
  • Active concepts only unless explicitly overridden by policy
  • Hard caps for generated list size and query execution time
  • Never expose raw internal exception messages in API responses
  • Persist provenance for each generated draft:
      • prompt
      • planner output
      • final query/ECL
      • SNOMED release date
      • creator/reviewer
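
The hard caps on list size and execution time could be enforced at the SQLite layer, as in the sketch below. The function name run_capped, the ListTooLarge exception, and the default limits are assumptions; the 2,000 row default mirrors the suggested preview cap in the Decision Log, which is still open.

```python
import sqlite3
import time


class ListTooLarge(Exception):
    """Raised when a generated list exceeds the configured size cap."""


def run_capped(conn, sql, params=(), max_rows=2000, max_seconds=5.0):
    """Run a query with a row cap and a wall-clock cap.

    A query that overruns max_seconds is aborted by SQLite and surfaces
    as sqlite3.OperationalError.
    """
    deadline = time.monotonic() + max_seconds
    # The handler runs every 1000 SQLite VM instructions; a truthy return
    # value aborts the statement in progress.
    conn.set_progress_handler(lambda: time.monotonic() > deadline, 1000)
    try:
        # Fetch one row beyond the cap so "exactly at cap" is still allowed.
        rows = conn.execute(sql, params).fetchmany(max_rows + 1)
    finally:
        conn.set_progress_handler(None, 0)
    if len(rows) > max_rows:
        raise ListTooLarge(f"query returned more than {max_rows} concepts")
    return rows
```

Both failure modes should be translated into plain-language feedback, in line with the rule against exposing raw exception messages.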

Suggested implementation components

  • DatasetIntentRouter service:
      • suggest_existing(query: str) -> list[DatasetSuggestion]
      • deterministic scoring with source priority (nhs_dd > snomed)
  • SnomedDraftBuilder service:
      • from_concept_ids(ids: list[str])
      • from_natural_language(prompt: str)
  • DatasetDraft model fields (or equivalent metadata fields):
      • origin_source, creation_mode, snomed_release_date, provenance_blob

Test coverage plan

Unit tests

  • Router priority:
      • NHS DD exact/strong match chosen over SNOMED alternatives
      • SNOMED fallback only when NHS DD no-match or low confidence
  • Source isolation:
      • creating datasets never mixes categories in options payload
  • SNOMED ID path:
      • invalid/inactive IDs rejected
      • valid IDs resolve to stable {sctid: term} mapping
  • LLM proposal validation:
      • malformed proposal rejected before query execution
      • empty/too-broad proposal handled with actionable feedback
  • Security behavior:
      • no raw exception leakage in user/API error messages

Integration tests

  • End-to-end natural language flow:
      • query -> suggestion -> preview -> draft saved
  • Duplicate prevention:
      • existing NHS DD list suggested instead of new SNOMED draft
  • Review lifecycle:
      • draft cannot be published without reviewer step
  • Drift check:
      • revalidation reports inactive/changed terms when release updates

Performance tests

  • Latency budget for assistant query path (router + preview)
  • Large generated set handling (cap, pagination, and timeout behavior)

Documentation plan

Update and/or add documentation when implementing this roadmap:

  • Update docs/datasets-and-dropdowns.md
      • explain unified assistant and source badges
      • include concept ID fast path guidance
  • Update docs/dataset-loading-architecture.md
      • include assistant routing and draft creation flow
      • document provenance fields and source-selection rules
  • Update docs/snomed-integration.md (this document)
      • move implemented roadmap items into main architecture sections
  • Add docs/llm-dataset-assistant.md (new)
      • prompt contract, guardrails, fallback behavior, and known limitations
  • Add API docs in docs/api.md
      • suggestion endpoint
      • preview endpoint
      • draft creation endpoint
      • validation/error response examples

Success criteria

  • Users can request lists in plain language without SNOMED/ECL expertise
  • Existing NHS DD datasets are reused whenever appropriate
  • SNOMED custom lists are reproducible, auditable, and reviewable
  • No category mixing and no sensitive error leakage
  • Draft-to-publish flow is clinically governable and test-backed

Decision Log (Open)

Use this section to track unresolved product and safety decisions during implementation. Keep each item updated with owner/date when a decision is made.

| Topic | Decision needed | Suggested default | Status |
|---|---|---|---|
| NHS DD vs SNOMED routing threshold | What confidence score should trigger "use existing NHS DD" vs "create SNOMED draft"? | Prefer NHS DD when confidence >= 0.75 and no explicit override in user intent | Open |
| Duplicate detection strictness | Should near-duplicate custom lists be blocked or warned? | Warn by default, block only when exact option-set match | Open |
| Generated list size cap | Maximum concept count for draft generation and preview? | Cap at 2,000 for immediate preview, require refined query beyond cap | Open |
| Publish permissions | Who can publish SNOMED drafts to shared/global scope? | Org admin or designated clinical reviewer only | Open |
| Review signoff policy | Is one reviewer enough, or dual signoff for high-impact lists? | One reviewer for org-private lists, two for org-shared/global lists | Open |
| SNOMED release pinning | Should draft creation always pin to current release date? | Yes, mandatory release pin + revalidation prompt on release change | Open |
| LLM provider and fallback | Which hosted model(s), and what fallback behavior on timeout? | Primary hosted model + deterministic non-LLM fallback to retrieval suggestions only | Open |
| Prompt/data retention | Should user prompts used for list generation be retained? | Retain minimal provenance for audit, redact sensitive free text where possible | Open |
| User-visible rationale depth | How much explanation is shown to non-technical users? | Plain-language summary with expandable "details" panel | Open |
| Internationalization | Which languages are supported in assistant prompts and explanations? | Start with English only, add i18n once workflow stabilizes | Open |

Decision record template

When a row is decided, add a short note below using this template:

- [YYYY-MM-DD] Topic: <topic>
  Decision: <what was decided>
  Owner: <name/role>
  Rationale: <1-3 lines>
  Follow-up: <tickets/docs/tests>