SNOMED CT Integration

This document describes the technical implementation of SNOMED CT terminology in CheckTick. For dataset loading paths and the survey runtime, see Dataset Loading Architecture. For user-facing guidance, see Datasets and Dropdowns.

Core Design Principle

SNOMED CT options are served live from a local SQLite database (snomed.db) generated by the sct Rust binary. The actual terminology is never copied into Postgres.

CheckTick holds one lightweight descriptor record per exposed refset in Postgres. When a dropdown or typeahead is rendered, Django queries snomed.db directly via the SnomedResolver service. This means:

SNOMED terms are always current — update snomed.db and every dropdown updates immediately
Postgres stays stable across SNOMED release cycles (no data migrations when terminology changes)
Adding a new refset requires only a single descriptor row insert — no data migration
Self-hosters who have not set up SNOMED get graceful degradation, not a 500 error

What Stays in Postgres

Each exposed SNOMED list has a DataSet record in Postgres:

Field	Type or fixed value	Purpose
`key`	`CharField`	Stable identifier used in URLs and survey builder
`name`	`CharField`	Display name shown in the dataset browser
`category`	`"snomed"`	Routes all option fetching through `SnomedResolver`
`source_type`	`"snomed_db"`	Signals that the source is a local SQLite file, not an API or manual entry
`snomed_refset_id`	`CharField`	The SNOMED refset SCTID or root concept SCTID for hierarchy queries
`snomed_query_type`	`CharField`	One of `refset`, `descendants`, or `ecl` — controls how `snomed.db` is queried
`snomed_ecl`	`TextField`	ECL expression when `snomed_query_type = "ecl"`
`snomed_release_date`	`DateField`	Release date of the `snomed.db` used when this descriptor was seeded
`snomed_member_count`	`IntegerField`	Concept count recorded at seed time; used to select the appropriate UI widget
`is_featured`	`BooleanField`	`True` = shown prominently; `False` = hidden by default but searchable
`options`	`JSONField`	Always `[]` for SNOMED datasets — never written to

The DataSet record controls discoverability and access. It does not hold terminology data.

What Lives in `snomed.db`

The SQLite file is generated by sct trud download --edition uk_monolith --pipeline and stored on a persistent volume (never committed to git). It contains concepts, descriptions, relationships, refsets, and FTS5 full-text indexes for the full UK SNOMED CT Monolith edition.

Default path: /app/data/snomed.db (override with SNOMED_DB_PATH env var).

The file is approximately 500 MB–1 GB at rest. Intermediate build artefacts add around 3.5 GB on disk during an sct rebuild, and total peak usage can be higher while the existing snomed.db is still present; see Self-Hosting Volumes.

SnomedResolver Service

checktick_app/surveys/snomed_resolver.py is the single point of contact between Django and snomed.db.

Public API

from surveys.snomed_resolver import get_options, search, SnomedUnavailableError

# Returns list of "SCTID | preferred term" strings for a given DataSet descriptor
options = get_options(dataset)          # raises SnomedUnavailableError if snomed.db absent

# FTS5 full-text search — returns [{id: sctid, text: preferred_term}, ...]
results = search(query="epilepsy", limit=20)

SnomedUnavailableError is raised when snomed.db is not found at the configured path. It must always be caught at the view layer — never allowed to propagate as a 500.
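The catch-at-the-call-site rule can be sketched without Django. This is a minimal illustration, not the project's code: `load_options_safely` is a hypothetical helper, and the exception class here is a local stand-in for the one exported by `snomed_resolver`.

```python
class SnomedUnavailableError(Exception):
    """Local stand-in for the exception exported by snomed_resolver."""


def load_options_safely(dataset, get_options):
    """Return (options, snomed_unavailable) without letting the error escape.

    `get_options` is passed in so the sketch stays framework-free; in the
    application it would be the resolver's get_options function.
    """
    try:
        return get_options(dataset), False
    except SnomedUnavailableError:
        # Degrade gracefully: no options, plus a flag the caller can render.
        return [], True
```

A view or serialiser could then branch on the second value to show an alert instead of an empty dropdown.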

Connection Strategy

snomed.db is opened read-only (?mode=ro) using Python's built-in sqlite3 module. Connections are held in threading.local() storage so that each Django worker thread reuses its own connection without locking. Django's ORM is not involved.
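A minimal sketch of that connection strategy, assuming a module-level `threading.local()` and a hypothetical `get_connection` helper (the real function names in `snomed_resolver.py` may differ):

```python
import sqlite3
import threading

_local = threading.local()  # one attribute namespace per worker thread


def get_connection(db_path: str) -> sqlite3.Connection:
    """Return this thread's read-only connection, opening it on first use."""
    conn = getattr(_local, "conn", None)
    if conn is None:
        # uri=True is needed for the ?mode=ro parameter to be honoured.
        # If the file does not exist, connect raises sqlite3.OperationalError.
        conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
        _local.conn = conn
    return conn
```

Because the connection is read-only, any accidental write raises `sqlite3.OperationalError` rather than modifying the terminology file. Note that this sketch caches one connection per thread and does not reopen it if `db_path` changes.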

Query Types

`snomed_query_type`	SQL strategy
`refset`	`SELECT conceptId, term FROM refset_members WHERE refsetId = ?`
`descendants`	Recursive CTE walking the `is_a` relationship from the root SCTID
`ecl`	ECL expression evaluated via `sct`'s embedded parser (subset of full ECL)
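The `descendants` strategy can be illustrated with a recursive CTE. The table and column names below (`relationships`, `source_id`, `destination_id`, `type_id`) are assumptions for the sketch and may not match the schema that `sct` actually generates; the only SNOMED fact relied on is that 116680003 is the "Is a" relationship type, with the child concept as the source and the parent as the destination.

```python
import sqlite3

IS_A = 116680003  # SCTID of the SNOMED CT "Is a" relationship type

# UNION (not UNION ALL) de-duplicates concepts reachable via several parents.
DESCENDANTS_SQL = """
WITH RECURSIVE descendants(id) AS (
    SELECT source_id FROM relationships
    WHERE destination_id = ? AND type_id = ?
    UNION
    SELECT r.source_id FROM relationships AS r
    JOIN descendants AS d ON r.destination_id = d.id
    WHERE r.type_id = ?
)
SELECT id FROM descendants ORDER BY id
"""


def descendants(conn: sqlite3.Connection, root: int) -> list:
    """Return every concept ID below `root` in the is-a hierarchy."""
    rows = conn.execute(DESCENDANTS_SQL, (root, IS_A, IS_A))
    return [row[0] for row in rows]
```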

Curated Refsets

Featured vs Non-Featured

The dataset browser defaults to showing only is_featured = True SNOMED datasets. The ?show_all_snomed=1 query parameter reveals the full set.

`is_featured`	Meaning	Default visibility
`True`	Clinically meaningful, curated for NHS use	Shown prominently in the dataset browser
`False`	Administrative, technical, or bulk content	Hidden by default; surfaced via search

Currently Seeded Refsets (Featured)

The seed_snomed_datasets management command seeds the following featured datasets. Sizes are recorded at seed time as snomed_member_count.

Dataset	Query type	Typical size
Antiepileptic drug list (QOF)	`refset`	~40
Diabetes drug list (QOF)	`refset`	~60
Atrial fibrillation drug list (QOF)	`refset`	~20
COPD/Asthma drug list (QOF)	`refset`	~50
UK Ethnic Category (NHS)	`refset`	~30
UK Allergy Substances	`refset`	~300
Paediatric epilepsy syndromes	`refset`	~50
Paediatric endocrine disorders	`refset`	~80
Paediatric cardiac conditions	`refset`	~100
Paediatric respiratory conditions	`refset`	~80
Rare chromosomal conditions	`refset`	~60
Paediatric neuromuscular disorders	`refset`	~70
Paediatric epilepsy genes	`refset`	~40
Paediatric renal conditions	`refset`	~60
Paediatric GI conditions	`refset`	~80

Non-Featured Descriptors

dm+d VTM (~1,000 drug substances) and VMP (~20,000 medicinal products) are seeded as is_featured = False. They are present in snomed.db and available via search or admin promotion, but are not surfaced as survey dropdown options because they are too large and too broad for a constrained question.

Widget Selection by Member Count

snomed_member_count (stored at seed time) determines the interaction widget at survey build and render time:

Member count	Widget
< 500	Standard `<select>` dropdown
500 – 2,000	Searchable select (combobox)
> 2,000	Typeahead search required
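The thresholds in the table translate directly into a small selection function. This is a sketch only: `choose_widget` and the returned widget names are hypothetical labels, not identifiers from the codebase.

```python
def choose_widget(member_count: int) -> str:
    """Map a descriptor's snomed_member_count to a widget name."""
    if member_count < 500:
        return "select"      # standard <select> dropdown
    if member_count <= 2000:
        return "combobox"    # searchable select
    return "typeahead"       # typeahead search required
```

For example, a refset of around 300 members such as UK Allergy Substances would get a plain dropdown, while the roughly 20,000-member VMP descriptor would require typeahead search.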

The survey builder shows an informative hint when a SNOMED dataset is selected. The snomed_search endpoint (GET /surveys/datasets/snomed/search/?q=) powers the typeahead.

Dataset Loading Paths

For full detail on how SNOMED options flow from snomed.db through to the respondent view, the survey builder, and the REST API, see Dataset Loading Architecture.

Key points:

_inject_dataset_options(questions) in surveys/views.py is the render-time bridge — it materialises SNOMED options onto q.options for the respondent template
When snomed.db is absent, q.snomed_unavailable = True is set so the template renders an informative alert rather than an empty <select>
The REST API (DataSetSerializer.to_representation()) resolves live SNOMED options as a {sctid: preferred_term} dict, returning snomed_unavailable: true on error
CSV export resolves stored SCTIDs to preferred terms using a pre-built per-question lookup table before streaming rows

Response storage: SNOMED answers are stored as SCTIDs, not display terms. SCTIDs are stable identifiers that survive SNOMED release cycles. Preferred terms are resolved at display or export time.
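The resolve-at-export idea can be sketched as follows. It assumes the `"SCTID | preferred term"` string shape returned by `get_options`; the helper names `build_lookup` and `resolve_answer` are hypothetical, and the IDs in the usage note are made up for illustration.

```python
def build_lookup(options):
    """Turn "SCTID | preferred term" strings into a {sctid: term} dict."""
    lookup = {}
    for option in options:
        sctid, sep, term = option.partition(" | ")
        if sep:  # ignore strings that are not in the expected shape
            lookup[sctid] = term
    return lookup


def resolve_answer(sctid, lookup):
    """Return the preferred term, or the SCTID unchanged if it is unknown."""
    return lookup.get(sctid, sctid)
```

Building the lookup once per question keeps the per-row work to a dictionary read. Passing an empty lookup leaves every SCTID as-is, which matches the behaviour described for exports when snomed.db is unavailable: `resolve_answer("1001", {})` returns `"1001"`.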

Management Commands

`seed_snomed_datasets`

Creates or updates DataSet descriptor records for all curated refsets. Does not modify snomed.db. Safe to re-run (idempotent). Use --force to overwrite existing records.

python manage.py seed_snomed_datasets
python manage.py seed_snomed_datasets --force
python manage.py seed_snomed_datasets --dry-run

`update_snomed_db`

Checks TRUD for a newer SNOMED CT UK Monolith release. If one is found, downloads and rebuilds snomed.db via sct trud download --pipeline, then updates snomed_release_date on all SNOMED descriptor records.

Use --prune when running updates on constrained volumes. It removes stale tmp/, downloaded zip files, and old versioned uk_sct2*.db artefacts before the update check/build starts.

python manage.py update_snomed_db
python manage.py update_snomed_db --force    # re-download even if current
python manage.py update_snomed_db --dry-run  # check for new release only
python manage.py update_snomed_db --force --prune  # free space before rebuild

SNOMED CT UK is published infrequently. For self-hosted deployments, prefer a manual in-container update during a planned maintenance window (for example monthly or only when you want to refresh terminology), rather than a scheduled cron job.

See Scheduled Tasks for Northflank setup.

Environment Variables

# Path to snomed.db (default: /app/data/snomed.db)
SNOMED_DB_PATH=/app/data/snomed.db

# TRUD API key — required by update_snomed_db; not needed at request time
TRUD_API_KEY=your-trud-api-key

Setup and Self-Hosting

Prerequisites

TRUD account with a subscription to SNOMED CT UK Monolith Edition (item 1799)
sct binary in the Docker image (see Dockerfile)
TRUD_API_KEY and SNOMED_DB_PATH set in the environment
A persistent volume mounted at /app/data/ (or the configured path)

Building `snomed.db`

# Inside the web container — direct all output to the volume
docker compose exec web bash -c \
  "TRUD_API_KEY=\$TRUD_API_KEY sct trud download \
     --edition uk_monolith \
     --download-dir /app/data \
     --pipeline \
     --output /app/data/snomed.db"

# Then seed Postgres descriptors
docker compose exec web python manage.py seed_snomed_datasets

Self-Hosting Volumes

sct trud download --pipeline produces intermediate files before the final snomed.db:

Step	Artefact	Approximate size
Download RF2 zip	Temp download	~1.5 GB
`sct ndjson`	NDJSON intermediate	~1 GB+
`sct sqlite`	Final `snomed.db`	~500 MB–1 GB

Peak disk usage during a build can exceed 6 GB when temporary artefacts and the existing snomed.db overlap. All intermediate and output files must be written to the mounted persistent volume (not container ephemeral storage). A 10 GB volume is the practical minimum; 20 GB is recommended for safer headroom.

Run updates from a shell inside the existing web container, and use --prune before forced rebuilds to reduce peak disk usage.

The vault-data volume is dedicated to HashiCorp Vault's Raft storage and must not be shared; do not store snomed.db or its build artefacts there.

Health Check

The /healthz endpoint reports SNOMED status:

{
  "snomed": {
    "status": "ok",
    "db_path": "/app/data/snomed.db",
    "release_date": "2024-10-01"
  }
}

If snomed.db is absent or SNOMED_DB_PATH is unset, status is "unavailable" — not an error.
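A minimal sketch of how the SNOMED portion of the health payload might be assembled. The function name `snomed_health` is hypothetical, the release date is passed in rather than looked up, and the exact shape of the "unavailable" payload is an assumption; only the `status` values come from the description above.

```python
import os


def snomed_health(db_path, release_date=None):
    """Build the "snomed" block of the health response."""
    if not db_path or not os.path.isfile(db_path):
        # A missing file is a supported state, not an error.
        return {"status": "unavailable", "db_path": db_path}
    return {"status": "ok", "db_path": db_path, "release_date": release_date}
```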

Graceful Degradation

The application functions without snomed.db. When SnomedUnavailableError is raised:

Context	Behaviour
Dataset list	SNOMED datasets shown with an "Unavailable" badge
Dataset detail	Options column shows "SNOMED CT unavailable"
Survey respondent	Alert in place of the dropdown: "SNOMED CT terminology is currently unavailable"
Survey builder	Warning banner in the dataset picker; SNOMED datasets cannot be loaded
CSV export	SCTID stored as-is in the export; no resolution attempted
REST API	`snomed_unavailable: true` in the serialiser response

No 500 errors are raised. SnomedUnavailableError is caught at every view and serialiser call site.

Future Work

User SNOMED Codelists

Survey creators sometimes need a hand-picked subset of SNOMED concepts specific to their service — for example, "the 15 diagnosis codes used in our paediatric diabetes MDT". This is distinct from both CheckTick-curated refsets (top-down, platform-maintained) and plain custom datasets (not SNOMED-backed).

A user codelist would be stored as a DataSet with category="snomed" and source_type="user_codelist". Options would be snapshotted as {sctid: preferred_term} in Postgres at creation time (not live from snomed.db), since a frozen copy is appropriate for user-authored lists.

Prerequisites before implementation:

An FTS5 typeahead UI for searching the full 831k-concept vocabulary (the open concept search endpoint already exists)
A policy decision on clinical safety guardrails (e.g. active-only concepts, hierarchy constraints)
Governance model for sharing codelists within or across organisations

The "Create Custom Version" snapshot flow on the dataset detail page provides a partial equivalent for the majority of use cases.

ECL Expression Builder

The snomed_query_type = "ecl" path is implemented in SnomedResolver and can be used via the Django admin today. A guided ECL builder UI for technically confident users — with live preview and validation against snomed.db — would unlock more precise, user-authored constraint expressions without requiring SNOMED expertise.

Other Terminology Systems

The architecture is designed to extend. If sct gains support for additional terminologies, or equivalent SQLite files become available, CheckTick can add new resolvers following the same pattern (LoincResolver, etc.) without changes to the DataSet model or the UI layer. A consistent SQLite schema across systems (concepts table with id, preferred_term, FTS5) would allow a single generic TerminologyResolver.

ICD-11 is available via a free WHO REST API (id.who.int) and could be implemented as an external_api dataset source. DSM-5 is proprietary and out of scope unless a freely licensable structured form is identified.

Related Documentation

Dataset Loading Architecture: full data flow for all dataset types
Datasets and Dropdowns: user-facing guide
Self-hosting Scheduled Tasks: manual SNOMED maintenance guidance
sct project: the Rust binary that generates snomed.db
NHS TRUD: source of SNOMED CT releases

Roadmap: Unified Dataset Creation Assistant

This roadmap describes a user-friendly flow where users ask for a list in plain language, the system checks existing datasets first, and only creates a new SNOMED-driven list when needed. The UI should hide ECL by default.

Product goals

Single entry point for dataset creation to reduce confusion between NHS DD and SNOMED
Deterministic source routing: NHS DD first, SNOMED second, never mixed in one dataset
Fast path for users with known concept IDs
LLM used as a constrained planner, not as the final source of truth
Full provenance and audit trail for generated lists

Scope in two features

Create lists from SNOMED CT concept IDs
Create lists from natural language via LLM with ECL under the hood, integrated with duplicate checks against NHS DD first

Phase plan

Phase 1: Source-aware assistant shell

Steps:

Add one "Describe the list you need" entry point in dataset creation UI
Implement retrieval-first routing service:
Search NHS DD datasets first
If no strong match, search SNOMED curated/hosted datasets
If no suitable match, offer "Create new draft"
Return source decision and confidence explanation in plain language

Suggested snippet (routing result contract):

from dataclasses import dataclass
from typing import Literal


@dataclass
class DatasetSuggestion:
    source: Literal["nhs_dd", "snomed", "new_snomed_draft"]
    reason: str
    dataset_key: str | None
    confidence: float
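One way the retrieval-first routing could use this contract is sketched below. Everything beyond the dataclass is an assumption: `route`, the two search callables, and the `(dataset_key, confidence)` return shape are hypothetical, and the 0.75 threshold is only the suggested default from the open decision log.

```python
from dataclasses import dataclass
from typing import Callable, Literal, Optional, Tuple

Match = Optional[Tuple[str, float]]  # (dataset_key, confidence) or None


@dataclass
class DatasetSuggestion:
    source: Literal["nhs_dd", "snomed", "new_snomed_draft"]
    reason: str
    dataset_key: Optional[str]
    confidence: float


def route(
    query: str,
    search_nhs_dd: Callable[[str], Match],
    search_snomed: Callable[[str], Match],
    threshold: float = 0.75,
) -> DatasetSuggestion:
    """Check NHS DD first, then SNOMED, then fall back to a new draft."""
    for source, search in (("nhs_dd", search_nhs_dd), ("snomed", search_snomed)):
        match = search(query)
        if match is not None and match[1] >= threshold:
            key, confidence = match
            reason = f"An existing {source} dataset matched the request"
            return DatasetSuggestion(source, reason, key, confidence)
    return DatasetSuggestion(
        "new_snomed_draft", "No existing dataset matched strongly enough", None, 0.0
    )
```

Because the sources are tried in a fixed order, a strong NHS DD match wins even when SNOMED scores higher, which is the deterministic priority the product goals describe.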

Phase 2: SNOMED concept ID fast path

Steps:

Add "Paste concept IDs" mode in the same assistant
Validate IDs against snomed.db and resolve preferred terms
Save as user-owned draft dataset with frozen options ({sctid: term})
Record SNOMED release metadata at creation time

Suggested snippet (ID validation service):

def resolve_snomed_ids(ids: list[str]) -> dict[str, str]:
    """Return validated {sctid: preferred_term}; raise on invalid IDs."""
    # Query concepts table for active concepts only.
    # Reject unknown/inactive IDs with actionable error details.
    ...
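A possible body for that stub, under stated assumptions: a `concepts` table with `id`, `preferred_term`, and `active` columns (the real snomed.db schema may differ), an explicit connection argument added to the signature, and a hypothetical `InvalidConceptIds` exception that carries the rejected IDs.

```python
import sqlite3


class InvalidConceptIds(ValueError):
    """Raised when any requested ID is unknown or inactive."""

    def __init__(self, rejected):
        super().__init__(f"Unknown or inactive concept IDs: {', '.join(rejected)}")
        self.rejected = rejected


def resolve_snomed_ids(conn: sqlite3.Connection, ids: list) -> dict:
    """Return validated {sctid: preferred_term}; raise on invalid IDs."""
    found = {}
    for sctid in ids:
        row = conn.execute(
            "SELECT preferred_term FROM concepts WHERE id = ? AND active = 1",
            (sctid,),
        ).fetchone()
        if row is not None:
            found[sctid] = row[0]
    # Report every rejected ID at once so the user can fix the list in one pass.
    rejected = [sctid for sctid in ids if sctid not in found]
    if rejected:
        raise InvalidConceptIds(rejected)
    return found
```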

Phase 3: Natural language to SNOMED draft (ECL hidden)

Steps:

Add LLM planning endpoint that outputs structured proposal JSON
Proposal includes:
candidate ECL (internal)
plain-language inclusion/exclusion rationale
assumptions and ambiguity notes
Execute proposal deterministically against snomed.db
Show preview (count, sample concepts, source badge) before save
Save as draft only; require explicit user confirmation to publish/share

Suggested snippet (LLM proposal schema):

from typing import TypedDict


class SnomedDraftProposal(TypedDict):
    user_intent: str
    candidate_ecl: str
    include_notes: list[str]
    exclude_notes: list[str]
    assumptions: list[str]
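Since a `TypedDict` is not checked at runtime, planner output would need explicit validation before any query runs. A minimal sketch, assuming the five fields above are all required; `validate_proposal` and its error strings are hypothetical.

```python
REQUIRED_FIELDS = {
    "user_intent": str,
    "candidate_ecl": str,
    "include_notes": list,
    "exclude_notes": list,
    "assumptions": list,
}


def validate_proposal(raw):
    """Return a list of problems; an empty list means the proposal is usable."""
    if not isinstance(raw, dict):
        return ["proposal must be a JSON object"]
    errors = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in raw:
            errors.append(f"missing field: {field}")
        elif not isinstance(raw[field], expected):
            errors.append(f"wrong type for field: {field}")
    ecl = raw.get("candidate_ecl")
    if isinstance(ecl, str) and not ecl.strip():
        errors.append("candidate_ecl is empty")
    return errors
```

Returning all problems at once, rather than raising on the first, makes it easier to give the user actionable feedback without exposing internal exception text.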

Phase 4: Governance and drift management

Steps:

Add review statuses: draft, review, published
Enforce reviewer confirmation before published status
Add "revalidate against current SNOMED release" operation
Add drift report for inactive concepts and term changes
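The drift report could be as simple as comparing a draft's frozen `{sctid: term}` snapshot with what the current release resolves. In this sketch `drift_report` is a hypothetical name, and `current` is assumed to contain only active concepts from the current snomed.db.

```python
def drift_report(frozen, current):
    """Compare a frozen {sctid: term} snapshot with the current release.

    Returns the IDs that no longer resolve to an active concept and the
    IDs whose preferred term has changed, as (old_term, new_term) pairs.
    """
    inactive = sorted(sctid for sctid in frozen if sctid not in current)
    term_changes = {
        sctid: (frozen[sctid], current[sctid])
        for sctid in frozen
        if sctid in current and frozen[sctid] != current[sctid]
    }
    return {"inactive": inactive, "term_changes": term_changes}
```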

Guardrails and non-negotiables

Do not mix NHS DD and SNOMED options within one dataset
Retrieval-first before any LLM generation
Active concepts only unless explicitly overridden by policy
Hard caps for generated list size and query execution time
Never expose raw internal exception messages in API responses
Persist provenance for each generated draft:
prompt
planner output
final query/ECL
SNOMED release date
creator/reviewer

Suggested implementation components

DatasetIntentRouter service:
suggest_existing(query: str) -> list[DatasetSuggestion]
deterministic scoring with source priority (nhs_dd > snomed)
SnomedDraftBuilder service:
from_concept_ids(ids: list[str])
from_natural_language(prompt: str)
DatasetDraft model fields (or equivalent metadata fields):
origin_source, creation_mode, snomed_release_date, provenance_blob

Test coverage plan

Unit tests

Router priority:
NHS DD exact/strong match chosen over SNOMED alternatives
SNOMED fallback only when NHS DD no-match or low confidence
Source isolation:
creating datasets never mixes categories in options payload
SNOMED ID path:
invalid/inactive IDs rejected
valid IDs resolve to stable {sctid: term} mapping
LLM proposal validation:
malformed proposal rejected before query execution
empty/too-broad proposal handled with actionable feedback
Security behavior:
no raw exception leakage in user/API error messages

Integration tests

End-to-end natural language flow:
query -> suggestion -> preview -> draft saved
Duplicate prevention:
existing NHS DD list suggested instead of new SNOMED draft
Review lifecycle:
draft cannot be published without reviewer step
Drift check:
revalidation reports inactive/changed terms when release updates

Performance tests

Latency budget for assistant query path (router + preview)
Large generated set handling (cap, pagination, and timeout behavior)

Documentation plan

Update and/or add documentation when implementing this roadmap:

Update docs/datasets-and-dropdowns.md
explain unified assistant and source badges
include concept ID fast path guidance
Update docs/dataset-loading-architecture.md
include assistant routing and draft creation flow
document provenance fields and source-selection rules
Update docs/snomed-integration.md (this document)
move implemented roadmap items into main architecture sections
Add docs/llm-dataset-assistant.md (new)
prompt contract, guardrails, fallback behavior, and known limitations
Add API docs in docs/api.md
suggestion endpoint
preview endpoint
draft creation endpoint
validation/error response examples

Success criteria

Users can request lists in plain language without SNOMED/ECL expertise
Existing NHS DD datasets are reused whenever appropriate
SNOMED custom lists are reproducible, auditable, and reviewable
No category mixing and no sensitive error leakage
Draft-to-publish flow is clinically governable and test-backed

Decision Log (Open)

Use this section to track unresolved product and safety decisions during implementation. Keep each item updated with owner/date when a decision is made.

Topic	Decision needed	Suggested default	Status
NHS DD vs SNOMED routing threshold	What confidence score should trigger "use existing NHS DD" vs "create SNOMED draft"?	Prefer NHS DD when confidence >= 0.75 and no explicit override in user intent	Open
Duplicate detection strictness	Should near-duplicate custom lists be blocked or warned?	Warn by default, block only when exact option-set match	Open
Generated list size cap	Maximum concept count for draft generation and preview?	Cap at 2,000 for immediate preview, require refined query beyond cap	Open
Publish permissions	Who can publish SNOMED drafts to shared/global scope?	Org admin or designated clinical reviewer only	Open
Review signoff policy	Is one reviewer enough, or dual signoff for high-impact lists?	One reviewer for org-private lists, two for org-shared/global lists	Open
SNOMED release pinning	Should draft creation always pin to current release date?	Yes, mandatory release pin + revalidation prompt on release change	Open
LLM provider and fallback	Which hosted model(s), and what fallback behavior on timeout?	Primary hosted model + deterministic non-LLM fallback to retrieval suggestions only	Open
Prompt/data retention	Should user prompts used for list generation be retained?	Retain minimal provenance for audit, redact sensitive free text where possible	Open
User-visible rationale depth	How much explanation is shown to non-technical users?	Plain-language summary with expandable "details" panel	Open
Internationalization	Which languages are supported in assistant prompts and explanations?	Start with English only, add i18n once workflow stabilizes	Open

Decision record template

When a row is decided, add a short note below using this template:

- [YYYY-MM-DD] Topic: <topic>
  Decision: <what was decided>
  Owner: <name/role>
  Rationale: <1-3 lines>
  Follow-up: <tickets/docs/tests>
