SNOMED CT Integration
This document describes the technical implementation of SNOMED CT terminology in CheckTick. For dataset loading paths and the survey runtime, see Dataset Loading Architecture. For user-facing guidance, see Datasets and Dropdowns.
Core Design Principle
SNOMED CT options are served live from a local SQLite database (snomed.db) generated by the sct Rust binary. The actual terminology is never copied into Postgres.
CheckTick holds one lightweight descriptor record per exposed refset in Postgres. When a dropdown or typeahead is rendered, Django queries snomed.db directly via the SnomedResolver service. This means:
- SNOMED terms are always current: update snomed.db and every dropdown updates immediately
- Postgres stays stable across SNOMED release cycles (no data migrations when terminology changes)
- Adding a new refset requires only a single descriptor row insert, with no data migration
- Self-hosters who have not set up SNOMED get graceful degradation, not a 500 error
What Stays in Postgres
Each exposed SNOMED list has a DataSet record in Postgres:
| Field | Type | Purpose |
|---|---|---|
| key | CharField | Stable identifier used in URLs and survey builder |
| name | CharField | Display name shown in the dataset browser |
| category | "snomed" | Routes all option fetching through SnomedResolver |
| source_type | "snomed_db" | Signals that the source is a local SQLite file, not an API or manual entry |
| snomed_refset_id | CharField | The SNOMED refset SCTID or root concept SCTID for hierarchy queries |
| snomed_query_type | CharField | One of refset, descendants, or ecl; controls how snomed.db is queried |
| snomed_ecl | TextField | ECL expression when snomed_query_type = "ecl" |
| snomed_release_date | DateField | Release date of the snomed.db used when this descriptor was seeded |
| snomed_member_count | IntegerField | Concept count recorded at seed time; used to select the appropriate UI widget |
| is_featured | BooleanField | True = shown prominently; False = hidden by default but searchable |
| options | JSONField | Always [] for SNOMED datasets; never written to |
The DataSet record controls discoverability and access. It does not hold terminology data.
What Lives in snomed.db
The SQLite file is generated by sct trud download --edition uk_monolith --pipeline and stored on a persistent volume (never committed to git). It contains concepts, descriptions, relationships, refsets, and FTS5 full-text indexes for the full UK SNOMED CT Monolith edition.
Default path: /app/data/snomed.db (override with SNOMED_DB_PATH env var).
The file is approximately 500 MB to 1 GB at rest. Intermediate build artefacts peak at around 3.5 GB on disk during an sct rebuild; see Self-Hosting Volumes.
SnomedResolver Service
checktick_app/surveys/snomed_resolver.py is the single point of contact between Django and snomed.db.
Public API
from surveys.snomed_resolver import get_options, search, SnomedUnavailableError
# Returns list of "SCTID | preferred term" strings for a given DataSet descriptor
options = get_options(dataset) # raises SnomedUnavailableError if snomed.db absent
# FTS5 full-text search: returns [{id: sctid, text: preferred_term}, ...]
results = search(query="epilepsy", limit=20)
SnomedUnavailableError is raised when snomed.db is not found at the configured path. It must always be caught at the view layer and never allowed to propagate as a 500.
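As a minimal sketch of catching the error at the view layer: the `SnomedUnavailableError` and `search` defined here are stand-ins so the block runs on its own (the real ones are imported from surveys.snomed_resolver), and `snomed_search_payload` and its response shape are hypothetical, not CheckTick's actual view code.

```python
class SnomedUnavailableError(Exception):
    """Stand-in for surveys.snomed_resolver.SnomedUnavailableError."""


def search(query: str, limit: int = 20) -> list:
    """Stand-in for surveys.snomed_resolver.search; simulates a missing snomed.db."""
    raise SnomedUnavailableError("snomed.db not found")


def snomed_search_payload(query: str, limit: int = 20, search_fn=search) -> dict:
    """Build a JSON-serialisable payload without letting the error escape.

    Hypothetical helper: a view could return this dict with HTTP 200.
    """
    try:
        results = search_fn(query=query, limit=limit)
    except SnomedUnavailableError:
        # Degrade gracefully instead of propagating as a 500.
        return {"results": [], "snomed_unavailable": True}
    return {"results": results, "snomed_unavailable": False}
```

Passing the search function as a parameter is only a convenience for testing the sketch; a real view would call the resolver directly inside the same try/except.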
Connection Strategy
snomed.db is opened read-only (?mode=ro) using Python's built-in sqlite3 module. Connections are held in threading.local() storage so that each Django worker thread reuses its own connection without locking. Django's ORM is not involved.
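The connection strategy above could look roughly like the following. This is a sketch under assumptions, not the contents of snomed_resolver.py: `get_connection` is a hypothetical name, the exception is a local stand-in, and the per-thread cache is keyed by path purely so the example behaves predictably.

```python
import sqlite3
import threading
from pathlib import Path

_local = threading.local()  # one cache per worker thread


class SnomedUnavailableError(Exception):
    """Stand-in for the exception exported by snomed_resolver."""


def get_connection(db_path: str) -> sqlite3.Connection:
    """Return this thread's read-only connection, opening it on first use."""
    cache = getattr(_local, "conns", None)
    if cache is None:
        cache = _local.conns = {}
    path = Path(db_path).resolve()
    key = str(path)
    if key not in cache:
        if not path.is_file():
            raise SnomedUnavailableError(f"snomed.db not found at {key}")
        # A file: URI with mode=ro makes SQLite refuse any write.
        cache[key] = sqlite3.connect(path.as_uri() + "?mode=ro", uri=True)
    return cache[key]
```

Because each thread only ever touches its own connection, the default `check_same_thread=True` setting of `sqlite3.connect` does not need to be relaxed.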
Query Types
| snomed_query_type | SQL strategy |
|---|---|
| refset | SELECT conceptId, term FROM refset_members WHERE refsetId = ? |
| descendants | Recursive CTE walking the is_a relationship from the root SCTID |
| ecl | ECL expression evaluated via sct's embedded parser (subset of full ECL) |
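To illustrate the descendants strategy, here is a small recursive CTE run against a toy in-memory table. The `relationships(sourceId, destinationId, typeId)` schema is an assumption made for the example and may not match the tables sct actually writes; 116680003 is the SNOMED CT concept ID for "Is a".

```python
import sqlite3

IS_A = "116680003"  # SNOMED CT "Is a" relationship type

# Assumed toy schema: relationships(sourceId, destinationId, typeId),
# where sourceId "is a" destinationId.
DESCENDANTS_SQL = """
WITH RECURSIVE descendants(id) AS (
    SELECT sourceId FROM relationships
    WHERE destinationId = ? AND typeId = ?
    UNION
    SELECT r.sourceId FROM relationships r
    JOIN descendants d ON r.destinationId = d.id
    WHERE r.typeId = ?
)
SELECT id FROM descendants ORDER BY id
"""


def descendants(conn: sqlite3.Connection, root: str) -> list:
    """Return every concept reachable from root by following is_a edges downwards."""
    rows = conn.execute(DESCENDANTS_SQL, (root, IS_A, IS_A))
    return [row[0] for row in rows]
```

`UNION` (rather than `UNION ALL`) removes duplicates, which matters because a SNOMED concept can have several parents and would otherwise be visited once per path.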
Curated Refsets
Featured vs Non-Featured
The dataset browser defaults to showing only is_featured = True SNOMED datasets. The ?show_all_snomed=1 query parameter reveals the full set.
| is_featured | Meaning | Default visibility |
|---|---|---|
| True | Clinically meaningful, curated for NHS use | Shown prominently in the dataset browser |
| False | Administrative, technical, or bulk content | Hidden by default; surfaced via search |
Currently Seeded Refsets (Featured)
The seed_snomed_datasets management command seeds the following featured datasets. Sizes are recorded at seed time as snomed_member_count.
| Dataset | Query type | Typical size |
|---|---|---|
| Antiepileptic drug list (QOF) | refset | ~40 |
| Diabetes drug list (QOF) | refset | ~60 |
| Atrial fibrillation drug list (QOF) | refset | ~20 |
| COPD/Asthma drug list (QOF) | refset | ~50 |
| UK Ethnic Category (NHS) | refset | ~30 |
| UK Allergy Substances | refset | ~300 |
| Paediatric epilepsy syndromes | refset | ~50 |
| Paediatric endocrine disorders | refset | ~80 |
| Paediatric cardiac conditions | refset | ~100 |
| Paediatric respiratory conditions | refset | ~80 |
| Rare chromosomal conditions | refset | ~60 |
| Paediatric neuromuscular disorders | refset | ~70 |
| Paediatric epilepsy genes | refset | ~40 |
| Paediatric renal conditions | refset | ~60 |
| Paediatric GI conditions | refset | ~80 |
Non-Featured Descriptors
dm+d VTM (~1,000 drug substances) and VMP (~20,000 medicinal products) are seeded as is_featured = False. They are present in snomed.db and available via search or admin promotion, but are not surfaced as survey dropdown options because they are too large and too broad for a constrained question.
Widget Selection by Member Count
snomed_member_count (stored at seed time) determines the interaction widget at survey build and render time:
| Member count | Widget |
|---|---|
| < 500 | Standard <select> dropdown |
| 500 to 2,000 | Searchable select (combobox) |
| > 2,000 | Typeahead search required |
The survey builder shows an informative hint when a SNOMED dataset is selected. The snomed_search endpoint (GET /surveys/datasets/snomed/search/?q=) powers the typeahead.
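The thresholds in the table can be expressed as a small function. The function name and the returned widget labels are hypothetical; only the 500 and 2,000 boundaries come from this document.

```python
def widget_for(member_count: int) -> str:
    """Map snomed_member_count to a widget label (labels are illustrative)."""
    if member_count < 500:
        return "select"      # standard <select> dropdown
    if member_count <= 2000:
        return "combobox"    # searchable select
    return "typeahead"       # typeahead search required
```

With these boundaries, every currently seeded featured refset (at most ~300 members) would render as a standard dropdown, while the ~20,000-member VMP list would require typeahead.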
Dataset Loading Paths
For full detail on how SNOMED options flow from snomed.db through to the respondent view, the survey builder, and the REST API, see Dataset Loading Architecture.
Key points:
- _inject_dataset_options(questions) in surveys/views.py is the render-time bridge: it materialises SNOMED options onto q.options for the respondent template
- When snomed.db is absent, q.snomed_unavailable = True is set so the template renders an informative alert rather than an empty <select>
- The REST API (DataSetSerializer.to_representation()) resolves live SNOMED options as a {sctid: preferred_term} dict, returning snomed_unavailable: true on error
- CSV export resolves stored SCTIDs to preferred terms using a pre-built per-question lookup table before streaming rows
Response storage: SNOMED answers are stored as SCTIDs, not display terms. SCTIDs are stable identifiers that survive SNOMED release cycles. Preferred terms are resolved at display or export time.
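A rough sketch of export-time resolution, assuming options arrive in the "SCTID | preferred term" string form returned by get_options. Both function names are hypothetical, and the IDs in the usage example are placeholders rather than real SCTIDs.

```python
def build_lookup(options: list) -> dict:
    """Parse 'SCTID | preferred term' strings into a {sctid: term} table."""
    lookup = {}
    for option in options:
        sctid, sep, term = option.partition("|")
        if sep:  # skip strings that do not contain the separator
            lookup[sctid.strip()] = term.strip()
    return lookup


def display_value(stored_sctid: str, lookup: dict) -> str:
    """Resolve a stored SCTID, falling back to the raw SCTID if it is unknown."""
    return lookup.get(stored_sctid, stored_sctid)
```

For example, `display_value("1001", build_lookup(["1001 | Example term A"]))` yields the term, while an empty lookup (as when snomed.db is unavailable) leaves the SCTID in the export as-is.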
Management Commands
seed_snomed_datasets
Creates or updates DataSet descriptor records for all curated refsets. Does not modify snomed.db. Safe to re-run (idempotent). Use --force to overwrite existing records.
python manage.py seed_snomed_datasets
python manage.py seed_snomed_datasets --force
python manage.py seed_snomed_datasets --dry-run
update_snomed_db
Checks TRUD for a newer SNOMED CT UK Monolith release. If one is found, downloads and rebuilds snomed.db via sct trud download --pipeline, then updates snomed_release_date on all SNOMED descriptor records.
Use --prune when running updates on constrained volumes. It removes stale tmp/, downloaded zip files, and old versioned uk_sct2*.db artefacts before the update check/build starts.
python manage.py update_snomed_db
python manage.py update_snomed_db --force # re-download even if current
python manage.py update_snomed_db --dry-run # check for new release only
python manage.py update_snomed_db --force --prune # free space before rebuild
SNOMED CT UK is published infrequently. For self-hosted deployments, prefer a manual in-container update during a planned maintenance window (for example monthly or only when you want to refresh terminology), rather than a scheduled cron job.
See Scheduled Tasks for Northflank setup.
Environment Variables
# Path to snomed.db (default: /app/data/snomed.db)
SNOMED_DB_PATH=/app/data/snomed.db
# TRUD API key: required by update_snomed_db; not needed at request time
TRUD_API_KEY=your-trud-api-key
Setup and Self-Hosting
Prerequisites
- TRUD account with a subscription to SNOMED CT UK Monolith Edition (item 1799)
- sct binary in the Docker image (see Dockerfile)
- TRUD_API_KEY and SNOMED_DB_PATH set in the environment
- A persistent volume mounted at /app/data/ (or the configured path)
Building snomed.db
# Inside the web container: direct all output to the volume
docker compose exec web bash -c \
"TRUD_API_KEY=\$TRUD_API_KEY sct trud download \
--edition uk_monolith \
--download-dir /app/data \
--pipeline \
--output /app/data/snomed.db"
# Then seed Postgres descriptors
docker compose exec web python manage.py seed_snomed_datasets
Self-Hosting Volumes
sct trud download --pipeline produces intermediate files before the final snomed.db:
| Step | Artefact | Approximate size |
|---|---|---|
| Download RF2 zip | Temp download | ~1.5 GB |
| sct ndjson | NDJSON intermediate | ~1 GB+ |
| sct sqlite | Final snomed.db | ~500 MB to 1 GB |
Peak disk usage during a build can exceed 6 GB when temporary artefacts and the existing snomed.db overlap. All intermediate and output files must be written to the mounted persistent volume (not container ephemeral storage). A 10 GB volume is the practical minimum; 20 GB is recommended for safer headroom.
Run updates from a shell inside the existing web container, and use --prune before forced rebuilds to reduce burst usage.
The vault-data volume is dedicated to HashiCorp Vault's Raft storage and must not be shared.
Health Check
The /healthz endpoint reports SNOMED status:
{
"snomed": {
"status": "ok",
"db_path": "/app/data/snomed.db",
"release_date": "2024-10-01"
}
}
If snomed.db is absent or SNOMED_DB_PATH is unset, status is "unavailable", not an error.
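One way to produce that block is sketched here. The function name is hypothetical, the release date is assumed to be supplied by the caller (for instance from a descriptor record), and the exact shape of the "unavailable" payload is an assumption since only the "ok" shape is documented.

```python
import os
from typing import Optional


def snomed_health(db_path: Optional[str], release_date: Optional[str] = None) -> dict:
    """Report SNOMED status without raising when the database is missing."""
    if not db_path or not os.path.isfile(db_path):
        # Absent file or unset path is reported as a state, not an error.
        return {"snomed": {"status": "unavailable", "db_path": db_path}}
    return {
        "snomed": {
            "status": "ok",
            "db_path": db_path,
            "release_date": release_date,
        }
    }
```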
Graceful Degradation
The application functions without snomed.db. When SnomedUnavailableError is raised:
| Context | Behaviour |
|---|---|
| Dataset list | SNOMED datasets shown with an "Unavailable" badge |
| Dataset detail | Options column shows "SNOMED CT unavailable" |
| Survey respondent | Alert in place of the dropdown: "SNOMED CT terminology is currently unavailable" |
| Survey builder | Warning banner in the dataset picker; SNOMED datasets cannot be loaded |
| CSV export | SCTID stored as-is in the export; no resolution attempted |
| REST API | snomed_unavailable: true in the serialiser response |
No 500 errors are raised. SnomedUnavailableError is caught at every view and serialiser call site.
Future Work
User SNOMED Codelists
Survey creators sometimes need a hand-picked subset of SNOMED concepts specific to their service, for example "the 15 diagnosis codes used in our paediatric diabetes MDT". This is distinct from both CheckTick-curated refsets (top-down, platform-maintained) and plain custom datasets (not SNOMED-backed).
A user codelist would be stored as a DataSet with category="snomed" and source_type="user_codelist". Options would be snapshotted as {sctid: preferred_term} in Postgres at creation time (not live from snomed.db), since a frozen copy is appropriate for user-authored lists.
Prerequisites before implementation:
- An FTS5 typeahead UI for searching the full 831k-concept vocabulary (the open concept search endpoint already exists)
- A policy decision on clinical safety guardrails (e.g. active-only concepts, hierarchy constraints)
- Governance model for sharing codelists within or across organisations
The "Create Custom Version" snapshot flow on the dataset detail page provides a partial equivalent for the majority of use cases.
ECL Expression Builder
The snomed_query_type = "ecl" path is implemented in SnomedResolver and can be used via the Django admin today. A guided ECL builder UI for technically confident users, with live preview and validation against snomed.db, would unlock more precise, user-authored constraint expressions without requiring SNOMED expertise.
Other Terminology Systems
The architecture is designed to extend. If sct gains support for additional terminologies, or equivalent SQLite files become available, CheckTick can add new resolvers following the same pattern (LoincResolver, etc.) without changes to the DataSet model or the UI layer. A consistent SQLite schema across systems (concepts table with id, preferred_term, FTS5) would allow a single generic TerminologyResolver.
ICD-11 is available via a free WHO REST API (id.who.int) and could be implemented as an external_api dataset source. DSM-5 is proprietary and out of scope unless a freely licensable structured form is identified.
Related Documentation
- Dataset Loading Architecture: full data flow for all dataset types
- Datasets and Dropdowns: user-facing guide
- Self-hosting Scheduled Tasks: manual SNOMED maintenance guidance
- sct project: the Rust binary that generates snomed.db
- NHS TRUD: source of SNOMED CT releases
Roadmap: Unified Dataset Creation Assistant
This roadmap describes a user-friendly flow where users ask for a list in plain language, the system checks existing datasets first, and only creates a new SNOMED-driven list when needed. The UI should hide ECL by default.
Product goals
- Single entry point for dataset creation to reduce confusion between NHS DD and SNOMED
- Deterministic source routing: NHS DD first, SNOMED second, never mixed in one dataset
- Fast path for users with known concept IDs
- LLM used as a constrained planner, not as the final source of truth
- Full provenance and audit trail for generated lists
Scope in two features
- Create lists from SNOMED CT concept IDs
- Create lists from natural language via LLM with ECL under the hood, integrated with duplicate checks against NHS DD first
Phase plan
Phase 1: Source-aware assistant shell
Steps:
- Add one "Describe the list you need" entry point in dataset creation UI
- Implement retrieval-first routing service:
  - Search NHS DD datasets first
  - If no strong match, search SNOMED curated/hosted datasets
  - If no suitable match, offer "Create new draft"
- Return source decision and confidence explanation in plain language
Suggested snippet (routing result contract):
from dataclasses import dataclass
from typing import Literal


@dataclass
class DatasetSuggestion:
    source: Literal["nhs_dd", "snomed", "new_snomed_draft"]
    reason: str
    dataset_key: str | None
    confidence: float
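A possible shape for the retrieval-first routing step, as a sketch only. It assumes each search returns `(dataset_key, score)` pairs with scores in [0, 1], and it uses the 0.75 threshold listed as a suggested default in the open decision log; `route` and its parameters are hypothetical names. The dataclass is repeated so the block is self-contained.

```python
from dataclasses import dataclass
from typing import Literal, Optional


@dataclass
class DatasetSuggestion:
    source: Literal["nhs_dd", "snomed", "new_snomed_draft"]
    reason: str
    dataset_key: Optional[str]
    confidence: float


def route(nhs_dd: list, snomed: list, threshold: float = 0.75) -> DatasetSuggestion:
    """Pick a source deterministically: NHS DD first, SNOMED second, else a new draft."""
    for source, matches in (("nhs_dd", nhs_dd), ("snomed", snomed)):
        if not matches:
            continue
        key, score = max(matches, key=lambda m: m[1])
        if score >= threshold:
            reason = f"Existing {source} dataset '{key}' matched the request"
            return DatasetSuggestion(source, reason, key, score)
    return DatasetSuggestion(
        "new_snomed_draft", "No existing dataset matched strongly enough", None, 0.0
    )
```

Because sources are checked in priority order, an NHS DD match at or above the threshold wins even when a SNOMED dataset scores higher, which is the behaviour the unit-test plan calls for.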
Phase 2: SNOMED concept ID fast path
Steps:
- Add "Paste concept IDs" mode in the same assistant
- Validate IDs against snomed.db and resolve preferred terms
- Save as user-owned draft dataset with frozen options ({sctid: term})
- Record SNOMED release metadata at creation time
Suggested snippet (ID validation service):
def resolve_snomed_ids(ids: list[str]) -> dict[str, str]:
    """Return validated {sctid: preferred_term}; raise on invalid IDs."""
    # Query concepts table for active concepts only.
    # Reject unknown/inactive IDs with actionable error details.
    ...
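One way the stub might be filled in is sketched below. It assumes a `concepts(id, preferred_term, active)` table, which is an illustrative schema and not necessarily what sct produces, and it takes the connection as an explicit argument; `InvalidSnomedIds` is a hypothetical exception name.

```python
import sqlite3


class InvalidSnomedIds(ValueError):
    """Carries the rejected IDs so the UI can show actionable details."""

    def __init__(self, unknown: list, inactive: list):
        super().__init__(f"unknown: {unknown}; inactive: {inactive}")
        self.unknown = unknown
        self.inactive = inactive


def resolve_snomed_ids(conn: sqlite3.Connection, ids: list) -> dict:
    """Return {sctid: preferred_term} for active concepts; raise on any bad ID."""
    # Trim whitespace and de-duplicate while preserving the pasted order.
    unique = list(dict.fromkeys(i.strip() for i in ids if i.strip()))
    if not unique:
        return {}
    placeholders = ",".join("?" * len(unique))
    rows = conn.execute(
        f"SELECT id, preferred_term, active FROM concepts WHERE id IN ({placeholders})",
        unique,
    ).fetchall()
    found = {str(r[0]): (r[1], bool(r[2])) for r in rows}
    unknown = [i for i in unique if i not in found]
    inactive = [i for i in unique if i in found and not found[i][1]]
    if unknown or inactive:
        raise InvalidSnomedIds(unknown, inactive)
    return {i: found[i][0] for i in unique}
```

SQLite limits the number of bound parameters per statement, so a real implementation would cap or batch very long pasted lists.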
Phase 3: Natural language to SNOMED draft (ECL hidden)
Steps:
- Add LLM planning endpoint that outputs structured proposal JSON
- Proposal includes:
  - candidate ECL (internal)
  - plain-language inclusion/exclusion rationale
  - assumptions and ambiguity notes
- Execute proposal deterministically against snomed.db
- Show preview (count, sample concepts, source badge) before save
- Save as draft only; require explicit user confirmation to publish/share
Suggested snippet (LLM proposal schema):
from typing import TypedDict


class SnomedDraftProposal(TypedDict):
    user_intent: str
    candidate_ecl: str
    include_notes: list[str]
    exclude_notes: list[str]
    assumptions: list[str]
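Since the planner output is untrusted, it should be checked before any query runs. The following is a minimal validation sketch: `validate_proposal`, `ProposalError`, and the `MAX_ECL_LENGTH` cap are all hypothetical, and the check covers structure only, not whether the ECL itself parses.

```python
from typing import List, TypedDict


class SnomedDraftProposal(TypedDict):
    user_intent: str
    candidate_ecl: str
    include_notes: List[str]
    exclude_notes: List[str]
    assumptions: List[str]


MAX_ECL_LENGTH = 2000  # hypothetical cap, not a documented limit


class ProposalError(ValueError):
    """Raised when planner output is malformed; safe to show to users."""


def validate_proposal(raw: object) -> SnomedDraftProposal:
    """Reject malformed planner output before anything touches snomed.db."""
    if not isinstance(raw, dict):
        raise ProposalError("proposal must be a JSON object")
    for key in ("user_intent", "candidate_ecl"):
        value = raw.get(key)
        if not isinstance(value, str) or not value.strip():
            raise ProposalError(f"{key} must be a non-empty string")
    if len(raw["candidate_ecl"]) > MAX_ECL_LENGTH:
        raise ProposalError("candidate_ecl is too long")
    for key in ("include_notes", "exclude_notes", "assumptions"):
        value = raw.get(key)
        if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
            raise ProposalError(f"{key} must be a list of strings")
    # Copy only the known keys so unexpected fields are dropped.
    return SnomedDraftProposal(
        user_intent=raw["user_intent"],
        candidate_ecl=raw["candidate_ecl"],
        include_notes=list(raw["include_notes"]),
        exclude_notes=list(raw["exclude_notes"]),
        assumptions=list(raw["assumptions"]),
    )
```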
Phase 4: Governance and drift management
Steps:
- Add review statuses: draft, review, published
- Enforce reviewer confirmation before published status
- Add "revalidate against current SNOMED release" operation
- Add drift report for inactive concepts and term changes
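The revalidation and drift report steps could be sketched as a pure comparison between a frozen option set and the current release. The function name, the input shapes, and the report keys below are assumptions for illustration.

```python
def drift_report(frozen: dict, current: dict) -> dict:
    """Compare frozen {sctid: term} with current {sctid: (term, active)}.

    Returns the IDs that are missing from the current release, those that
    are now inactive, and (sctid, old_term, new_term) tuples for renamed terms.
    """
    report = {"missing": [], "inactive": [], "term_changed": []}
    for sctid, old_term in frozen.items():
        entry = current.get(sctid)
        if entry is None:
            report["missing"].append(sctid)
            continue
        new_term, active = entry
        if not active:
            report["inactive"].append(sctid)
        elif new_term != old_term:
            report["term_changed"].append((sctid, old_term, new_term))
    return report
```

An empty report for all three keys would mean the frozen list is still valid against the current release.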
Guardrails and non-negotiables
- Do not mix NHS DD and SNOMED options within one dataset
- Retrieval-first before any LLM generation
- Active concepts only unless explicitly overridden by policy
- Hard caps for generated list size and query execution time
- Never expose raw internal exception messages in API responses
- Persist provenance for each generated draft:
  - prompt
  - planner output
  - final query/ECL
  - SNOMED release date
  - creator/reviewer
Suggested implementation components
- DatasetIntentRouter service:
  - suggest_existing(query: str) -> list[DatasetSuggestion]
  - deterministic scoring with source priority (nhs_dd > snomed)
- SnomedDraftBuilder service:
  - from_concept_ids(ids: list[str])
  - from_natural_language(prompt: str)
- DatasetDraft model fields (or equivalent metadata fields):
  - origin_source, creation_mode, snomed_release_date, provenance_blob
Test coverage plan
Unit tests
- Router priority:
  - NHS DD exact/strong match chosen over SNOMED alternatives
  - SNOMED fallback only when NHS DD no-match or low confidence
- Source isolation:
  - creating datasets never mixes categories in options payload
- SNOMED ID path:
  - invalid/inactive IDs rejected
  - valid IDs resolve to stable {sctid: term} mapping
- LLM proposal validation:
  - malformed proposal rejected before query execution
  - empty/too-broad proposal handled with actionable feedback
- Security behavior:
  - no raw exception leakage in user/API error messages
Integration tests
- End-to-end natural language flow:
  - query -> suggestion -> preview -> draft saved
- Duplicate prevention:
  - existing NHS DD list suggested instead of new SNOMED draft
- Review lifecycle:
  - draft cannot be published without reviewer step
- Drift check:
  - revalidation reports inactive/changed terms when release updates
Performance tests
- Latency budget for assistant query path (router + preview)
- Large generated set handling (cap, pagination, and timeout behavior)
Documentation plan
Update and/or add documentation when implementing this roadmap:
- Update docs/datasets-and-dropdowns.md
  - explain unified assistant and source badges
  - include concept ID fast path guidance
- Update docs/dataset-loading-architecture.md
  - include assistant routing and draft creation flow
  - document provenance fields and source-selection rules
- Update docs/snomed-integration.md (this document)
  - move implemented roadmap items into main architecture sections
- Add docs/llm-dataset-assistant.md (new)
  - prompt contract, guardrails, fallback behavior, and known limitations
- Add API docs in docs/api.md
  - suggestion endpoint
  - preview endpoint
  - draft creation endpoint
  - validation/error response examples
Success criteria
- Users can request lists in plain language without SNOMED/ECL expertise
- Existing NHS DD datasets are reused whenever appropriate
- SNOMED custom lists are reproducible, auditable, and reviewable
- No category mixing and no sensitive error leakage
- Draft-to-publish flow is clinically governable and test-backed
Decision Log (Open)
Use this section to track unresolved product and safety decisions during implementation. Keep each item updated with owner/date when a decision is made.
| Topic | Decision needed | Suggested default | Status |
|---|---|---|---|
| NHS DD vs SNOMED routing threshold | What confidence score should trigger "use existing NHS DD" vs "create SNOMED draft"? | Prefer NHS DD when confidence >= 0.75 and no explicit override in user intent | Open |
| Duplicate detection strictness | Should near-duplicate custom lists be blocked or warned? | Warn by default, block only when exact option-set match | Open |
| Generated list size cap | Maximum concept count for draft generation and preview? | Cap at 2,000 for immediate preview, require refined query beyond cap | Open |
| Publish permissions | Who can publish SNOMED drafts to shared/global scope? | Org admin or designated clinical reviewer only | Open |
| Review signoff policy | Is one reviewer enough, or dual signoff for high-impact lists? | One reviewer for org-private lists, two for org-shared/global lists | Open |
| SNOMED release pinning | Should draft creation always pin to current release date? | Yes, mandatory release pin + revalidation prompt on release change | Open |
| LLM provider and fallback | Which hosted model(s), and what fallback behavior on timeout? | Primary hosted model + deterministic non-LLM fallback to retrieval suggestions only | Open |
| Prompt/data retention | Should user prompts used for list generation be retained? | Retain minimal provenance for audit, redact sensitive free text where possible | Open |
| User-visible rationale depth | How much explanation is shown to non-technical users? | Plain-language summary with expandable "details" panel | Open |
| Internationalization | Which languages are supported in assistant prompts and explanations? | Start with English only, add i18n once workflow stabilizes | Open |
Decision record template
When a row is decided, add a short note below using this template:
- [YYYY-MM-DD] Topic: <topic>
Decision: <what was decided>
Owner: <name/role>
Rationale: <1-3 lines>
Follow-up: <tickets/docs/tests>