FTRM: an open container format for FHIR terminology
FTRM
I’m pleased to announce the release of an open specification for terminology: FTRM
- SQLite is a fantastic application file format - SQLite database files are portable between 32-bit and 64-bit machines and between big-endian and little-endian architectures and between any of the various flavors of Windows and Unix-like operating systems.
- SQLite is accessible, portable and reliable
If we as a community agree an open specification, we can make it easier to meaningful share terminology data such as code systems, value sets and maps across systems.
Introduction
If you build clinical software, whether operational or as part of data analytics,
sooner or later you have to ship
terminology — e.g. SNOMED CT, a value set for a
research cohort, a concept map between two coding systems. The
FHIR terminology resources (CodeSystem, ValueSet, ConceptMap,
NamingSystem) are the right abstraction. But the distribution
unit — the file you actually copy around, install on a server, drop
into a CI pipeline, or hand to a colleague — has been a recurring
source of friction.
You can ship raw FHIR JSON. That works for a few hundred kilobytes of conformance resources, and falls apart at the scale of even a modest national subset. You can stand up a Postgres instance and load it from a vendor’s bespoke pipeline. That works for a single deployment, and breaks the moment you need a second one or want to hand a snapshot to someone outside your organisation. Or you can serve everything from a remote terminology service over HTTP — fine when you have the network, untenable when you do not. Making billions of server round-trips to a national terminology ‘server’ isn’t going to work - most throttle requests.
I’ve long felt being able have the flexibility to run software like this on your own machine is important, and it is ideal if that code is open-source.
Why a file format?
What’s been missing is a single file you can email, scp, drop in S3, or commit to a release artefact, that contains the canonical terminology resources, is fast enough to serve directly, and is portable between implementations.
I’d like to introduce: FTRM
— a FHIR Terminology Container. It is an open specification for a
single SQLite database that holds FHIR CodeSystem, ValueSet,
ConceptMap and NamingSystem resources, with enough structure to
serve the canonical terminology operations directly off the file —
$lookup, $validate-code, $expand, $translate, $subsumes.
A note on scope: FTRM v1 is FHIR-shaped. It does not define SNOMED CT’s expression constraint language (ECL) or reference-set semantics, and it is not where I currently put SNOMED. The reference implementation, hades, serves SNOMED CT through a separate dedicated store — hermes, built on LMDB and Lucene, with native ECL evaluation — and uses FTRM for LOINC and HL7 FHIR NPM packages. The composite catalogue dispatches by canonical URL, so from a client’s perspective it is one terminology server. The separation is intentional: SNOMED-specific operations benefit from SNOMED-specific data structures, and FTRM’s job is to be the boring, portable container for everything that fits the plain FHIR terminology shape.
Why SQLite
The choice of SQLite is deliberate. SQLite is the most-deployed
database in the world; it is in every browser, phone and operating
system. The file format is
stable for the long term and
has a published, public-domain reference implementation. There is
no daemon, no orchestrator, no clustering layer to install. A FTRM
file is just bytes — you can sha256sum it, version it, mirror it,
and verify it as you would any other release artefact.
And critically: a FTRM file is queryable with stock tooling. If
you want to know how many concepts a code system has, how its
hierarchy is shaped, or which value sets reference which systems,
you open the file with the sqlite3 CLI and write SQL. There are
no opaque binary blobs, no implementation-defined indexes you have
to reverse-engineer, no proprietary on-disk structures. Everything
is documented, including the foreign keys and the FTS5 indexes.
What’s in the file
The schema separates two kinds of fields: the ones a server needs to reason about, and the ones it has to round-trip without understanding.
The first set — case_sensitive, hierarchy_meaning,
status, experimental, publisher and so on — are real columns
that you can index and filter. The second set — copyright, contact
points, language extensions, useContext, the long tail of FHIR
metadata that varies between resources but that a terminology
server doesn’t dispatch on — round-trips through a single JSON
column called metadata. That keeps the schema honest about what
matters operationally, while preserving fidelity for everything else.
Concept properties follow the same idea. The four well-known
concept properties (inactive, abstract, notSelectable,
status) are projected onto columns of the concept table, so the
common filters are cheap. Everything else lives in
concept_property with FHIR’s value[x] polymorphism mirrored as
typed columns (value_str, value_int, value_bool, …).
The concept_ancestor table is the transitive closure of the
parent edges, with the shortest-path depth as a column. Subsumption
queries (is-a, descendent-of) become a single indexed lookup
rather than a recursive CTE. Writers rebuild the closure when they
write the file; readers trust it.
Free-text search uses SQLite’s FTS5, with unicode61
remove_diacritics 2 baked into the contract so cafe matches
café regardless of where the file was built. That detail matters
— if FTS tokenisation drifts between writers, the same query
returns different results on different files, and the format stops
being portable in any meaningful sense.
Versioning
FTRM stores an arbitrary number of versions of any resource. The
identity tuple is (resource_type, url, version). Selection between
versions is the caller’s responsibility, driven by the FHIR
system-version and force-system-version parameters at request
time. There is no ambient “current version” baked into the file —
which means a single FTRM file can hold this year’s LOINC release
alongside a frozen snapshot from two years ago, or successive
versions of hl7.terminology.r4, and a research pipeline can
reproduce its inputs by referring to versions explicitly.
Writes are replace-by-tuple: a writer that re-imports a
(CodeSystem, url, version) triple deletes the dependent rows
keyed by that tuple and re-inserts. There is no patch-in-place
journal. The file is a snapshot.
File identification and conformance
Every FTRM v1 file stamps two values into the SQLite header:
application_id = 0x4654524D (ASCII FTRM) and user_version = 1.
A reader that opens a file without the FTRM stamp must refuse it.
That’s a sanity gate: it stops a generic SQLite file being
mistakenly treated as a terminology container, and it gives file(1)
something to recognise.
The specification defines what it means to be a conforming
reader and a conforming writer. A reader that enumerates
tx_resource, registers each row as an addressable resource, and
honours the well-known property columns at lookup time — and the
FTS indexes for text search, and concept_ancestor for filter
pushdown — is conformant. A writer that produces files passing
PRAGMA foreign_key_check and rebuilds the ancestor closure and
FTS indexes is conformant. The list is short and operational, not
aspirational.
Why this matters
The point of an open specification is that no single vendor owns the format. The schema, identifiers and prose of FTRM v1 are released under CC0 — the public-domain dedication. Anyone can write a reader or a writer, in any language, without asking permission and without licensing friction. The reference implementation is hades, my open-source FHIR terminology server, and it passes 81.8% of the HL7 FHIR Terminology Ecosystem IG conformance suite at the time of writing. But the format does not depend on hades, and a writer in Python, Java, Rust or Go that follows the spec produces files that any other conforming reader can serve.
For the NHS, where I sit, this matters in a very specific way. We have years of accumulated experience with terminology distribution that has been awkward at best. SNOMED CT, dm+d, the NHS Data Dictionary, the Welsh national subsets — each comes with its own pipeline, its own format, its own assumptions about what a server will do with it. Layering a FHIR-shaped, SQLite-backed, vendor-neutral container on top of that does not solve the underlying provenance and governance questions, but it does mean that once those questions are answered, the artefact you ship is boring, ubiquitous and inspectable.
The same applies to research. A reproducible cohort study should
be able to declare its terminology inputs by file hash. Today, that
typically means archiving a tarball of source releases and a
loader script. With FTRM, it means archiving a single .db file
and a sha256 — and any reader can serve it.
What’s next
FTRM is at v1. It is stable; breaking changes will bump the schema
version and ship migration tooling. The spec lives in the hades
repository at
doc/ftrm.md,
and the reference DDL is the SQL file the implementation actually
runs against. There is no separate paper.
If you build, ship, or consume FHIR terminology, please read it, try it, and tell me where it is wrong. The schema is the contract; sharpening that contract is how the format becomes useful beyond my own work.
Mark