If a patient has a type of motor neurone disease, then…

If a patient takes an ACE inhibitor, then …

If a patient has a type of autoimmune disease, then …

Whether we are writing software to build compelling user-facing applications, building rules or decision support, or choosing categories from which we will run analyses such as understanding patient outcomes, we need to be able to process health and care data and make inferences.

My open-source software, codelists, generates versioned codelists for reproducible data pipelines and research.

In general, there are two ways to think about codelists and reproducibility.

The first is explicit human curation of a list of codes. This is the approach adopted by Ben Goldacre and the opencodelists team. You create and share codelists.

The second is to define a codelist using a declarative specification which can be used to dynamically - but reproducibly - generate the codelist.

Let’s look at an example. opencodelists has a codelist to specify terms that represent a referral to colorectal services under a 2-week wait (urgent) basis. See https://www.opencodelists.org/codelist/phc/2ww-referral-colorectal/7eac259d/#full-list

At the time of writing, this manually curated list includes one active concept and two inactive concepts.

276401000000108	Fast track referral for suspected colorectal cancer
276411000000105	Urgent cancer referral - colorectal
276421000000104	Urgent cancer referral - colorectal

While this is a useful set of curated terms, I would argue that it is better to define this codelist declaratively, as a constraint written in the SNOMED CT expression constraint language (ECL):

{
  "ecl": "<<276401000000108"
} 

Result:

=> #{276401000000108 276411000000105 276421000000104}

Based on a named versioned distribution of SNOMED, and defined versions of this tool, this specification can be used to generate a reproducible codelist. If SNOMED CT changes over time, this specification will continue to work, due to the semantic relationships within SNOMED CT. codelists can expand a set of codes to include now inactive concepts using historical associations.

Certainly for drugs, a declarative rules approach works better than the manually curated set of opencodelists. If new drugs of a type are added to the UK dictionary of medicines and devices (dm+d), then codelists will include those new drugs without any manual intervention, while manual curation requires continued monitoring and maintenance of code lists.
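
To make the difference concrete, here is a toy sketch in Python. The product identifiers and the two "releases" are invented purely for illustration. They are not real dm+d content, and this is not how codelists is implemented. It only shows why a rule keeps up with new releases while a fixed list does not.

```python
# Toy illustration only: the identifiers and releases below are invented,
# not real dm+d or SNOMED CT content.

# Each release maps a made-up product id to its ATC code.
RELEASE_1 = {1001: "L04AX07", 1002: "L04AX07", 2001: "C08CA01"}
# A later release adds a new product (1003) with the same ATC code.
RELEASE_2 = {**RELEASE_1, 1003: "L04AX07"}

# Manually curated list: a fixed set of ids captured from release 1.
CURATED = frozenset({1001, 1002})


def by_rule(release, atc_code):
    """Declarative rule: every product in the release with this ATC code."""
    return {pid for pid, atc in release.items() if atc == atc_code}


print(sorted(CURATED))                        # the curated list never changes
print(sorted(by_rule(RELEASE_2, "L04AX07")))  # the rule picks up product 1003
```

The curated set needs someone to notice product 1003 and add it by hand. The rule needs only the new release.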

How to use codelists

You can define codelists using a variety of means, such as

  • ICD-10 codes for diagnoses
  • ATC codes for drugs
  • SNOMED CT expressions in the expression constraint language (ECL).

You can combine these approaches for high sensitivity, or manually derive codelists using hand-crafted ECL for high specificity.

codelists is a simple wrapper around two other services - hermes and dmd. I think it is a nice example of composing discrete, but related services together to give more advanced functionality.

codelists operates:

  • as a library, so it can be embedded within another software package running on the Java Virtual Machine (JVM), written in, for example, Java or Clojure
  • as a microservice, so it can be used as an API by other software written in any language

The substrate for all codelists is SNOMED CT. That coding system is an ontology and terminology, and not simply a classification. That means we can use the relationships within SNOMED CT to derive more complete codelists.

If you only use the SNOMED CT ECL to define your codelists, then simply use hermes directly. You only need the additional functionality provided by codelists if you are building codelists from a combination of SNOMED CT ECL, ATC codes and ICD-10.

ATC maps are not provided as part of SNOMED CT, but are provided by the UK dm+d. ICD-10 maps are provided as part of SNOMED CT.

Using codelists

You can realise a codelist, expanding it to all of its codes. You can also test membership of a given code against a codelist.

All codelists, by default, expand to include historic codes. This will become configurable, but is the default for greater sensitivity at the expense of specificity. Different trade-offs might apply to your specific project.
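
As a rough sketch of what realisation, historic expansion and membership testing mean, consider the following Python. The identifiers and the HISTORICAL table are hypothetical stand-ins for SNOMED CT historical associations, and the function name is mine rather than part of codelists.

```python
# Hypothetical data: each active concept id maps to the inactive ids that
# historical associations link to it. All ids here are invented.
HISTORICAL = {100: {90, 91}, 200: {190}}


def with_historic(codes, historical=HISTORICAL):
    """Return the codes plus any inactive codes associated with them."""
    expanded = set(codes)
    for code in codes:
        expanded |= historical.get(code, set())
    return expanded


realised = with_historic({100})  # a realised codelist is just a set of codes
print(90 in realised)            # membership test for an inactive code
print(190 in realised)           # a code linked to a different concept
```

Including the inactive codes catches older records (greater sensitivity), at the risk of picking up concepts that were retired for being ambiguous (lower specificity).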

Boolean logic is supported, with arbitrary nesting of your codes using a simple DSL.

A codelist is defined as names and values in a map, with the names representing the codesystem and the values the specification.

{
  "ecl": "<<24700007"
} 

This defines a codelist using the SNOMED expression constraint language (ECL). While ECL v2.0 supports the use of historic associations within constraints, I usually recommend ignoring that ‘feature’ and instead defining whether and how historic associations are included as part of the API.

SNOMED CT, in the UK, includes the UK drug extension with a 1:1 map between SNOMED identifiers and drugs in the official UK drug index - dm+d (dictionary of medicines and devices). That means you can use a SNOMED expression to choose drugs:

{
  "ecl": "(<<24056811000001108|Dimethyl fumarate|) OR (<<12086301000001102|Tecfidera|) OR (<10363601000001109|UK Product| :10362801000001104|Has specific active ingredient| =<<724035008|Dimethyl fumarate|)"
}

Note how SNOMED ECL includes simple boolean logic.

But codelists supports other namespaced codesystems. For example:

{
  "atc": "L04AX07"
}

This will expand to a list of SNOMED identifiers that are mapped to exactly the ATC code L04AX07, and their descendants within the SNOMED hierarchy.

A SNOMED CT expression in the expression constraint language must be a valid expression. ICD-10 and ATC codes can be specified as an exact match (e.g. “G35”) or as a prefix (e.g. “G3*”). The latter will match against all codes that begin with “G3”.
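
The exact-versus-prefix rule can be sketched in a few lines of Python. This assumes that only a single trailing “*” acts as a wildcard. The function name is hypothetical and the real matching inside codelists may differ in detail.

```python
def matches(pattern: str, code: str) -> bool:
    """Return True if `code` satisfies `pattern`.

    A trailing "*" turns the pattern into a prefix match;
    without it, the code must match exactly.
    """
    if pattern.endswith("*"):
        return code.startswith(pattern[:-1])
    return code == pattern


icd10_codes = ["G35", "G36.0", "G36.1", "G40.0"]
print([c for c in icd10_codes if matches("G3*", c)])
print([c for c in icd10_codes if matches("G35", c)])
```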

Different codesystems can be combined using boolean operators and prefix notation:

{
  "or": [
    {
      "atc": "L04AX07"
    },
    {
      "atc": "L04AX08"
    },
    {
      "ecl": "(<10363601000001109|UK Product| :10362801000001104|Has specific active ingredient| =<<724035008|Dimethyl fumarate|)"
    }
  ]
}

This expands the ATC codes L04AX07 and L04AX08, and supplements the result with any other product containing dimethyl fumarate (DMF) as its active ingredient.

If multiple expressions are used, the default is to perform a logical OR. That means this is equivalent to the above expression:

[
  {
    "atc": "L04AX07"
  },
  {
    "atc": "L04AX08"
  },
  {
    "ecl": "(<10363601000001109|UK Product| :10362801000001104|Has specific active ingredient| =<<724035008|Dimethyl fumarate|)"
  }
]

Duplicate keys are not supported, but multiple expressions using different keys are.

{
  "atc": "L04AX07",
  "ecl": "(<10363601000001109|UK Product| :10362801000001104|Has specific active ingredient| =<<724035008|Dimethyl fumarate|)"
}

When no operator is explicitly provided, a logical ‘OR’ will be performed.

For concision, any key can take an array (vector), which is equivalent to an “or” over the same codesystem.

{
  "atc": [
    "L04AX07",
    "L04AX08"
  ]
}
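
One way to see that these shorthand forms are equivalent is to flatten each into the same list of single terms. The Python below is a minimal sketch under that reading of the rules above. The names are mine, and it deliberately ignores “and” and “not”.

```python
CODESYSTEMS = ("ecl", "atc", "icd10")


def to_terms(spec):
    """Flatten a spec into a list of (codesystem, value) terms joined by OR.

    Handles only the shorthand forms: bare lists, several keys in one map,
    array values, and an explicit "or". Other operators are out of scope.
    """
    if isinstance(spec, list):  # a bare list is an implicit OR
        return [term for item in spec for term in to_terms(item)]
    terms = []
    for key, value in spec.items():
        if key == "or":
            terms.extend(to_terms(value))
        elif key in CODESYSTEMS:
            values = value if isinstance(value, list) else [value]
            terms.extend((key, v) for v in values)
        else:
            raise ValueError(f"unsupported key in this sketch: {key!r}")
    return terms


explicit = {"or": [{"atc": "L04AX07"}, {"atc": "L04AX08"}]}
implicit = [{"atc": "L04AX07"}, {"atc": "L04AX08"}]
array = {"atc": ["L04AX07", "L04AX08"]}
print(to_terms(explicit) == to_terms(implicit) == to_terms(array))
```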

Boolean operators “and”, “or” and “not” can be nested arbitrarily for complex expressions.

codelists also supports ICD-10.

{
  "icd10": "G35*"
}

This will expand to include all terms that map to an ICD-10 code with the prefix “G35”, and their descendants.

The operator “not” must be defined within another term, or set of nested terms. The result will be the realisation of the first term, or set of nested terms, MINUS the realisation of the second term, or set of nested terms.

{
  "icd10": "G35",
  "not": {
    "ecl": "<24700007"
  }
}

Or, perhaps a more complex expression:

{
  "or": [
    {
      "icd10": "G35"
    },
    {
      "icd10": "G36.*"
    }
  ],
  "not": {
    "ecl": "<24700007"
  }
}

Or, more concisely:

{
  "icd10": [
    "G35",
    "G36.*"
  ],
  "not": {
    "ecl": "<24700007"
  }
}

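These will generate a set of codes that includes the concepts mapped to “G35” and to any ICD-10 code with the prefix “G36.”, but omits the descendants of 24700007 (multiple sclerosis). Note that “<” selects descendants only; use “<<” if the concept itself should be excluded as well.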
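
The boolean semantics described above can be sketched as a small recursive evaluator over sets. This is an illustration, not the codelists implementation. The resolve function and the TOY table are hypothetical stand-ins for the real hermes and dmd lookups, and the small integers are placeholders rather than SNOMED identifiers.

```python
def realise(spec, resolve):
    """Expand a spec into a set of ids.

    `resolve(codesystem, value)` returns the ids for a single term.
    Within a map, everything except "not" is unioned, then the
    realisation of "not" is subtracted.
    """
    if isinstance(spec, list):  # a bare list is an implicit OR
        return set().union(*(realise(s, resolve) for s in spec))
    included, excluded = set(), set()
    for key, value in spec.items():
        if key == "not":
            excluded |= realise(value, resolve)
        elif key == "or":
            included |= realise(value, resolve)
        elif key == "and":
            parts = [realise(s, resolve) for s in value]
            included |= set.intersection(*parts) if parts else set()
        else:  # a codesystem key such as "icd10", "atc" or "ecl"
            values = value if isinstance(value, list) else [value]
            for v in values:
                included |= resolve(key, v)
    return included - excluded


# Placeholder lookup table: invented ids standing in for real expansions.
TOY = {
    ("icd10", "G35"): {1, 2, 3},
    ("icd10", "G36.*"): {4, 5},
    ("ecl", "<24700007"): {2, 3},
}


def resolve(codesystem, value):
    return TOY.get((codesystem, value), set())


spec = {"icd10": ["G35", "G36.*"], "not": {"ecl": "<24700007"}}
print(sorted(realise(spec, resolve)))
```

Treating “not” as set difference is why it needs a sibling term: on its own there is nothing to subtract from.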

You can use wildcards. Here I directly use a running codelists HTTP server to expand a codelist defined as:

{
  "atc": "C08*"
}

This should give a codelist containing all calcium channel blockers.

http '127.0.0.1:8080/v1/codelists/expand?s={"atc":"C08*"}'

Result:

[
  374049007,
  13764411000001106,
  376841009,
  11160711000001108,
  893111000001107,
  29826211000001109,
  376754006,
  ...

You can customise how data are returned.

By default, a list of codes is returned.

To return identifier and name, use ‘as=names’

http '127.0.0.1:8080/v1/codelists/expand?s={"atc":"C08*"}&as=names'

Result:


[
  {
    "id": 374049007,
    "term": "Nisoldipine 20mg tablet"
  },
  {
    "id": 13764411000001106,
    "term": "Amlodipine 5mg tablets (Apotex UK Ltd)"
  },
  {
    "id": 376841009,
    "term": "Diltiazem malate 120 mg oral tablet"
  },
  {
    "id": 11160711000001108,
    "term": "Exforge 10mg/160mg tablets (Novartis Pharmaceuticals UK Ltd)"
  },
  {
    "id": 893111000001107,
    "term": "Tildiem LA 300 capsules (Sanofi)"
  },
  ...
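
The same requests can be made from code. The Python sketch below builds the URL used in the examples above, using only the standard library. It assumes the server accepts the “s” parameter percent-encoded, and the function names are my own rather than part of any client library.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the examples above; adjust host and port as needed.
BASE_URL = "http://127.0.0.1:8080/v1/codelists/expand"


def expand_url(spec, as_=None, base_url=BASE_URL):
    """Build the expand URL for a codelist spec, optionally with 'as'."""
    params = {"s": json.dumps(spec, separators=(",", ":"))}
    if as_ is not None:
        params["as"] = as_
    return f"{base_url}?{urlencode(params)}"


def expand(spec, as_=None):
    """Call a running codelists server and decode its JSON response."""
    with urlopen(expand_url(spec, as_)) as response:
        return json.load(response)


print(expand_url({"atc": "C08*"}, as_="names"))
```

Calling expand({"atc": "C08*"}) requires a server listening on 127.0.0.1:8080. expand_url can be used on its own to check what would be sent.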

For reproducible research, codelists will include information about how the codelist was generated, including the releases of SNOMED CT, dm+d and the different software versions. It should then be possible to reproduce the content of any codelist. At the moment, only the data versions are returned:

http 127.0.0.1:8080/v1/codelists/status

The following metadata will be returned:


{
  "dmd": {
    "releaseDate": "2022-05-05"
  },
  "hermes": [
    "© 2002-2021 International Health Terminology Standards Development Organisation (IHTSDO). All rights reserved. SNOMED CT®, was originally created by The College of American Pathologists. \"SNOMED\" and \"SNOMED CT\" are registered trademarks of the IHTSDO.",
    "32.12.0_20220413000001 UK drug extension",
    "32.12.0_20220413000001 UK clinical extension"
  ]
}
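
For a reproducible pipeline, you might store this metadata alongside the specification that produced a codelist. Here is a minimal Python sketch, assuming the status payload has the shape shown above. Skipping the entries that begin with “©” is my own heuristic for dropping the copyright notice, and the function names are hypothetical.

```python
def summarise_status(status):
    """Pull the release identifiers out of a status payload."""
    return {
        "dmd_release": status["dmd"]["releaseDate"],
        # Heuristic: drop the copyright notice, keep entries naming a release.
        "snomed_releases": [s for s in status["hermes"] if not s.startswith("©")],
    }


def provenance(spec, status):
    """Pair a codelist spec with the data versions used to realise it."""
    return {"spec": spec, "versions": summarise_status(status)}


example_status = {
    "dmd": {"releaseDate": "2022-05-05"},
    "hermes": [
        "© 2002-2021 International Health Terminology Standards Development Organisation (IHTSDO). All rights reserved.",
        "32.12.0_20220413000001 UK drug extension",
        "32.12.0_20220413000001 UK clinical extension",
    ],
}
print(provenance({"atc": "C08*"}, example_status))
```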