Feature importance for adverse drug event named entity recognition

Summer of Research project by Hamish Huggard, University of Auckland, supervised by Dr Edmond Zhang (Precision Driven Health and Orion Health).

Automatically identifying drug names, dosages and effects in health records could help researchers find relationships between certain medications and negative effects on patients.

An adverse drug event (ADE) is any harm to a patient’s health that occurs while they are taking a drug, whether or not the drug directly caused it. The cost of ADEs has become clearer over the last couple of decades: one study estimated the annual cost of preventable ADEs in a 700-bed hospital at $2.8 million. [1]

Hamish Huggard trained a neural network model to analyse health records and identify the names of drugs and related terms, then created more models with different combinations of features to see which features were the most useful.

Named entity recognition (NER) is a natural language processing task in which an algorithm identifies and classifies named entities such as people, places, objects and times. In this case the algorithm looks for drugs, dosages, reasons for giving drugs, methods of giving drugs, ADEs and similar terms.

The features of the algorithm Hamish tested were:

  • Word embeddings based on the MIMIC III database. Each word is assigned a vector so that words with similar meanings lie close together in the vector space.
  • Three lexical features. These describe whether a word is in any of three drug dictionaries.
  • Eight orthographic features. These describe the makeup of a word – uppercase letters, lowercase letters, numbers, symbols, whether the first letter is capitalised, and two features describing the “shape” of the word.
  • Two syntactic features. These describe the grammatical type of a word or group of words (noun, verb etc.) and how it fits into a sentence.
  • A character-level representation feature. This describes a word based on the characters it is made of.
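As a rough illustration of the orthographic and lexical features listed above, here is a minimal Python sketch. It is not the project's code: the function names, the feature names, the shape encoding (X for uppercase, x for lowercase, d for digit) and the example dictionary are all assumptions made for illustration, and it covers only the orthographic properties named in the list.

```python
import re


def word_shape(word, collapse=False):
    """Encode each character by class: X upper, x lower, d digit, others kept.

    With collapse=True, runs of the same class are squeezed to one symbol.
    (This encoding is an assumption; the source does not define its shapes.)
    """
    chars = []
    for ch in word:
        if ch.isupper():
            chars.append("X")
        elif ch.islower():
            chars.append("x")
        elif ch.isdigit():
            chars.append("d")
        else:
            chars.append(ch)
    shape = "".join(chars)
    if collapse:
        shape = re.sub(r"(.)\1+", r"\1", shape)
    return shape


def orthographic_features(word):
    """Describe the makeup of a word (hypothetical feature names)."""
    return {
        "has_upper": any(c.isupper() for c in word),
        "has_lower": any(c.islower() for c in word),
        "has_digit": any(c.isdigit() for c in word),
        "has_symbol": any(not c.isalnum() for c in word),
        "initial_cap": word[:1].isupper(),
        "shape": word_shape(word),
        "short_shape": word_shape(word, collapse=True),
    }


def lexical_features(word, dictionaries):
    """One boolean per dictionary: is the lowercased word listed in it?"""
    return {name: word.lower() in vocab for name, vocab in dictionaries.items()}


# Hypothetical usage with a made-up one-entry dictionary.
features = orthographic_features("10-mg")
lookup = lexical_features("Aspirin", {"drug_dictionary_a": {"aspirin"}})
```

In a real pipeline these values would be converted to numbers and joined to each word's embedding before being passed to the network.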

The model using all these features achieved an F1 score of 92.8%.
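The F1 score is the harmonic mean of precision (the share of predicted entities that are correct) and recall (the share of true entities that were found). The sketch below computes it over sets of entity spans. It assumes exact-match scoring on hypothetical (start, end, label) tuples; the source does not say whether the project scored whole entities or individual words.

```python
def f1_score(gold, predicted):
    """Harmonic mean of precision and recall over two collections of entities."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up example: two of three predicted entities match the gold annotations.
gold = {(0, 1, "Drug"), (2, 4, "Dosage"), (7, 9, "ADE")}
predicted = {(0, 1, "Drug"), (2, 4, "Dosage"), (5, 6, "ADE")}
score = f1_score(gold, predicted)  # precision 2/3, recall 2/3, so F1 is 2/3
```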

By testing feature types individually, Hamish found word embeddings to be the most useful feature, scoring 90.8% on their own, followed by character representations, then orthography, then syntax, then lexical features.

Lexical features performed surprisingly poorly, scoring only 12.6% on their own. It may not be worthwhile using them for this type of analysis.

Hamish also tested different window sizes – the number of surrounding words the algorithm considers for context to identify a word’s meaning. Larger windows proved to be more accurate, up to a window of 64 words.
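To make the idea of a window concrete, the following Python sketch extracts the context around one word. It assumes the window is centred on the target word, with half of the window size on each side and a padding token at the edges of the text. The source does not specify these details, so the function name, the padding token and the centring are illustrative only.

```python
def context_window(tokens, index, window, pad="<PAD>"):
    """Return the target token with window // 2 neighbours on each side.

    Positions that fall outside the text are filled with the pad token.
    """
    half = window // 2
    left = [tokens[i] if i >= 0 else pad for i in range(index - half, index)]
    right = [
        tokens[i] if i < len(tokens) else pad
        for i in range(index + 1, index + 1 + half)
    ]
    return left + [tokens[index]] + right


# Made-up sentence; a window of 4 gives two words of context on each side.
tokens = ["patient", "took", "warfarin", "5", "mg", "daily"]
middle = context_window(tokens, 2, 4)
start = context_window(tokens, 0, 4)
```

Here `middle` is `["patient", "took", "warfarin", "5", "mg"]`, while `start` begins with two padding tokens because no words precede the first one.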

Future work in this area could involve fine-tuning the algorithm or testing an entirely different algorithm architecture.

Hamish Huggard is one of 10 students who took part in the Summer of Research programme funded by Precision Driven Health. The research is at an early “proof of concept” stage. The projects offer fresh insights into what healthcare will look like when precision medicine is widely used.

1. David W Bates, Nathan Spell, David J Cullen, Elisabeth Burdick, Nan Laird, Laura A Petersen, Stephen D Small, Bobbie J Sweitzer, and Lucian L Leape. The costs of adverse drug events in hospitalised patients. JAMA, 277(4):307-311, 1997.