Evaluating Biomedical Word Embeddings

Summer of Research project by student Aaron Zhang (The University of Auckland), supervised by Dr Edmond Zhang (Orion Health).

It takes time and money to train specialist medical algorithms, and medical-specific data is hard to obtain for research, but the payoff may well be worth it.

Aaron Zhang’s research shows machine learning algorithms based on medical databases are more accurate than those trained on Wikipedia and other general datasets for correctly identifying drugs and medical conditions in text.

Deep learning is showing promise for helping medical researchers investigate subjects better and faster. One of the limitations is a lack of labelled data for this purpose [1].

Pre-trained word embeddings based on Twitter, Google News and Wikipedia already exist, and researchers in different fields use them successfully. Aaron’s aim was to compare these with word embeddings created specifically from the MIMIC III, PubMed and PubMed Central datasets.

A word embedding assigns each word a vector of numbers, placing it on a map where words with similar meanings sit close together. This makes it easier for a computer to analyse text by its meaning in natural language processing.

For example, “good,” “great” and “excellent” would all be assigned similar vectors. Aaron hoped that by using medical data, word embedding could more accurately map rare words such as the names of health conditions and medicines.
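
To make "similar vectors" concrete, here is a minimal sketch using cosine similarity, a common way to measure how close two word vectors are. The three-dimensional vectors and the helper names are made up for illustration; real embeddings usually have hundreds of dimensions, and the project's own code is not shown in this article.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors, invented for this example only.
toy_embeddings = {
    "good": [0.9, 0.8, 0.1],
    "great": [0.8, 0.9, 0.2],
    "excellent": [0.85, 0.85, 0.15],
    "aspirin": [0.1, 0.2, 0.9],
}


def most_similar(word, embeddings):
    """Rank every other word by cosine similarity to `word`, closest first."""
    target = embeddings[word]
    scores = [
        (other, cosine_similarity(target, vector))
        for other, vector in embeddings.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

With these toy values, calling most_similar("good", toy_embeddings) ranks "excellent" and "great" well ahead of "aspirin", which is the behaviour a good embedding should show for words with similar meanings.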

Aaron created word embeddings using three popular embedding algorithms: word2vec, GloVe and fastText. He also tested different types of pre-processing, such as converting text to lowercase and removing punctuation, symbols and numbers.
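
The pre-processing options described above might look something like the sketch below. The function name, its flags and the sample sentence are assumptions made for illustration, not the project's actual pipeline.

```python
import re


def preprocess(text, lowercase=True, strip_numbers=True, strip_punctuation=True):
    """Clean a piece of text and split it into tokens on whitespace."""
    if lowercase:
        text = text.lower()
    if strip_numbers:
        # Replace every run of digits with a space.
        text = re.sub(r"\d+", " ", text)
    if strip_punctuation:
        # Replace anything that is not a letter, digit, underscore or whitespace.
        text = re.sub(r"[^\w\s]", " ", text)
    return text.split()


# Hypothetical clinical-style sentence, invented for this example.
sample = "Pt. given 500mg Paracetamol, BP 120/80."
tokens = preprocess(sample)
```

Switching the flags on and off gives the different variants to compare. With everything enabled, the sample becomes the tokens pt, given, mg, paracetamol and bp; with everything disabled, the original casing, numbers and punctuation are kept.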

Each model took between 8 and 12 hours to train, and was tested five times to assess its accuracy.
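
One common way to summarise repeated test runs is to report the average score and how much it varies. The sketch below assumes that approach; the article does not say how the five results were combined, and the scores shown are made up.

```python
import statistics


def summarise_runs(scores):
    """Return the mean and sample standard deviation of repeated run scores."""
    return statistics.mean(scores), statistics.stdev(scores)


# Hypothetical scores from five test runs, invented for this example.
example_scores = [0.90, 0.92, 0.91, 0.89, 0.93]
mean_score, spread = summarise_runs(example_scores)
```

A small spread suggests the model's result is stable across runs rather than a lucky one-off.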

The results suggest clinical word embeddings are more accurate at clinical tasks than general-purpose pre-trained word embeddings. This was likely because the Wikipedia, Twitter and Google News datasets lacked some medical vocabulary and mapped the medical terms they did contain less accurately. The Wikipedia dataset performed the best of the three pre-trained models, with an F1 score of 89.2%.
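
For readers unfamiliar with the two ideas in this paragraph, the sketch below shows how an F1 score and a missing-vocabulary rate can be calculated. The function names and the sample data are assumptions for illustration; the article does not describe the project's exact evaluation code.

```python
def f1_score(gold, predicted):
    """F1 score: the harmonic mean of precision and recall over labelled items."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)  # share of predictions that are right
    recall = true_positives / len(gold)          # share of true items that were found
    return 2 * precision * recall / (precision + recall)


def oov_rate(tokens, vocabulary):
    """Fraction of tokens that have no entry in the embedding's vocabulary."""
    if not tokens:
        return 0.0
    missing = sum(1 for token in tokens if token not in vocabulary)
    return missing / len(tokens)


# Hypothetical (position, label) pairs, invented for this example.
gold_entities = {(0, "DRUG"), (3, "CONDITION"), (7, "DRUG")}
predicted_entities = {(0, "DRUG"), (3, "CONDITION"), (5, "DRUG")}
example_f1 = f1_score(gold_entities, predicted_entities)

# Hypothetical tokens and a general-purpose vocabulary missing two medical terms.
example_oov = oov_rate(
    ["patient", "given", "warfarin", "for", "afib"],
    {"patient", "given", "for"},
)
```

In this toy case two of the three predictions are correct and two of the three true items are found, so the F1 score is two thirds, and two of the five tokens are missing from the vocabulary.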

The MIMIC III dataset produced the best-performing model, with an accuracy of 90.5%. This was likely because MIMIC III had the most information of the three medical datasets. Fine-tuning the pre-processing for the algorithm increased the accuracy to 91.9%.

Future work in this area could include using different datasets, and training algorithms that recognise context – such as knowing the difference between “apple” the fruit and “Apple” the company based on other words in the sentence.

Aaron Zhang is one of 10 students who took part in the Summer of Research programme funded by Precision Driven Health. This research is at an early “proof of concept” stage. The projects offer fresh insights into what healthcare will look like when precision medicine is widely used.

  1. Travers Ching, Daniel S Himmelstein, Brett K Beaulieu-Jones, Alexandr A Kalinin, Brian T Do, Gregory P Way, Enrico Ferrero, Paul-Michael Agapow, Michael Zietz, Michael M Hoffman, et al. Opportunities and obstacles for deep learning in biology and medicine. Journal of The Royal Society Interface, 15(141):20170387, 2018.