Large-scale clinical text processing and process optimization

Olga V. Patterson a,b, Scott L. DuVall a-d

a VA Salt Lake City Health Care System, Salt Lake City, UT, USA

b Department of Internal Medicine, University of Utah School of Medicine, UT, USA

c Pharmacotherapy Outcomes Research Center, University of Utah School of Pharmacy, UT, USA

d Department of Radiology, University of Utah School of Medicine, UT, USA



This tutorial outlines the benefits and challenges of processing large volumes of clinical text with natural language processing (NLP). As NLP becomes more widely available and able to tackle more complex problems, the ability to scale to millions of clinical notes must be considered. The Department of Veterans Affairs (VA) maintains more than 2 billion clinical notes and has developed NLP libraries capable of handling projects of that scale. Participants will be introduced to existing tools and resources for large-scale NLP tasks, including Unstructured Information Management Architecture Asynchronous Scaleout (UIMA AS), the Framework Launching Application (FLAP) NLP libraries, and the JMX Analysis Module (JAM) monitoring tool. Methods of computational performance analysis will be described, and process-optimization solutions will be demonstrated. Participants will be introduced to the goals, tasks, and methods of text processing, and the characteristics that distinguish clinical text from general language and complicate its processing will be outlined. Participants will walk through a scenario of creating and launching an asynchronous NLP pipeline, monitoring it for performance metrics, identifying bottlenecks, and redeploying the pipeline with an optimal configuration. The tutorial is taught by two instructors experienced as researchers, developers, and users of NLP tools for large-scale clinical projects. It is geared towards hands-on learning, and the instructors will interact with attendees to facilitate the learning experience more effectively.
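In UIMA AS, the scale-out behavior exercised in the deployment scenario above is declared in a deployment descriptor rather than in application code. The sketch below is a minimal, hypothetical descriptor: the queue name, broker URL, descriptor path, and instance counts are illustrative placeholders, not values used in the tutorial.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<analysisEngineDeploymentDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <name>Example NLP Service</name>
  <description>Hypothetical scale-out of a clinical NLP pipeline.</description>
  <deployment protocol="jms" provider="activemq">
    <!-- CASes available to the service at any one time. -->
    <casPool numberOfCASes="8"/>
    <service>
      <!-- JMS queue the service consumes from; broker URL is a placeholder. -->
      <inputQueue endpoint="exampleNlpQueue" brokerURL="tcp://localhost:61616"/>
      <!-- Analysis engine descriptor for the pipeline being deployed. -->
      <topDescriptor>
        <import location="desc/ExamplePipeline.xml"/>
      </topDescriptor>
      <analysisEngine>
        <!-- Replicated pipeline instances processing the queue in parallel. -->
        <scaleout numberOfInstances="4"/>
      </analysisEngine>
    </service>
  </deployment>
</analysisEngineDeploymentDescription>
```

Raising `numberOfInstances` or `numberOfCASes` is the kind of reconfiguration step the tutorial's optimization scenario revisits after bottlenecks are identified.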


Outline of topics

This tutorial introduces the challenges of large data processing and describes a set of tools and approaches that can be employed for clinical text processing. Main topics include:

  • clinical NLP overview,

  • challenges of large-scale NLP,

  • scalable design,

  • overview of tools,

  • scalable deployment,

  • system monitoring, and

  • system reconfiguration and optimization.
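The system-monitoring topic builds on JMX, the mechanism the JAM tool uses to observe running UIMA AS services. As a minimal standalone sketch (the class name is ours, and JAM itself reports service-level metrics such as queue depth that this local example does not), basic JVM metrics can be read from the platform MBean server:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.ThreadMXBean;

// Minimal local illustration of JMX-based monitoring: query the
// platform MBean server for heap usage and live thread count.
public class JmxSnapshot {

    // Bytes of heap currently in use by this JVM.
    public static long heapUsedBytes() {
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
        return mem.getHeapMemoryUsage().getUsed();
    }

    // Number of live threads, a simple proxy for pipeline activity.
    public static int liveThreadCount() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        return threads.getThreadCount();
    }

    public static void main(String[] args) {
        System.out.println("heap used (bytes): " + heapUsedBytes());
        System.out.println("live threads: " + liveThreadCount());
    }
}
```

Monitoring a remote UIMA AS service works the same way in principle, but connects to the service JVM's JMX port with a `JMXConnector` instead of the local MBean server.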


Learning objectives

At the end of the tutorial, participants will have:

  • Understanding of the challenges of large-scale NLP

  • Understanding of distributed processing architecture

  • Ability to deploy a data processing pipeline in a distributed environment

  • Familiarity with tools needed for process monitoring

  • Skills to recognize processing bottlenecks

  • Understanding of approaches for performance optimization


Targeted audience

This tutorial is intended for informaticians, application programmers in clinical settings, and researchers interested in implementing NLP tools for processing large datasets. Familiarity with high-performance and distributed computing systems is helpful, but not required. Familiarity with the Java programming language is also helpful.