Data-Driven Healthcare, IBM Research Africa, Johannesburg, South Africa

The National Cancer Registry (NCR) in South Africa plays a significant role in reporting nationwide cancer statistics and in raising global awareness of the massive impact of cancer. The government requires confirmed cancer cases to be reported to the NCR. Because processing is manual and the number of reports received each year keeps growing, cancer statistics lag considerably, which means the current extent of cancer in the country is not well understood. In addition, the unstructured free text of the reports must be processed to identify clinical information that could be important for public health planning. We present initial results from a deep learning approach to address this time lag. Deep learning is a family of machine-learning methods that has driven major advances in medical image recognition and speech processing. The deep learning system takes as input 2,000 de-identified breast cancer pathology reports provided by the NCR in collaboration with the University of the Witwatersrand Medical School. The pathology reports are first preprocessed with the TF-IDF (term frequency-inverse document frequency) method, which assigns each word a numerical weight reflecting how important it is to a document within the corpus, yielding a document-term matrix of TF-IDF weights. This high-dimensional matrix is then fed into an autoencoder, an unsupervised neural network that learns a compressed representation of its input, to obtain rich features that best represent the specific breast cancer topography and morphology. Unlike other approaches, our approach relies both on non-dictionary sources, such as empirical clinical knowledge extracted from the reports, and on dictionary sources, such as the 12,000 medical diagnoses available in the International Statistical Classification of Diseases and Related Health Problems (ICD-10). The output of the deep learning system can be used to automate the classification of reports into their corresponding topography and morphology. The system could also be used to build a visual analytics system to aid data exploration and trend analysis of the current state of cancer in South Africa.
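
The two preprocessing steps described above, TF-IDF weighting followed by an autoencoder, can be illustrated with a minimal sketch. It is not the system reported in this abstract. The four report strings are invented placeholders rather than NCR data, the function names `tfidf_matrix` and `train_autoencoder` are hypothetical, and the smoothed IDF formula, single tanh hidden layer, hidden size, learning rate, and epoch count are all assumptions chosen only to keep the example small and dependency-free.

```python
import math
import random
from collections import Counter

def tfidf_matrix(docs):
    """Return (vocabulary, rows): one L2-normalised TF-IDF row per document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed IDF (an assumed variant) so common terms keep a nonzero weight.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1
        row = [counts[t] / total * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])
    return vocab, rows

def train_autoencoder(rows, n_hidden=2, epochs=500, lr=0.1, seed=0):
    """Train a one-hidden-layer autoencoder with SGD; return (encode, losses)."""
    rng = random.Random(seed)
    n_in = len(rows[0])
    w_enc = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    w_dec = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_in)]

    def encode(x):
        # Compressed representation: tanh of a linear projection.
        return [math.tanh(sum(w * v for w, v in zip(w_row, x))) for w_row in w_enc]

    def decode(h):
        # Linear reconstruction of the TF-IDF vector from the hidden code.
        return [sum(w * v for w, v in zip(w_row, h)) for w_row in w_dec]

    losses = []
    for _ in range(epochs):
        total = 0.0
        for x in rows:
            h = encode(x)
            err = [yi - xi for yi, xi in zip(decode(h), x)]
            total += sum(e * e for e in err)
            # Gradient of half the squared error, passed back through tanh.
            grad_h = [sum(err[i] * w_dec[i][j] for i in range(n_in)) * (1 - h[j] ** 2)
                      for j in range(n_hidden)]
            for i in range(n_in):
                for j in range(n_hidden):
                    w_dec[i][j] -= lr * err[i] * h[j]
            for j in range(n_hidden):
                for k in range(n_in):
                    w_enc[j][k] -= lr * grad_h[j] * x[k]
        losses.append(total / len(rows))
    return encode, losses

# Invented placeholder texts, not real pathology reports.
reports = [
    "invasive ductal carcinoma left breast upper outer quadrant",
    "invasive lobular carcinoma right breast",
    "ductal carcinoma in situ left breast",
    "fibroadenoma right breast benign",
]
vocab, rows = tfidf_matrix(reports)
encode, losses = train_autoencoder(rows)
features = [encode(row) for row in rows]  # low-dimensional features per report
print(len(vocab), "terms compressed to", len(features[0]), "features per report")
```

In a pipeline of the kind described, the compressed `features` would then be passed to a classifier that assigns topography and morphology codes; that step is outside the scope of this sketch.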

Citation Format: Waheeda Banu Saib, Pavan Kumar, Geoffrey Siwo, Gciniwe Dlamini, Elvira Singh, Sue Candy, Michael Klipin. A deep learning approach for extracting clinically relevant information from pathology reports [abstract]. In: Proceedings of the AACR International Conference: New Frontiers in Cancer Research; 2017 Jan 18-22; Cape Town, South Africa. Philadelphia (PA): AACR; Cancer Res 2017;77(22 Suppl):Abstract nr A11.