1. Introduction
The radiology report is an invaluable tool used by radiologists to communicate high-level insights and analyses of medical imaging investigations. It is common practice to organize such an analysis into specific sections, documenting the key information taken into account to determine the final impression/opinion [1]. Analyzing this report information is important for medical image data analysis, particularly for the automated labeling of large datasets in machine learning computer vision. Unfortunately, computers cannot interpret and categorize raw text directly, and it is infeasible to manually label a large radiology corpus that can contain billions of words. The ability to automatically extract this imaging information from free-text radiology reports is therefore highly desirable, and many researchers use natural language processing (NLP) techniques to do so. In a systematic review by Casey et al. [2], NLP techniques are used for patient or disease surveillance, identifying disease information for classification systems, language analysis optimized to facilitate clinical decision support, quality assurance, and epidemiology cohort building. In clinical breast cancer management and screening, this could include the surveillance of benign-appearing lesions over time to determine whether biopsy is needed [3], or the investigation of diagnostic utilization and yield to determine hospital resource allocation [4].
A prime opportunity for NLP applications exists in the breast radiology reports of mammograms, ultrasounds, magnetic resonance imaging (MRI) exams, and biopsy reports. This information is organized into designated sections to keep reports clear and concise. The criteria and organization of this reporting system were first formalized in the 1980s by the American College of Radiology in the Breast Imaging Reporting and Data System (BI-RADS) [5]. In a breast radiology report, many important health indicators, including menopausal status and history of cancer, are recorded together with the purpose of the exam in a section typically called the clinical indication (Cl. Ind.). These details indicate whether the exam is a routine screening or a diagnostic investigation of an abnormality. Imaging findings include the presence of lesions, breast density, and background parenchymal enhancement (BPE) (specifically in breast MRI). These health indicators and imaging findings can be very useful for patient care, treatment management, and research, such as large-scale epidemiology studies. For example, breast density and BPE are factors of interest in breast cancer risk prediction [6,7]. Breast density is the ratio of radiopaque tissue to radiolucent tissue in a mammogram, or the ratio of fibroglandular tissue to fat tissue in an MRI, while BPE is the level of healthy fibroglandular tissue enhancement during dynamic contrast-enhanced breast MRI. Both of these factors have been shown to be associated with the incidence of breast cancer.
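As a concrete illustration of the ratio definition above, the following minimal sketch (not from the paper; the binary masks and helper function are hypothetical) computes percent breast density as the fraction of dense pixels within the breast region:

```python
# Hypothetical sketch: percent breast density from binary segmentation masks,
# mirroring the ratio definition above. The masks are illustrative assumptions.
import numpy as np

def percent_density(dense_mask: np.ndarray, breast_mask: np.ndarray) -> float:
    """Ratio of fibroglandular (dense) pixels to total breast pixels, in %."""
    dense = np.count_nonzero(dense_mask & breast_mask)
    total = np.count_nonzero(breast_mask)
    return 100.0 * dense / total if total else 0.0

# Example: a 4-pixel breast region with 1 dense pixel -> 25% density.
breast = np.array([[1, 1], [1, 1]], dtype=bool)
dense = np.array([[1, 0], [0, 0]], dtype=bool)
print(percent_density(dense, breast))  # 25.0
```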
Recent advancements in NLP models, notably the bi-directional encoder representations from transformers (BERT) model developed in 2018 by Google [8], have resulted in significant performance improvements over classic linguistic rule-based techniques and word-to-vector algorithms for many NLP tasks. Devlin et al. showed that BERT is able to outperform all previous contextual embedding models at text sequence classification and question answering. BERT techniques were swiftly adopted by medical researchers to build their own contextual embeddings trained specifically for clinical free-text reports, such as BioBERT [9] and BioClinical BERT [10], showing the importance of a domain-specific contextual embedding.
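For readers unfamiliar with these models, the sketch below shows how such a domain-specific embedding is typically loaded and paired with a classification head using the HuggingFace Transformers library; the checkpoint name refers to the publicly released BioClinical BERT, while the label count and example sentence are illustrative assumptions:

```python
# Minimal sketch: load a domain-specific BERT checkpoint and attach a
# (randomly initialized) classification head for fine-tuning.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "emilyalsentzer/Bio_ClinicalBERT"  # public clinical checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_ID, num_labels=2  # e.g., screening vs. diagnostic (assumed labels)
)

inputs = tokenizer(
    "Screening mammogram, no prior history of malignancy.",
    return_tensors="pt", truncation=True,
)
logits = model(**inputs).logits  # head is untrained: fine-tuning is required
```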
Utilizing report section organization to improve health indicator field extraction is growing in popularity [11,12,13]. The BI-RADS lexicon includes a logically structured flow of sections for the title of the examination, patient history, previous imaging comparisons, technique and procedure notes, findings, impressions/opinions, and an overall exam assessment category [5]. Since this practice is so well documented and followed diligently by breast radiologists, it is an ideal dataset with which to determine whether automatically structuring free-text radiology reports into sections improves health indicator field extraction. We hypothesize that a specialized BERT embedding trained on breast radiology reports and fine-tuned for section segmentation and field extraction, applied in sequence, will outperform the classic BERT embedding fine-tuned on field extraction alone.
With this project, we built a new contextual embedding with the BERT architecture, called BI-RADS BERT. Our data was collected from the Sunnybrook Health Sciences Centre’s medical record archives, with research ethics approval, comprising 180,000 free-text reports on mammography, ultrasound, MRI, and image-guided biopsy procedures performed between 2005 and 2020. Additionally, all pathological findings from image-guided biopsy procedures were appended to the corresponding imaging reports as addenda. We pre-trained our model using masked language modeling (MLM) on a corpus of 25 million words, and then fine-tuned the model on free-text section-segmented reports to divide reports into sections. In our exploration, we found it beneficial to use the contextual embedding in conjunction with auxiliary data (BI-RADS BERTwAux) to better capture the global report context in the section segmentation task. Then, with the section of interest in a report identified, we fine-tuned downstream classifiers to identify the imaging modality, the purpose of the exam, mention of previous cancer, menopausal status, density category, and BPE category.
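A minimal sketch of this sequential pipeline is given below; the local checkpoint paths, label maps, and target field are hypothetical placeholders, and the sketch illustrates the idea rather than our exact implementation:

```python
# Hypothetical two-stage inference: (1) assign each sentence to a BI-RADS
# section, (2) run a field classifier only on the section of interest.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

SECTIONS = ["Title", "History", "Comparison", "Technique",
            "Findings", "Impression", "Assessment"]

tok = AutoTokenizer.from_pretrained("./birads-bert")        # assumed local checkpoint
seg = AutoModelForSequenceClassification.from_pretrained(
    "./birads-bert-section", num_labels=len(SECTIONS))      # section segmenter
field = AutoModelForSequenceClassification.from_pretrained(
    "./birads-bert-menopause", num_labels=2)                # e.g., pre/post

def extract(report_sentences, target_section="History"):
    # Stage 1: classify every sentence into a BI-RADS section.
    enc = tok(report_sentences, return_tensors="pt",
              padding=True, truncation=True, max_length=32)
    with torch.no_grad():
        sec_ids = seg(**enc).logits.argmax(-1).tolist()
    kept = [s for s, i in zip(report_sentences, sec_ids)
            if SECTIONS[i] == target_section]
    if not kept:
        return None
    # Stage 2: extract the field from the section of interest only.
    enc = tok(" ".join(kept), return_tensors="pt",
              truncation=True, max_length=32)
    with torch.no_grad():
        return field(**enc).logits.argmax(-1).item()
```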
5. Discussion
This report has presented the application of a BERT embedding for report section segmentation and field extraction in breast radiology reports. With different implementations and a specialized BI-RADS BERT contextual embedding pre-trained on a large corpus of breast radiology reports, we have shown that a BERT model can be effective at splitting a report’s sentences into the specific sections described in the BI-RADS lexicon; then, within those report sections, it can identify pertinent patient information and findings, such as the modality used/procedure performed, a record of previous breast malignancy, the purpose of the exam (diagnostic or screening), the patient’s menopausal status, breast density category, and BPE category (specifically in breast MRI).
The improved accuracy would not have been possible without structured reporting in radiology [1] and the BI-RADS lexicon [5]. The section structure from the American College of Radiology’s handbook for residents instructs the radiologist to write reports as a scientific report responding to the requesting clinician’s inquiry. When identifying information of interest in a free-text report, focusing on a single section via section segmentation offers the advantage of not having to search through unnecessary details for the answer. This advantage of the BI-RADS BERT model makes it more desirable than previous methods.
It is important to note that these results support the findings of Lee et al. [9] and Alsentzer et al. [10], namely, that having a specialized BERT contextual embedding in your domain gives an advantage when performing NLP tasks. Here, we have shown that breast radiology imaging reports also have a distinct style and terminology which may not appear in English text corpora, web-based corpora, biological research paper corpora, or intensive care unit reporting corpora. This improvement may be explained by the process of training the embeddings from scratch and creating a specialized tokenizer that understands phrases common to the domain [28]. For example, we found that the word "mammogram" was split up differently depending on which embedding’s WordPiece tokenizer was used. This example is shown in Table 6. The classic BERT WordPiece tokenizer splits "mammogram" into four parts, while BioClinical BERT splits it into three. Our specialized BI-RADS WordPiece tokenizer keeps "mammogram" as a single token, as it is the most commonly used breast imaging modality; this makes the embedding more efficient at identifying such important concepts as a whole, as opposed to a combination of sequential word pieces.
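This comparison is straightforward to reproduce; the sketch below tokenizes "mammogram" with each WordPiece tokenizer (the local path for our BI-RADS tokenizer is a hypothetical placeholder):

```python
# Tokenize "mammogram" with each embedding's WordPiece tokenizer,
# reproducing the Table 6 comparison.
from transformers import AutoTokenizer

for name in ["bert-base-uncased",                # classic BERT
             "emilyalsentzer/Bio_ClinicalBERT",  # BioClinical BERT
             "./birads-bert"]:                   # our tokenizer (assumed path)
    pieces = AutoTokenizer.from_pretrained(name).tokenize("mammogram")
    print(name, pieces)
# Per Table 6: four pieces, three pieces, and one piece, respectively.
```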
Furthermore, a specialized WordPiece tokenizer offers an advantage in capturing text data in shorter sequences that contain more domain-specific information. Radiologists are taught to keep reporting concise [1], leading to many short sentences and statements that directly correspond to the concept the radiologist is reporting. This lower sequence length, in general, seems to result in higher performance across all the tasks (as seen in Appendix A). Even when pre-training the embedding with MLM, we trained with an input sequence length of 32, which still outperformed classic BERT and BioClinical BERT trained on sequence lengths of 128 and then 512 for the final 10% of the iteration steps. Therefore, by using a smaller sequence length, the embeddings are more precise and can extract information more efficiently than when using longer sequences.
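For illustration, MLM pre-training with a short input sequence length might look like the following sketch using HuggingFace Transformers; the file paths, batch size, and other hyperparameters are assumptions rather than our exact training configuration:

```python
# Hedged sketch: MLM pre-training from scratch with max sequence length 32.
from datasets import load_dataset
from transformers import (BertConfig, BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = BertTokenizerFast.from_pretrained("./birads-tokenizer")  # assumed custom WordPiece
model = BertForMaskedLM(BertConfig(vocab_size=tokenizer.vocab_size))  # random init, no warm start

ds = load_dataset("text", data_files={"train": "reports.txt"})["train"]  # assumed corpus file
ds = ds.map(lambda b: tokenizer(b["text"], truncation=True, max_length=32),
            batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments("mlm-out", per_device_train_batch_size=256),
    train_dataset=ds,
    # Standard BERT masking: 15% of tokens are masked for prediction.
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                                  mlm_probability=0.15),
)
trainer.train()
```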
The major limitations of our project are as follows. First, we had a limited dataset: a single institutional cohort of reports was used to build the corpus, the majority of them being MRI reports. Further validation on external datasets is necessary to assess generalizability; however, at present, public datasets do not exist for this specialized task. By publishing our code and embeddings, we hope to make it possible for other researchers to validate this pipeline on their own private datasets. Secondly, we chose to train the BI-RADS BERT embeddings from scratch in order to build a custom BERT embedding specialized in the BI-RADS vocabulary, so the BERT embeddings were not initialized from a previous BERT embedding. Previous work suggests that double pre-training on varying datasets is highly efficient [10]; therefore, further analysis of the gains and losses from this implementation trade-off is needed. Thirdly, we did not evaluate the field extraction models on sections other than those used during training. In some cases, information generally found in the history/Cl. Ind. section appears instead in the findings or impression sections, and identifying previous cancer, menopausal status, or the purpose of the exam may be possible by looking in another report section. Our fine-tuning dataset was built to exclude such discrepancies and is therefore not appropriate for evaluating this, so further work is necessary.
Domain shift is an ongoing research problem in radiology report analysis, as reporting styles change over the years. For example, the BI-RADS lexicon is currently in its fifth edition (released in 2013), and it is possible that reports generated under the fourth edition, released in 2003, differ significantly. Our dataset spans a 15-year period, and the majority of reports were generated using the latest edition. However, it is possible that using the exam date as an auxiliary data feature could improve field extraction or section segmentation.
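As a speculative illustration of this idea, the exam date could be appended to the pooled BERT representation in the same spirit as BI-RADS BERTwAux; the layer sizes and date normalization below are assumptions, not our implementation:

```python
# Speculative sketch: concatenate a normalized exam-year scalar with the
# pooled BERT output before classification (hypothetical architecture).
import torch
import torch.nn as nn
from transformers import BertModel

class BertWithExamDate(nn.Module):
    def __init__(self, checkpoint: str, num_labels: int):
        super().__init__()
        self.bert = BertModel.from_pretrained(checkpoint)
        hidden = self.bert.config.hidden_size
        self.head = nn.Linear(hidden + 1, num_labels)  # +1 for the date scalar

    def forward(self, input_ids, attention_mask, exam_year):
        cls = self.bert(input_ids=input_ids,
                        attention_mask=attention_mask).pooler_output
        # Normalize year to roughly [0, 1] over the 2005-2020 span.
        year = (exam_year.float().unsqueeze(-1) - 2005.0) / 15.0
        return self.head(torch.cat([cls, year], dim=-1))
```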