What We Do

Machine Learning

ArcTEX is our sophisticated NLP model designed to cater to a diverse array of clinical reports. Our innovative solution empowers real-world evidence studies by seamlessly automating the extraction of biomarkers and other disease-specific data at scale.


ArcTEX Overview and Introduction

Data enrichment using unstructured clinical reports

Numerous studies have demonstrated that a vast reservoir of clinical insight is hidden within unstructured text, such as clinical letters or pathology reports (1). As a result, a large volume of data is unavailable for direct analysis. At Arcturis, our team of researchers has pioneered the development of the ArcTEX (Arcturis Text Enrichment and Extraction) model. ArcTEX is a flexible Natural Language Processing (NLP) framework, engineered to systematically extract biomarker and disease-specific data from unstructured clinical reports. Notably, ArcTEX stands out for its versatility and can easily be fine-tuned to meet diverse project needs. Moreover, our model has been optimised to underpin high-quality real-world evidence (RWE) initiatives, ensuring robust and reliable outcomes across multiple disease areas.

Introduction

Many leading pharmaceutical companies enhance their clinical development and post-launch strategies by integrating real-world data. The approaches involved range from retrospective cohort studies to the optimisation of patient selection criteria and the incorporation of external control arms. However, a significant hurdle is that a substantial portion of the healthcare data required for these initiatives resides in unstructured text, which prevents direct analysis.

Consider, for instance, critical biomarker statuses such as human epidermal growth factor receptor 2 (HER2) or oestrogen and progesterone receptors (ER or PR). These biomarkers have a profound influence on the treatment trajectory for breast cancer patients. This crucial information is often embedded within unstructured pathology reports, which vary in style and content across hospital sites and pathologists. The same applies to details such as ‘response to treatment’ or ‘disease progression’, which further illustrate how broadly unstructured data can inform treatment pathways.

Unlocking these insights is a formidable yet essential challenge in advancing pharmaceutical research and patient care. To meet the evolving needs of these communities, Arcturis are proud to introduce ArcTEX, a sophisticated NLP model designed to cater to a diverse array of clinical reports. Our innovative solution empowers real-world evidence studies by automating the extraction of biomarkers and other disease-specific data at scale. ArcTEX stands as a testament to our commitment to advancing healthcare research through cutting-edge technology, ensuring that high-quality data is at the forefront of evidence-based decision-making.

Case Study 1 - Extraction of oncology markers

Our approach to building the model draws on recent developments in machine learning and natural language processing, utilising transformer-based language models. In contrast to other large language models (LLMs), ArcTEX does not suffer from hallucinations and provides a confidence score for each extracted value. The model is also optimised on UK-specific data for a range of clinical markers. Further optimisation is possible through our iterative optimisation and validation framework, which also provides insight into the robustness of ArcTEX.

As an example, the figure below shows the results of ArcTEX on the detection and extraction of information across a wide range of markers. With an average accuracy of 98.3%, ArcTEX can extract multiple markers from unstructured free-text reports, covering both histological and genetic reports. Additionally, ArcTEX can extract both numeric and text-based values where the two are used interchangeably, as with oestrogen or progesterone receptors, amongst others. This versatility makes ArcTEX well suited to marker extraction from free-text reports, with accuracy comparable to or greater than that of a human annotator (2) and at a greatly increased rate of extraction.

Case Study 2 - Model Comparison

In comparison to other available transformer-based language models, such as RoBERTa and BioBERT, and large language models (LLMs) such as Meta’s Llama 2 and Llama 3, ArcTEX shows superior performance in marker extraction. In the figure below, the average overall accuracies of the aforementioned models and ArcTEX are shown on the left, with performance across specific markers on the right. In addition to greater overall performance, ArcTEX also provides insight into its own performance. Unlike LLMs, ArcTEX generates a score for each evaluated free-text report reflecting its confidence that it has identified the correct value. These confidence scores make it possible to automatically identify reports at risk of misclassification. Such reports can be either flagged for manual review or excluded from the analysis, increasing overall accuracy.
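
The confidence-based triage described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Extraction` record, the 0 to 1 confidence scale, the 0.90 threshold, and the toy data are hypothetical and do not represent ArcTEX's actual interface or output.

```python
from dataclasses import dataclass


@dataclass
class Extraction:
    report_id: str
    value: str         # extracted marker value, e.g. "positive"
    confidence: float  # hypothetical confidence score in [0, 1]
    truth: str         # reference label from manual annotation


def accuracy(items):
    """Share of extractions whose value matches the reference label."""
    return sum(e.value == e.truth for e in items) / len(items)


def triage(items, threshold):
    """Split extractions into auto-accepted and flagged-for-review lists."""
    accepted = [e for e in items if e.confidence >= threshold]
    flagged = [e for e in items if e.confidence < threshold]
    return accepted, flagged


# Invented toy data for illustration only; not ArcTEX output.
results = [
    Extraction("r1", "positive", 0.99, "positive"),
    Extraction("r2", "negative", 0.97, "negative"),
    Extraction("r3", "positive", 0.55, "negative"),  # low confidence, wrong
    Extraction("r4", "negative", 0.93, "negative"),
    Extraction("r5", "positive", 0.62, "positive"),  # low confidence, right
]

accepted, flagged = triage(results, threshold=0.90)
print(f"accuracy, all reports:   {accuracy(results):.2f}")
print(f"accuracy, auto-accepted: {accuracy(accepted):.2f}")
print("flagged for review:", [e.report_id for e in flagged])
```

In this toy example the threshold removes the one misclassified report along with one correct but uncertain report, which shows the trade-off: a stricter threshold raises accuracy on the retained set but sends more reports to manual review.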

Compared to other LLMs, ArcTEX offers several scientific and practical advantages that make it well placed for a range of RWE studies:

High accuracy: the model is optimised to extract biomarker and disease-specific data from clinical free-text reports, which makes it superior to many generic LLMs. The granular, otherwise unavailable data that can be extracted and analysed with our model is key to generating useful, scientifically robust, country-specific epidemiological data to speed up reimbursement and patient access.
High flexibility: the optimisation framework can be used to further optimise ArcTEX for other biomarkers within hours, depending on project requirements, which is beneficial from both a cost and an efficiency perspective.
Transparency: ArcTEX provides confidence scores for each evaluated report, allowing quick identification of reports that might require manual review. Furthermore, the optimisation framework provides insight into the accuracy and variability of the data extraction, which can be used as evidence of the robustness of the methodology in support of studies.
Low resource requirements: our approach does not require expensive compute infrastructure such as GPUs. ArcTEX can be run on a standard computer, which increases the flexibility to deploy our data enrichment tool in different hospital environments.

Case Study 3 - Leveraging Clinical Insights for Enhanced NLP Performance

Mismatch Repair (MMR) proteins (MLH1, MSH2, MSH6, and PMS2) play a crucial role in correcting DNA replication errors. Mutations in the genes encoding these proteins can lead to MMR deficiency (dMMR), contributing to genomic instability and impacting treatment decisions in cancers such as endometrial and colorectal cancer. However, identifying patients with dMMR from real-world data is challenging due to the variability in how MMR and its four submarkers are documented in free-text pathology reports. Reporting practices vary significantly across pathologists, hospitals, and healthcare systems. Some reports provide detailed information on each individual protein, others simply state the overall MMR status as deficient (dMMR) or proficient (pMMR), and some use a combination of these approaches.

Naïve methods, such as using basic search techniques like regular expressions, often struggle to effectively capture this variability, resulting in a limited number of patients being correctly identified for analysis.
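
To make the limitation concrete, the sketch below applies a deliberately simple regular expression to a few invented sentences. The pattern and the example wording are assumptions chosen for illustration; they are not the naïve baseline used in any Arcturis analysis, and real report wording varies far more widely.

```python
import re

# Hypothetical naive pattern: matches only an explicitly stated overall status.
NAIVE = re.compile(
    r"\b(?:dMMR|pMMR|MMR[\s-]+(?:deficient|proficient))\b",
    re.IGNORECASE,
)

# Invented example sentences for illustration only.
reports = [
    "Tumour is MMR deficient (dMMR).",
    "MLH1 and PMS2: loss of nuclear expression; MSH2 and MSH6 retained.",
    "Mismatch repair protein expression is intact.",
    "Overall status: pMMR.",
]

hits = [bool(NAIVE.search(text)) for text in reports]
for found, text in zip(hits, reports):
    print(f"{'match' if found else 'miss':5}  {text}")
print(f"{sum(hits)} of {len(reports)} reports matched")
```

The second and third sentences carry usable MMR information but are missed, because one reports only the individual proteins and the other spells out the status in different words. Extending the pattern to cover each new phrasing quickly becomes hard to maintain.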

In a recent analysis of an oncology dataset within the Arcturis real-world data, ArcTEX identified 20% more reports with actionable MMR status than a naïve implementation, by considering both the overall MMR status and all four submarkers. As shown below, ArcTEX identified 1,606 reports containing MMR information, compared to 1,336 reports using a naïve approach. Additionally, ArcTEX can filter out reports where clinical features are mentioned in a non-actionable context, such as “not applicable,” “requested,” or “to follow.”

Of these 1,606 reports, 51.9% contained both the overall MMR status and details about individual submarkers, 31.4% mentioned only the overall MMR status, and 16.6% listed the individual submarkers without specifying the overall status. This variability highlights the complexities of MMR reporting, even within a single dataset.

For reports that include both the overall MMR status and individual submarkers, ArcTEX further enhances data quality by cross-validating this information. As illustrated in the figure above, there is 97.8% agreement between the explicitly stated MMR status and the status inferred from the individual submarkers when both are available. The remaining 2.2% of reports, where the two disagree, are flagged for manual review. This process allows for a more efficient and accurate analysis of free-text oncology data, ultimately enhancing data reliability and providing deeper insights into oncology research.
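
A cross-validation step of this kind could be sketched as follows. The inference rule is an assumption, not a description of ArcTEX internals: loss of any of the four proteins is treated as dMMR, and retention of all four as pMMR. The function names and the "lost"/"retained" vocabulary are likewise hypothetical.

```python
MMR_PROTEINS = ("MLH1", "MSH2", "MSH6", "PMS2")


def infer_mmr_status(submarkers):
    """Infer overall MMR status from per-protein results.

    Assumed rule (not confirmed by the source): loss of any protein
    implies dMMR; all four retained implies pMMR; otherwise None.
    """
    values = [submarkers.get(protein) for protein in MMR_PROTEINS]
    if any(v == "lost" for v in values):
        return "dMMR"
    if all(v == "retained" for v in values):
        return "pMMR"
    return None  # incomplete panel with no loss: status cannot be inferred


def cross_validate(stated, submarkers):
    """Return 'agree', 'discrepancy' (flag for review) or 'not comparable'."""
    inferred = infer_mmr_status(submarkers)
    if stated is None or inferred is None:
        return "not comparable"
    return "agree" if stated == inferred else "discrepancy"


# Invented example reports for illustration only.
report_a = {"MLH1": "lost", "MSH2": "retained", "MSH6": "retained", "PMS2": "lost"}
report_b = dict.fromkeys(MMR_PROTEINS, "retained")

print(cross_validate("dMMR", report_a))  # stated and inferred both dMMR
print(cross_validate("dMMR", report_b))  # stated dMMR, inferred pMMR
```

Reports returning "discrepancy" correspond to the cases that would be sent for manual review, while "not comparable" covers reports that lack either the stated status or a complete submarker panel.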

Conclusion

In summary, ArcTEX emerges as a pinnacle of innovation in the realm of NLP, offering flexibility and precision in the extraction of clinical features and supplementary disease information from an expansive spectrum of unstructured clinical text. Thanks to its capacity to quickly evaluate a vast volume of reports with high accuracy, together with its integrated robustness metrics and confidence scores, it is an indispensable asset for high-quality RWE generation.

Opportunities for RWE generation exist throughout the entire development cycle, from informing internal strategy (e.g. early-phase target product profile development) through to late-stage external control arms, which play a crucial role in regulatory decision-making. ArcTEX can be utilised across this entire cycle to unlock the true potential of unstructured data and overcome barriers to analysis, ensuring robust and reliable outcomes for a variety of stakeholders across multiple disease areas.

References

(1) Hyoun-Joong Kong, Managing Unstructured Big Data in Healthcare System, Healthcare Informatics Research, 2019

(2) Marie-Pier Gauthier et al., Automating Access to Real-World Evidence, JTO Clinical and Research Reports, 2022
