In medical research, structured data is crucial for reconstructing a patient’s history. However, some of this data is poorly or incompletely captured in hospital information systems. AliBERT, a pre-trained language model, makes it possible to generate structured data directly from medical reports, speeding up the costly, error-prone and time-consuming task of building databases from chart reviews.
Using nearly 700 cancer patient files annotated by domain experts, Quinten’s teams have trained AliBERT to recognize around ten key concepts in the follow-up of breast and lung cancer patients. These concepts can now be detected efficiently, with accuracy ranging from 80% to 95%, in several types of oncology reports (e.g., consultation reports, pathology reports, and reports from multidisciplinary team meetings, known in France as Réunion de Concertation Pluridisciplinaire (RCP)).
Until now, traditional methods such as regular expression search or non-specialized neural networks were unable to process and exploit these complex and technical documents. The AliBERT tool, specialized in biomedical language, now automatically extracts all this information from a large number of heterogeneous documents.
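To make the limitation of regular expression search concrete, here is a minimal sketch in Python. It is not the project's method or data: the pattern, the four report sentences and their labels are all invented for illustration, and the single concept shown (tumour stage) is only an assumed example of the kind of concept involved. The sketch shows how a fixed pattern misses wording it was not written for, and how an accuracy figure is computed from predictions and expert labels.

```python
import re

# Hypothetical pattern for one concept (tumour stage); invented for this example.
STAGE_PATTERN = re.compile(r"\bstage\s+(I{1,3}V?|IV)\b", re.IGNORECASE)


def regex_detects_stage(report: str) -> bool:
    """Return True if the fixed pattern finds a stage mention in the report."""
    return STAGE_PATTERN.search(report) is not None


def accuracy(predictions, labels):
    """Fraction of predictions that agree with the expert labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Invented sentences and labels (True = staging information is present).
reports = [
    "Invasive ductal carcinoma, stage II.",
    "Lung adenocarcinoma classified pT2 N1 M0.",
    "No evidence of malignancy on this biopsy.",
    "Tumour staged as IIIA after imaging.",
]
labels = [True, True, False, True]

predictions = [regex_detects_stage(r) for r in reports]
print(predictions)
print(accuracy(predictions, labels))
```

On these invented sentences the pattern only catches the literal "stage II" wording and misses the TNM notation and the "staged as IIIA" phrasing. A language model specialized in biomedical text is meant to handle exactly this kind of variation.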
A scientific publication is currently being written, reporting the state of the art and the progress achieved through this project. In the future, AliBERT will make it possible to reconstruct a patient’s care pathway from the medical report databases set up in health data warehouses. This makes it an instrumental solution for accelerating research in the fight against cancer.