A Natural Language Processing (NLP) Approach to Automate Patients’ Testimonials Analysis.

2022, 24 November


Authors: P. Hayat, C. Clémente, V. Martenot, M. Rollot


N° PCR172, 2022-11, ISPOR Europe 2022, Vienna, Austria
Value in Health, Volume 25, Issue 12S (December 2022)



Patients’ testimonials (e.g. posts on forums or responses to questionnaires) provide valuable insights for defining and characterizing patient-reported outcomes (PRO), quality of life and patients’ perspectives on disease symptoms. However, traditional NLP methods for the automated analysis of patients’ testimonials rely on word co-occurrence frequencies and are therefore not fit for purpose for such data, which consist of short texts with rare co-occurrences. Building on an efficient method based on semantic proximity that we recently introduced, our objective is to improve the post-processing of results and the scalability of the method.


First, testimonials are vectorized into embeddings with a pre-trained Sentence-BERT language model, which captures the meaning of the texts beyond simple word co-occurrence. To ease interpretation, the dimensionality of the embeddings is reduced to two using the UMAP algorithm. Agglomerative clustering is then performed on the reduced embeddings, with the number of clusters selected based on silhouette scores. As an addition to our previous work, the clustering dendrogram facilitates post-processing by automatically pre-selecting clusters that can be merged together or split into two subclusters. Each cluster is labeled with its most prevalent terms. Sentiment analysis is also performed to refine the tags and to check that the cluster definitions are relevant.
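
The clustering and post-processing steps described above can be illustrated with a minimal, dependency-free Python sketch. It is not the authors' implementation. The toy `texts` and 2D `coords` are hypothetical stand-ins for testimonials and their UMAP-reduced Sentence-BERT embeddings, average linkage is an assumed choice (the abstract does not name the linkage criterion), the stopword list is illustrative, and the helpers `next_merge` and `next_split` are only one possible reading of how a dendrogram can pre-select merge and split candidates. Sentiment analysis is left out.

```python
from collections import Counter
from itertools import combinations
from math import dist

# Hypothetical toy data: in the described pipeline, coords would come from
# Sentence-BERT embeddings reduced to two dimensions with UMAP.
texts = [
    "severe headache after treatment",
    "headache and nausea every morning",
    "constant headache since the new dose",
    "nurses were kind and helpful",
    "very kind staff at the clinic",
    "kind doctor listened to me",
]
coords = [(0.0, 0.0), (0.3, 0.0), (0.0, 0.5), (5.0, 5.0), (5.2, 5.0), (5.0, 5.6)]
STOPWORDS = {"and", "the", "at", "to", "me", "were", "very", "after", "since"}  # illustrative


def average_linkage(points, a, b):
    """Mean pairwise distance between clusters a and b (lists of point indices)."""
    return sum(dist(points[i], points[j]) for i in a for j in b) / (len(a) * len(b))


def agglomerate(points):
    """Bottom-up clustering; returns {k: partition into k clusters} for k = n..1.

    The sequence of partitions is the dendrogram, read level by level.
    """
    clusters = [[i] for i in range(len(points))]
    partitions = {len(clusters): clusters}
    while len(clusters) > 1:
        # Merge the two clusters with the smallest average linkage.
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: average_linkage(points, *pair))
        clusters = [c for c in clusters if c is not a and c is not b] + [sorted(a + b)]
        partitions[len(clusters)] = clusters
    return partitions


def silhouette(points, clusters):
    """Mean silhouette coefficient; points in singleton clusters score 0."""
    scores = []
    for cluster in clusters:
        for i in cluster:
            if len(cluster) == 1:
                scores.append(0.0)
                continue
            # a: mean distance to own cluster; b: mean distance to nearest other cluster.
            a = sum(dist(points[i], points[j]) for j in cluster if j != i) / (len(cluster) - 1)
            b = min(sum(dist(points[i], points[j]) for j in other) / len(other)
                    for other in clusters if other is not cluster)
            denom = max(a, b)
            scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


def next_merge(partitions, k):
    """Clusters at level k that the dendrogram merges to reach level k - 1."""
    return [c for c in partitions[k] if c not in partitions[k - 1]]


def next_split(partitions, k):
    """Cluster at level k that the dendrogram splits to reach level k + 1."""
    return [c for c in partitions[k] if c not in partitions[k + 1]][0]


def top_terms(texts, cluster, top_n=1):
    """Most frequent non-stopword terms among the texts of one cluster."""
    counts = Counter(w for i in cluster for w in texts[i].lower().split()
                     if w not in STOPWORDS)
    return [w for w, _ in counts.most_common(top_n)]


partitions = agglomerate(coords)
# The silhouette score is defined for 2 <= k <= n - 1 clusters.
best_k = max(range(2, len(coords)), key=lambda k: silhouette(coords, partitions[k]))
clusters = partitions[best_k]
labels = {tuple(c): top_terms(texts, c) for c in clusters}

print("selected number of clusters:", best_k)
for members, terms in labels.items():
    print(members, "->", terms)
print("split candidate:", next_split(partitions, best_k))
```

On real data, these steps would more plausibly rely on established libraries (for instance sentence-transformers for the embeddings, umap-learn for the reduction, and scikit-learn for agglomerative clustering and silhouette scores) rather than on hand-written loops. The sketch only shows how the silhouette criterion picks a level of the dendrogram and how neighbouring levels point to merge or split candidates for manual review.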


Tested on patients’ testimonials with an average length of 15 words, our method provides more consistent and interpretable topics than state-of-the-art approaches (e.g. latent Dirichlet allocation, non-negative matrix factorization). Compared to our previous work, the improved post-processing of the clustering makes the analysis pipeline much faster to execute and more scalable, without altering performance.


Our proposed method makes it possible to extract more consistent topics from a large volume of short texts in a more automated and less time-consuming way. It provides stronger insights into patients’ perceptions of a wide range of healthcare topics (side effects, treatment, symptoms…), paving the way for better PRO definitions and patient-centric evaluation, and striving for better adherence to treatments.