AliBERT : the first pretrained language model for French biomedical text

2023, 31 May


quinteninfra

The thumbnail for the article ALIBERT: A PRETRAINED LANGUAGE MODEL FOR FRENCH BIOMEDICAL TEXT published in Hal Science

The paper “AliBERT : A pretrained language model for French biomedical text” was written in collaboration with Aman Berhe, Guillaume Draznieks, Vincent Martenot, Valentin Masdeu, Lucas Davy and Jean-Daniel Zucker.

The BERT architecture, which learns contextual representations from text documents, is mostly trained on general-domain English text resources.
Performance in other languages, especially on specialized topics that require deep domain knowledge and vocabulary, usually falls short of human-level standards.

This article presents AliBERT's design and compares different learning strategies. The model is trained with a regularized Unigram-based tokenizer built specifically for this purpose. You will learn about the objectives, methods, results and applications of this new pretrained language model, and understand why AliBERT is a solution for high-performance French biomedical natural language processing.
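To give a rough idea of what a regularized Unigram tokenizer does, here is a minimal, self-contained sketch in Python. It is not AliBERT's tokenizer: the vocabulary, the scores, the function names and the `alpha` smoothing parameter are all made up for illustration. A Unigram tokenizer assigns a score to every subword piece and picks the segmentation with the highest total score. The regularized variant instead samples among plausible segmentations during training, so the model sees several ways of splitting the same word.

```python
import math
import random

# Hypothetical toy vocabulary: subword piece -> log-score.
# These values are invented for illustration and are not normalized.
VOCAB = {
    "cardio": math.log(0.05),
    "logie": math.log(0.04),
    "cardiologie": math.log(0.001),
    "card": math.log(0.02),
    "io": math.log(0.02),
    "log": math.log(0.02),
    "ie": math.log(0.03),
}
# Single characters act as a fallback so every position is reachable.
for ch in "cardiologie":
    VOCAB.setdefault(ch, math.log(0.005))


def viterbi_segment(word, vocab):
    """Return the single highest-scoring segmentation (no regularization)."""
    n = len(word)
    best = [-math.inf] * (n + 1)  # best[i]: best score for word[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in vocab and best[start] + vocab[piece] > best[end]:
                best[end] = best[start] + vocab[piece]
                back[end] = start
    pieces, end = [], n
    while end > 0:
        pieces.append(word[back[end]:end])
        end = back[end]
    return pieces[::-1]


def sample_segment(word, vocab, alpha=0.5, rng=random):
    """Sample one segmentation with probability proportional to score**alpha.

    Assumes the word can be fully segmented with pieces from vocab.
    """
    n = len(word)
    # forward[i]: log of the summed tempered scores of all segmentations of word[:i]
    forward = [-math.inf] * (n + 1)
    forward[0] = 0.0
    for end in range(1, n + 1):
        terms = [forward[s] + alpha * vocab[word[s:end]]
                 for s in range(end)
                 if word[s:end] in vocab and forward[s] > -math.inf]
        if terms:
            m = max(terms)
            forward[end] = m + math.log(sum(math.exp(t - m) for t in terms))
    # Walk backwards, sampling the start of each final piece.
    pieces, end = [], n
    while end > 0:
        starts = [s for s in range(end)
                  if word[s:end] in vocab and forward[s] > -math.inf]
        weights = [math.exp(forward[s] + alpha * vocab[word[s:end]] - forward[end])
                   for s in starts]
        start = rng.choices(starts, weights=weights)[0]
        pieces.append(word[start:end])
        end = start
    return pieces[::-1]


best = viterbi_segment("cardiologie", VOCAB)
sampled = sample_segment("cardiologie", VOCAB, alpha=0.5, rng=random.Random(0))
```

With this toy vocabulary, `best` is the deterministic split, while repeated calls to `sample_segment` return different valid splits of the same word. A smaller `alpha` flattens the distribution and produces more varied segmentations. Libraries such as SentencePiece offer production implementations of this idea.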