exBERT: Extending Pre-trained Models with Domain-specific Vocabulary Under Constrained Training Resources


Abstract

We introduce exBERT, a training method that extends a BERT model pre-trained on a general domain into a new pre-trained model for a specific domain with an additive, domain-specific vocabulary, under constrained training resources (i.e., limited computation and data). exBERT uses a small extension module that learns to adapt an augmenting embedding for the new domain in the context of the original BERT's embedding of the general vocabulary. The training method is novel in that it learns the new vocabulary and the extension module while keeping the weights of the original BERT model fixed, substantially reducing the required training resources. We pre-train exBERT on biomedical articles from ClinicalKey and PubMed Central, and evaluate it on biomedical downstream benchmark tasks from the MTL-Bioinformatics-2016 dataset. We demonstrate that exBERT consistently outperforms prior approaches when the corpus and pre-training computation are limited.
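The core idea of keeping the original embedding table frozen while an extension module blends in embeddings for new domain-specific tokens can be illustrated with a minimal numpy sketch. All sizes, names (`ORIG_VOCAB`, `gate_w`, `embed`), and the specific gating/blending form below are illustrative assumptions, not the exact exBERT architecture; see the paper for the actual module design.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen original embedding table (general-domain vocabulary) and a small
# trainable extension table for new domain-specific tokens. Sizes are
# illustrative; real models use tens of thousands of tokens and dim ~768.
ORIG_VOCAB, EXT_VOCAB, DIM = 100, 20, 8
orig_emb = rng.normal(size=(ORIG_VOCAB, DIM))  # kept fixed during training
ext_emb = rng.normal(size=(EXT_VOCAB, DIM))    # learned for the new domain

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# A tiny stand-in for the extension module: a learned gate (gate_w is a
# hypothetical parameter) blends the new-domain embedding with a generic
# vector from the original space, so general-domain knowledge is preserved.
gate_w = rng.normal(size=(DIM,))

def embed(token_id):
    """Look up a token; ids >= ORIG_VOCAB belong to the additive vocabulary."""
    if token_id < ORIG_VOCAB:
        return orig_emb[token_id]           # original BERT path, unchanged
    e = ext_emb[token_id - ORIG_VOCAB]
    g = sigmoid(e @ gate_w)                 # scalar gate in (0, 1)
    # Blend with the mean original embedding so the new token stays in the
    # neighborhood of the frozen embedding space.
    return g * e + (1.0 - g) * orig_emb.mean(axis=0)

vec = embed(ORIG_VOCAB + 3)                 # a domain-specific token
print(vec.shape)                            # (8,)
```

During training, only `ext_emb` and `gate_w` would receive gradient updates, which is what makes the approach cheap compared with pre-training the full model from scratch.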

Publication
Findings of the Association for Computational Linguistics: EMNLP 2020