Google Open-Sources ALBERT Natural Language Model
Google AI has open-sourced A Lite BERT (ALBERT), a deep-learning natural language processing (NLP) model that uses 89% fewer parameters than the state-of-the-art BERT model with little loss of accuracy. The model can also be scaled up to achieve new state-of-the-art performance on NLP benchmarks.
The research team described the model in a paper to be presented at the International Conference on Learning Representations. ALBERT uses two optimizations to reduce model size: a factorization of the embedding layer and parameter-sharing across the hidden layers of the network. Combining these two approaches results in a baseline model with only 12M parameters, compared to BERT’s 108M, while achieving an average of 80.1% accuracy on several NLP benchmarks compared with BERT’s 82.3% average. The team also trained a “double-extra-large” ALBERT model with 235M parameters which performed better on benchmarks than the “large” BERT model with 334M parameters.
Recent advances in state-of-the-art NLP models have come from pre-training large models on large bodies of unlabeled text data using “self-supervision” techniques. However, the large size of these models, with hundreds of millions of parameters, presents an obstacle to experimentation. Not only do training time and cost go up with model size, but at some point the models are simply too large to train: they cannot fit in the memory of the training computers. While there are techniques to address this, the Google AI team has identified ways to reduce model size without sacrificing accuracy. With smaller models, the researchers can better explore the hyperparameter space of the models:
in order to improve upon this new approach to NLP, one must develop an understanding of what, exactly, is contributing to language-understanding performance — the network’s height (i.e., number of layers), its width (size of the hidden layer representations), the learning criteria for self-supervision, or something else entirely?
The first of ALBERT’s optimizations is a factorization of the word embeddings. ALBERT, like BERT and many other deep-learning NLP models, is based on the Transformer architecture. The first step in this model is to convert words to numeric “one-hot” vector representations. The one-hot vectors are then projected into an embedding space. A restriction in BERT and similar Transformer models is that the embedding space must have the same dimension as the hidden layers. Projecting a vocabulary of size V into an embedding of dimension E requires V × E parameters. With the large vocabularies and model dimensions needed to achieve state-of-the-art results, this could require close to a billion parameters. By factorizing the embedding, the ALBERT team first projects the word vectors into a smaller-dimensional space: 128 dimensions versus BERT’s 768. Then this smaller embedding is projected into a higher-dimensional space that has the same dimension as the hidden layers. The team posits that the first projection is a context-independent representation of the word, while the second is context-dependent.
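To make the savings from factorization concrete, the following is a minimal sketch that counts embedding parameters with and without the intermediate 128-dimensional space. The function name embedding_params is hypothetical, and the vocabulary size of 30,000 is an assumption for illustration; only the 128 and 768 dimensions come from the description above.

```python
from typing import Optional


def embedding_params(vocab_size: int, hidden_size: int,
                     embedding_size: Optional[int] = None) -> int:
    """Count embedding parameters (hypothetical helper, for illustration only)."""
    if embedding_size is None:
        # Tied case: one V x H lookup table, embedding dimension equals hidden size.
        return vocab_size * hidden_size
    # Factorized case: a V x E lookup table followed by an E x H projection.
    return vocab_size * embedding_size + embedding_size * hidden_size


V = 30_000  # assumed vocabulary size, not stated in the article
H = 768     # hidden-layer dimension
E = 128     # factorized embedding dimension

tied = embedding_params(V, H)
factorized = embedding_params(V, H, E)
print(f"tied: {tied:,} parameters, factorized: {factorized:,} parameters")
```

Under these assumptions the vocabulary table dominates the count, so shrinking its width from H to E removes most of the embedding parameters even after adding the extra E × H projection.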
The second optimization is to share parameters across the network’s layers. Transformer network layers contain both a feed-forward component and an attention component; ALBERT’s strategy is to share each component across all layers. This costs about 1.5 percentage points of accuracy, but it reduces the number of parameters needed from 89M to 12M.
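The effect of cross-layer sharing on model size can be approximated with a rough parameter count. This is a sketch under stated assumptions, not the released implementation: it assumes 12 layers, a hidden size of 768, a feed-forward inner size of 3,072, and a standard Transformer encoder layer layout (four attention projections, two feed-forward projections, two layer norms). The function names are hypothetical, and embeddings are left out of the count.

```python
def layer_params(hidden_size: int, ffn_size: int) -> int:
    """Approximate parameter count of one Transformer encoder layer (hypothetical helper)."""
    # Attention: query, key, value, and output projections, each H x H plus a bias.
    attention = 4 * (hidden_size * hidden_size + hidden_size)
    # Feed-forward: H -> F and F -> H projections, each with a bias.
    feed_forward = (hidden_size * ffn_size + ffn_size) + (ffn_size * hidden_size + hidden_size)
    # Two layer norms, each with a scale and a shift vector of size H.
    layer_norms = 2 * (2 * hidden_size)
    return attention + feed_forward + layer_norms


def encoder_params(num_layers: int, hidden_size: int, ffn_size: int, share: bool) -> int:
    """Total encoder parameters, with or without cross-layer sharing."""
    per_layer = layer_params(hidden_size, ffn_size)
    # With sharing, every layer reuses one set of weights, so depth adds no parameters.
    return per_layer if share else num_layers * per_layer


# Assumed base-sized configuration: 12 layers, H = 768, F = 3072.
unshared = encoder_params(12, 768, 3072, share=False)
shared = encoder_params(12, 768, 3072, share=True)
print(f"unshared: {unshared:,} parameters, shared: {shared:,} parameters")
```

Sharing leaves the amount of computation per forward pass unchanged, since the input still passes through every layer; only the number of stored weights shrinks.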
Google has released a TensorFlow-based implementation of ALBERT as well as models trained on an English-language corpus and a Chinese-language corpus; users on Twitter are now asking if Google has plans to release a model trained on a Spanish-language corpus. The ALBERT code and models are available on GitHub.