Embedding methods such as word2vec and BERT have contributed to stunning leaps in NLP performance and reliability. However, businesses should be aware of the potential pitfalls of embeddings before setting out to train their own language models.
Natural language processing (NLP) models have achieved some stunning advancements in the last few years. The technology has comfortably made the transition from ‘cool’ to ‘useful’ as businesses embrace NLP in real-world use cases, everything from translation to Communications Mining. NLP is now embedded across business, from the back office to front-line services.
But what has brought about this shift in reliability and performance? It’s due to the adoption of the Transformer as the model architecture of choice for NLP, and to improvements in the datasets these models train on.
Our resident research scientist Harshil Shah recently wrote an informative post on embeddings, explaining their history and how some of the most common embedding methods can be improved. Read his article on our Engineering Blog for a more technical deep-dive on the topic.
What is an embedding?
An embedding is a vector of numbers used in machine learning to represent a word or phrase. Words are first encoded into vectors to make them readable and understandable for computers. It is through the embedding training process that AI models learn the meaning or intent behind such words.
By far the most popular embeddings are known as ‘lower-dimensional dense embeddings’. These are embeddings that can help models to understand when two or more words are similar to each other in some way - like ‘apples’ and ‘oranges’.
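To make the idea of similarity concrete, here is a minimal sketch in Python. The three-dimensional vectors are made-up toy values chosen for illustration, not output from word2vec, BERT or any real model, and real embeddings typically have hundreds of dimensions. Cosine similarity, used here, is one common way to compare two embeddings.

```python
import math

# Toy embeddings: hypothetical values for illustration only,
# not taken from any trained model.
toy_embeddings = {
    "apples": [0.9, 0.8, 0.1],
    "oranges": [0.8, 0.9, 0.2],
    "invoice": [0.1, 0.2, 0.9],
}

def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Compare a pair of related words with a pair of unrelated words.
fruit_score = cosine_similarity(toy_embeddings["apples"], toy_embeddings["oranges"])
other_score = cosine_similarity(toy_embeddings["apples"], toy_embeddings["invoice"])

print(f"apples vs oranges: {fruit_score:.2f}")
print(f"apples vs invoice: {other_score:.2f}")
```

With these toy values, ‘apples’ and ‘oranges’ score much closer to 1 than ‘apples’ and ‘invoice’, which is the behaviour a model relies on when it treats two words as similar.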
Why are embeddings important to NLP?
Embeddings are crucial to any NLP task that depends on understanding the meaning or intent of human language. If the embeddings used aren’t informative to the desired task, then the model won’t be able to make accurate predictions.
Embeddings are very important for popular NLP use cases including classification, sentiment analysis and semantic search. Without embeddings to capture the semantic meaning of words, such tasks wouldn’t be possible.
What are the weaknesses of embedding methods?
Embeddings have revolutionised NLP research and have done much to bring the technology into the mainstream for commercial use. However, researchers are still working to resolve some of the innate drawbacks of embedding methods.
Here are the most important ones for businesses to be aware of:
Bias and discrimination
Embedding methods require immense training datasets, often taken unfiltered from the web. Realistically, few businesses could ever hope to properly audit these datasets before training begins. This means that embeddings are susceptible to learning and reinforcing negative stereotypes and harmful ideologies. Outputs from such models could prove offensive or even discriminatory when important decisions are based on them.
While research is actively being done to debias language models, businesses should carefully consider how biased models could warp their intended use case. They are also advised to add a human control to supervise any outputs.
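One simple way to probe an embedding for this kind of bias is to compare how strongly a supposedly neutral word is associated with each of two contrasting words. The sketch below is illustrative only: the two-dimensional vectors are made-up toy values, and the function name association_gap is our own hypothetical helper, not part of any library or the method used in any particular debiasing research.

```python
import math

# Hypothetical toy vectors for illustration only; a real audit would
# load vectors from the embedding model being assessed.
toy_vectors = {
    "he": [1.0, 0.1],
    "she": [0.1, 1.0],
    "engineer": [0.9, 0.3],
    "table": [0.5, 0.5],
}

def cosine(a, b):
    """Return the cosine similarity between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def association_gap(word, first, second, vectors):
    """Similarity of `word` to `first` minus its similarity to `second`.

    A value near zero means the word sits evenly between the two;
    a large positive or negative value means it leans to one side.
    """
    return cosine(vectors[word], vectors[first]) - cosine(vectors[word], vectors[second])

engineer_gap = association_gap("engineer", "he", "she", toy_vectors)
table_gap = association_gap("table", "he", "she", toy_vectors)

print(f"engineer gap: {engineer_gap:.2f}")
print(f"table gap: {table_gap:.2f}")
```

In this toy data, ‘engineer’ leans clearly towards ‘he’ while ‘table’ sits evenly between the two. A check like this only flags words for human review; it does not remove bias from a model.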
Monetary and environmental cost
Most modern embedding methods involve training models with billions of parameters on clusters of thousands of GPUs for long periods of time. Most businesses lack the resources to cover the energy costs required. It’s also important to consider the environmental impact of training these massive models.
In most cases, businesses shouldn’t attempt to train their own NLP models. Instead, they are advised to leverage the models and services of experienced technology providers that have the expertise to make the training process more efficient.