Momentum is building around natural language processing (NLP), as business and transformation leaders embrace the technology to increase scale, drive improvement and further their digital transformation objectives. However, the enterprise was not always so enthusiastic about NLP. It’s only in the last few years that NLP has become truly ready for business, displaying the performance and accuracy needed for critical business processes.
A big part of this success story has been the Transformer - a machine learning architecture that has greatly expanded the capabilities and performance of natural language understanding technology. Our resident senior research scientist Harshil Shah recently wrote a great technical rundown of Transformers, their development, and how to make them more efficient and performant.
What are Transformers?
If you were expecting Autobots or Decepticons, you may be disappointed.
In the field of machine learning, a Transformer is an important deep learning model architecture that uses the mechanism of self-attention to differentially weight the significance of each part of an input.
To put this more simply, Transformers are AI models able to take into account the relationship between all the different parts of a sequence - such as words in a sentence - before making a prediction or creating an output. This is all thanks to the self-attention technique, which gives Transformers a longer ‘memory’ than other models, enabling them to consider far more points in a data set.
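For readers who like to see the idea in code, here is a minimal sketch of self-attention in plain Python. It is a simplification under stated assumptions: the function names and the toy two-number "word" vectors are made up for illustration, and each word is compared directly with the others, whereas a real Transformer first passes the words through learned projections.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Simplified self-attention: every position attends to every position.

    Assumption: queries, keys and values are the input vectors themselves;
    a real Transformer derives them through learned projections.
    """
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Score this position against every position in the sequence.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        weights = softmax(scores)
        # The output is a weighted average of all the vectors.
        mixed = [sum(w * vec[i] for w, vec in zip(weights, vectors))
                 for i in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Three toy "word" vectors (made-up numbers, purely illustrative).
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(sentence)
```

Each row of `weights` shows how much one word draws on every other word, which is the differential weighting described above. Here the first word gives more weight to itself and the third word than to the second, because those vectors are the most similar to it.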
Why are Transformers important?
Introduced by Google Brain in 2017, Transformers have rapidly become the model of choice for NLP applications, replacing recurrent neural networks (RNNs) such as long short-term memory (LSTM) networks. In many use cases, Transformers have also proven complementary to other neural architectures like RNNs and convolutional neural networks.
This shift towards Transformers has been driven largely by the innate advantages of the attention mechanism. This enables Transformers to look at all elements of a sequence at once, while also paying attention to the most important elements of the sequence. Previous models could usually do one or the other, but not both at the same time.
As a result, Transformers can be more accurate because they can capture the relationship between elements that are far from each other. They are also faster to train, since they process all the elements of a sequence in parallel rather than one at a time, as an RNN does.
Transformers continue to be the focus of much research and innovation. Recent NLP models like Google’s BERT and OpenAI’s GPT-3 have delivered landmark improvements in accuracy, performance and usability across many natural language tasks - including understanding text, performing sentiment analysis, answering questions, and generating text.
These improvements have seen Transformers become pervasive in many business applications like machine translation, conversational chatbots and, of course, in NLP platforms like Re:infer.
Building a better Transformer
The self-attention technique is at the heart of Transformer power and value for business applications. However, there’s a sting in the tail.
The Transformer’s self-attention mechanism carries a high computational workload: because every element is compared with every other element, the cost grows with the square of the sequence length. In training the models to learn new concepts, and in using them to make predictions, Transformers can consume considerable resources - time, energy and money. Large Transformer models have facilitated huge improvements in natural language understanding accuracy and performance. But there’s definitely a high barrier to entry.
Fortunately, there are ways to get over this hurdle. There are alternative attention mechanisms to self-attention that are more computationally efficient, without sacrificing performance.
One way to reduce the compute load is by restricting how many words in the sequence the self-attention mechanism looks at. BlockBERT, for example, does this by splitting the sequence into chunks, so that the words in each chunk only look at the words within a single chunk - either their own or one of the others. The Longformer technique also restricts how many words the self-attention mechanism looks at. It does this by combining several simpler attention patterns, such as a sliding window over nearby words and global attention for a handful of selected positions.
In both methods, after a few layers have been completed, the entire sequence will have been considered by the Transformer model - faster and at lower cost.
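A small sketch can show why restricted attention still ends up covering the whole sequence once layers are stacked. It uses a plain sliding window, one of the simpler patterns mentioned above, rather than the exact BlockBERT or Longformer schemes. The function names, the sequence length of 16 and the window radius of 2 are illustrative assumptions.

```python
def window_mask(length, radius):
    """mask[i][j] is True when position i may attend to position j."""
    return [[abs(i - j) <= radius for j in range(length)]
            for i in range(length)]

def layers_until_full_coverage(mask):
    """Count the stacked layers needed before every position has
    (directly or indirectly) seen every other position."""
    n = len(mask)
    reach = [row[:] for row in mask]  # what each position sees after one layer
    layers = 1
    while not all(all(row) for row in reach):
        if layers >= n:
            raise ValueError("this pattern never covers the whole sequence")
        # One more layer: i now reaches whatever is reachable from any
        # position that i attends to directly.
        reach = [[any(mask[i][k] and reach[k][j] for k in range(n))
                  for j in range(n)]
                 for i in range(n)]
        layers += 1
    return layers

length, radius = 16, 2
mask = window_mask(length, radius)

full_pairs = length * length                      # 256 comparisons per layer
windowed_pairs = sum(sum(row) for row in mask)    # 74 comparisons per layer
layers_needed = layers_until_full_coverage(mask)  # 8 layers
```

In this toy setting each layer does well under a third of the comparisons that full self-attention would, and eight stacked layers are enough for information from any word to reach any other. The saving grows with sequence length, because the windowed cost rises in proportion to the length rather than to its square.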
Another method is to use low-rank approximation. The Linformer compresses the full sequence down into a shorter one, which every word can then attend to in full. Compared to standard self-attention, the Linformer is up to 3 times faster for short sequences and 3–20 times faster for long sequences.
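The low-rank idea can be sketched in the same spirit. This is a hedged illustration, not the Linformer implementation: the real model learns its projection during training, whereas this sketch substitutes a fixed averaging projection (a hypothetical `pooling_projection` helper) and assumes the sequence length divides evenly by the shorter length.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def softmax(scores):
    """Turn a list of raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def pooling_projection(short_len, length):
    """A short_len x length matrix that averages consecutive chunks.

    Assumption: a stand-in for the learned projection; requires
    length to be divisible by short_len.
    """
    chunk = length // short_len
    return [[1.0 / chunk if r * chunk <= c < (r + 1) * chunk else 0.0
             for c in range(length)]
            for r in range(short_len)]

def low_rank_attention(queries, keys, values, projection):
    """Attend from every query to a shortened set of keys and values."""
    dim = len(queries[0])
    short_keys = matmul(projection, keys)      # short_len x dim
    short_values = matmul(projection, values)  # short_len x dim
    outputs = []
    for query in queries:
        # Only short_len scores per query, instead of one per word.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in short_keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[i] for w, v in zip(weights, short_values))
                        for i in range(dim)])
    return outputs

# Eight toy "word" vectors (made-up numbers), compressed to length 2.
tokens = [[float(i), 1.0] for i in range(8)]
projection = pooling_projection(2, len(tokens))
outputs = low_rank_attention(tokens, tokens, tokens, projection)
```

Every one of the eight words still produces an output, but each compares itself with only two compressed positions rather than all eight, which is where the saving comes from.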
The Re:infer development and research teams often make use of these techniques to make the running of our Conversational Data Intelligence platform more efficient and less costly. Business and intelligent automation leaders will find these methods very useful in squeezing optimal value and performance from their own Transformer-based systems.
Read Harshil’s full blog for the complete technical rundown of these alternative attention mechanisms, and stay tuned for his next blog on how to develop more efficient Transformer approaches.