The Transformer is a neural network architecture in the same family as the Feed-Forward Neural Network (FFNN), the Convolutional Neural Network (CNN), and the Recurrent Neural Network (RNN). At the time of writing, the Transformer is the preferred architecture for building AI systems. Here, we explain why it performs so well, what its limitations are, and what to consider when choosing a Transformer as the neural architecture. If you are interested in the underlying math, the seminal paper "Attention Is All You Need" is a great starting point.

Why Does the Transformer Perform So Well?

  • Transformers can see the entire past and future all at once! Alternative architectures have either limited visibility (e.g., FFNN, CNN) or sequential visibility (e.g., RNN).
  • Transformers are attentive! They use the attention mechanism, which lets them focus on what matters at any given moment (see the sketch after this list).
  • Transformer computation is extremely hardware-friendly. The underlying math maps naturally onto the parallel processing capabilities of GPUs and modern CPUs, so Transformers are fast to train and run inference with.
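
To make the attention point concrete, below is a minimal sketch of scaled dot-product attention, the core operation of the Transformer described in "Attention Is All You Need". It is an illustrative NumPy-only implementation; the function names and toy shapes are our own, not taken from any library.

```python
# Minimal sketch of scaled dot-product attention (illustrative, NumPy only).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: (seq_len, d_k) arrays. Every position attends to every other
    position through a single matrix multiplication, which is why the
    computation parallelizes so well on GPUs."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len, seq_len) similarity matrix
    weights = softmax(scores, axis=-1)   # attention weights: "what matters"
    return weights @ V                   # weighted sum of the values

# Toy self-attention over a sequence of 4 tokens with 8-dimensional vectors.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # (4, 8)
```

Note that the `scores` matrix is seq_len x seq_len; that is exactly where the quadratic scaling discussed in the next section comes from.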

What Are the Drawbacks of Transformers?

  • The Transformer sees the past and future all at once, but its runtime complexity grows quadratically with the input length. For example, if processing a 1-second file with a Transformer model takes 1 second, processing a 10-second file takes roughly 100 seconds.
  • The Transformer has no built-in concept of order in time, even though it can see all the past and future! The standard workaround is to inject order explicitly, for example with positional encodings (see the sketch after this list).
  • The Transformer architecture is not well suited to streaming real-time applications, because attention looks at the entire input sequence at once.
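
As a concrete example of such a workaround, below is a minimal sketch of the sinusoidal positional encoding proposed in "Attention Is All You Need". The function name and toy shapes are our own; in practice these encodings are added to the token embeddings before the first Transformer layer.

```python
# Minimal sketch of sinusoidal positional encoding (illustrative, NumPy only).
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Returns a (seq_len, d_model) array with one encoding vector per position.
    Even dimensions use sine and odd dimensions use cosine, with wavelengths
    forming a geometric progression; this gives every position a unique,
    order-aware signature."""
    positions = np.arange(seq_len)[:, None]            # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]           # (1, d_model / 2)
    angles = positions / (10000 ** (dims / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positional_encoding(seq_len=4, d_model=8)
print(pe.shape)  # (4, 8)
```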

Transformer for a New Project

One should consider the Transformer as a candidate architecture for any new project. Transformers have reached, and in many cases surpassed, the previous state of the art in NLP, computer vision, and speech applications. Additionally, extensive software support exists for implementing, training, and deploying Transformers.
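
As one illustration of that software support, the snippet below uses the Hugging Face transformers library to run a pretrained Transformer in a few lines. The choice of library and task here is only an example of how mature the tooling has become, not a recommendation of a particular toolkit.

```python
# Illustrative example using the Hugging Face `transformers` library
# (pip install transformers). The pipeline downloads a default pretrained
# Transformer model for the requested task.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
print(classifier("Transformers are remarkably easy to put to work."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```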

Transformer for an Existing Product

It depends on how well the baseline (existing model) is performing. Remember that beating a well-trained and well-tuned model can be a massive undertaking. Additionally, some of the product requirements (e.g., latency, memory usage) can be showstoppers for bringing a Transformer on board.
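
For a rough sense of the memory requirement, the back-of-envelope sketch below estimates the weight footprint of a hypothetical Transformer from its hyperparameters. The per-layer parameter count is a simplification (it ignores embeddings, biases, and layer norms), and the configuration shown is only illustrative.

```python
# Back-of-envelope weight memory for a hypothetical Transformer encoder.
# Rough per-layer parameter count: ~4*d^2 for the attention projections plus
# ~8*d^2 for the feed-forward block (with the usual 4*d hidden size).
def transformer_weight_memory_mb(num_layers, d_model, bytes_per_param=4):
    params_per_layer = 4 * d_model**2 + 8 * d_model**2
    total_params = num_layers * params_per_layer
    return total_params * bytes_per_param / 1e6

# A BERT-base-sized configuration (12 layers, d_model = 768) in fp32:
print(f"{transformer_weight_memory_mb(12, 768):.0f} MB")  # ~340 MB of weights
```

Halving bytes_per_param (e.g., fp16 instead of fp32) halves the footprint, which is often the first knob to turn when memory is the showstopper.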