The Transformer is a neural network architecture, like the Feed-Forward Neural Network (FFNN), Convolutional Neural Network (CNN), and Recurrent Neural Network (RNN). At present, it is the preferred architecture for building AI systems. Here, we explain why it performs so well, what its limitations are, and which factors to consider when choosing a Transformer as the neural architecture. If you are interested in the underlying math, the seminal paper "Attention Is All You Need" is a great starting point.
Why Does the Transformer Perform So Well?
- Transformers can see the entire input, past and future, all at once! The alternative architectures have limited visibility (e.g., FFNN, CNN) or sequential visibility (e.g., RNN).
- Transformers are attentive! Transformers use the Attention Mechanism, which enables them to focus on what matters at any given instant.
- Transformer computation is extremely hardware-friendly. The underlying math consists largely of matrix multiplications, so it can take full advantage of the parallel processing capabilities of GPUs and modern CPUs. As a result, Transformers are typically fast to train and run on such hardware.
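To make the first two points concrete, here is a minimal sketch of self-attention using only the Python standard library. It is an illustration under simplifying assumptions, not a full Transformer layer: there is a single head, no learned weights (queries, keys, and values are all the raw input vectors), and the names `softmax`, `self_attention`, and `tokens` are made up for this example.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Single-head self-attention with Q = K = V = x (assumption: no learned weights).

    x is a list of n vectors, each of dimension d.
    Returns (outputs, weights), where weights is an n x n matrix.
    """
    d = len(x[0])
    scale = math.sqrt(d)
    weights = []
    for q in x:
        # Each position scores itself against EVERY position, past and future.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in x]
        weights.append(softmax(scores))
    # Each output is a weighted average of all input vectors.
    outputs = [
        [sum(w * v[j] for w, v in zip(row, x)) for j in range(d)]
        for row in weights
    ]
    return outputs, weights


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1 and says how much one position "attends" to every other position. The rows are computed independently of each other, which is why real implementations can express the whole step as a few matrix multiplications and run it in parallel.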
What are the Drawbacks of Transformers?
- A Transformer sees past and future all at once, but its runtime grows quadratically with the input length. For example, if processing a 1-second file with a Transformer model takes 1 second, processing a 10-second file takes roughly 100 seconds.
- A Transformer has no built-in concept of order in time, even though it can see all of the past and future! There are workarounds, such as adding positional encodings to the input.
- The standard Transformer architecture is not well suited to streaming real-time applications, since it expects the whole input to be available up front.
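The first two drawbacks above can be sketched in a few lines of Python. The first function counts the query-key pairs that self-attention has to score, which is where the quadratic growth comes from. The second shows one common workaround for the missing sense of order: the sinusoidal positional encoding described in "Attention Is All You Need", added to each input vector. The function names and the toy input are assumptions made for this illustration.

```python
import math


def attention_pairs(n):
    """Number of query-key pairs self-attention scores for n input positions."""
    return n * n


def sinusoidal_position(pos, d_model):
    """Sinusoidal encoding of one position (even indices: sin, odd indices: cos)."""
    enc = []
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        enc.append(math.sin(angle))
        enc.append(math.cos(angle))
    return enc[:d_model]  # trim in case d_model is odd


def add_positions(x):
    """Add a position-dependent vector to each input vector in the sequence x."""
    d = len(x[0])
    return [
        [v + p for v, p in zip(vec, sinusoidal_position(pos, d))]
        for pos, vec in enumerate(x)
    ]


# A 10x longer input needs 100x as many pairwise scores.
print(attention_pairs(10) // attention_pairs(1))  # prints 100

# Two identical vectors become distinguishable once positions are added.
same = [[0.5, 0.5, 0.5, 0.5], [0.5, 0.5, 0.5, 0.5]]
with_pos = add_positions(same)
print(with_pos[0] != with_pos[1])  # prints True
```

Without the added position vectors, the two identical inputs would be indistinguishable to the attention step no matter where they appear in the sequence.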
Transformer for a New Project
One should consider the Transformer as a candidate architecture for a new project. Transformers have matched or surpassed the previous state of the art in NLP, computer vision, and speech applications. Additionally, extensive software support exists for implementing, training, and deploying Transformers.
Transformer for an Existing Product
Whether to adopt a Transformer depends on how well the baseline (the existing model) is performing. Remember that beating a well-trained and well-tuned model can be a massive undertaking. Additionally, some product requirements (e.g., latency or memory usage) can be showstoppers for bringing a Transformer on board.







