Large language models (LLMs), especially the open-source ones, have taken the world by storm. Open-source models have proven competitive with commercial alternatives, especially once fine-tuned and optimized. They also do not require enterprises to send their data to third-party remote servers, which helps protect user privacy.
In this article, we’ll review the top open-source pre-trained large language models: LLaMA by Meta, Mistral 7B by Mistral, Falcon LLM by TII, GPT-2 by OpenAI, GPT-J by EleutherAI, MPT by MosaicML, and BLOOM by BigScience.
The list below was prepared in October 2023 using Hugging Face’s Open LLM Leaderboard, which changes frequently as models evolve.
In February 2023, Meta released the first version of LLaMA, claiming that LLaMA with 13B parameters outperforms GPT-3 with 175B parameters on many NLP benchmarks. Meta developed the first version of LLaMA for research purposes and released it under a noncommercial license that required researchers to fill out a request form. However, the weights leaked within two weeks. In July 2023, Meta released LLaMA 2, trained on 40% more data than LLaMA and with double the context length. Meta also released fine-tuned versions: LLaMA 2-Chat, optimized for conversations, and Code Llama, optimized for code generation. This time, Meta loosened the restrictions on the models and allowed commercial use as well.
Alpaca, Alpaca-LoRA, Koala, QLoRA, llama.cpp, Vicuna, StableBeluga, Giraffe, and Vigogne are some popular derivatives of, and projects built around, LLaMA developed by universities and enterprises.
Mistral is a Paris-based startup founded by former Meta and Google researchers. The company released its first large language model, Mistral 7B, in September 2023. In the announcement, Mistral claimed that the 7B-parameter model outperformed all existing open-source large language models with up to 13B parameters on all standard English and code benchmarks. Mistral released the model under the Apache 2.0 license, without any restrictions on use or reproduction.
Falcon LLM was released by the Technology Innovation Institute (TII), based in Abu Dhabi. The initial model with 40B parameters gained popularity among researchers and developers within days, as it was released with weights for both research and commercial purposes. TII researchers claim that although Falcon LLM with 180B parameters is slightly behind GPT-4 by OpenAI, it outperforms LLaMA 2 by Meta and performs on par with Google's PaLM 2 Large.
EleutherAI is a non-profit research institute. It released GPT-J in 2021. At that time, GPT-J was the largest publicly available GPT-3-style model in the world and a competitive alternative to OpenAI’s GPT-3. The model was trained on the Pile dataset, which consists of about 825 GiB (roughly 900 GB) of diverse English text for LLM training and was open-sourced by EleutherAI.
Databricks fine-tuned GPT-J on the Stanford Alpaca corpus and released Dolly in March 2023, at a reported training cost of about $30, showing that even older and smaller models can achieve competitive results with the right fine-tuning.
Later, EleutherAI released Pythia, and Databricks released the second version of Dolly. Dolly 2 was fine-tuned on an instruction-following dataset crowdsourced among Databricks employees. Databricks also open-sourced the training code, the dataset, and the model weights for both non-commercial and commercial use.
The Transformer is the most common architecture for building large language models.
MPT is another open-source GPT-style model that is available for commercial use. MosaicML released the first version with 7B parameters in May 2023 and claimed that it performed on par with LLaMA. MPT leverages recent techniques in LLM modeling: FlashAttention for efficiency, ALiBi (Attention with Linear Biases) for context length extrapolation, and stability improvements to mitigate loss spikes. A month later, MosaicML released MPT with 30B parameters, and Databricks announced that it would acquire MosaicML for $1.3B.
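To make the ALiBi idea concrete, here is a minimal pure-Python sketch of the bias term as described in the original ALiBi paper. It is not MosaicML's implementation: the function names `alibi_slopes` and `alibi_bias` are hypothetical, the sketch assumes the number of attention heads is a power of two, and it folds the causal mask into the bias for simplicity. Instead of adding positional embeddings, each head subtracts a penalty proportional to the distance between the query and the key, which is what lets a model handle sequences longer than those seen in training.

```python
def alibi_slopes(num_heads):
    """Per-head slopes: a geometric sequence 2^(-8/n), 2^(-16/n), ..., 2^(-8).

    Assumption: num_heads is a power of two (the simple case in the ALiBi paper).
    """
    if num_heads <= 0 or num_heads & (num_heads - 1):
        raise ValueError("this sketch only supports a power-of-two head count")
    start = 2.0 ** (-8.0 / num_heads)
    return [start ** (h + 1) for h in range(num_heads)]


def alibi_bias(num_heads, seq_len):
    """Return bias[head][query][key], to be added to attention scores before softmax.

    Past and current positions (key <= query) get -slope * distance.
    Future positions get -inf, which acts as the causal mask.
    """
    neg_inf = float("-inf")
    return [
        [
            [-m * (i - j) if j <= i else neg_inf for j in range(seq_len)]
            for i in range(seq_len)
        ]
        for m in alibi_slopes(num_heads)
    ]


if __name__ == "__main__":
    bias = alibi_bias(num_heads=8, seq_len=4)
    # The first head has the steepest slope, so distant tokens are penalized most.
    for row in bias[0]:
        print(row)
```

Because the penalty depends only on relative distance, the same function works for any `seq_len`, including lengths beyond the training context.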
BLOOM, released in July 2022, stands for BigScience Large Open-science Open-access Multilingual Language Model. The project was started by Hugging Face and completed with the involvement of over 1,000 AI researchers. It aims to enable public research on large language models. Hence, the intended users are non-commercial entities. BLOOM-LoRA and Petals are projects built on BLOOM.
GPT-2 is the oldest model in this list. It was released in 2019 and superseded by the GPT-3 and GPT-4 models, which are not open source. Initially, GPT-2 was not reproducible because OpenAI did not release the fully trained model or the training corpora, citing concerns about misuse and abuse. Later, OpenAI released the full version, as it had not seen strong evidence of misuse.
If you’re looking for more comparisons, do not forget to check speech-to-text, natural language understanding, noise suppression, speaker diarization, and text-to-speech engine comparisons, or engage with Picovoice Consulting to find the best solution for your needs.