Large language models are relatively new and capable of many things. Hence, they lack standard metrics, like WER for speech-to-text, or established scientific approaches, like speech intelligibility for noise suppression.

Some analyses have used perplexity to measure language model performance. Perplexity is familiar to those who work with NLP or speech recognition models: it measures how well a model predicts a sample of text, so the lower the perplexity, the better. However, perplexity is not suitable for measuring accuracy. A model may have low perplexity yet a high error rate.
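To make the definition concrete, perplexity can be computed from the per-token log-probabilities a model assigns to a text sample. The sketch below is a minimal, framework-agnostic illustration; the function name and inputs are hypothetical, not taken from any particular toolkit:

```python
import math

def perplexity(token_log_probs):
    # Perplexity is the exponential of the average negative
    # log-probability the model assigns to each token.
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# A model that assigns probability 0.25 to every token of a
# 4-token sample has a perplexity of 4: on average, it is as
# uncertain as a uniform choice among 4 tokens.
print(perplexity([math.log(0.25)] * 4))  # ≈ 4.0
```

Note that two models can reach the same perplexity while making very different kinds of errors, which is why perplexity and accuracy are treated separately here.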

A good benchmark for language model evaluation should be holistic, covering various tasks such as text completion, sentiment analysis, question answering, summarization, and translation, and it should account for biases and hallucinations. A good test dataset should be diverse. This article discusses some common frameworks and how to approach language model evaluations.

How to Evaluate Large Language Models

Lack of transparency and standardization are the main issues in comparing language models, just as in evaluating wake word, speech-to-text, or noise suppression engines. Different models may be tested under different scenarios. Some vendors may choose test data that favors their models, and some may call their models the “best” without backing their claims.

Picovoice publishes and supports open-source benchmarks that bring transparency to the industry. Hence, we picked three open-source language model evaluation frameworks: HELM by Stanford CRFM, LM Evaluation Harness by EleutherAI, and Model Gauntlet by MosaicML.

1. Holistic Evaluation of Language Models (HELM) by Stanford CRFM:

HELM by Stanford CRFM covers 42 scenarios and 59 metrics. These scenarios include question answering, information retrieval, sentiment analysis, and toxicity detection. Metrics include accuracy (F1, ROUGE-2, etc.), robustness (F1 under perturbations such as typos and synonym substitutions), and efficiency (observed inference runtime). Stanford CRFM introduced 21 new language model evaluation scenarios with HELM and continues to invest in it as language models evolve.
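HELM’s robustness metrics compare a model’s score on clean inputs against its score on perturbed inputs (e.g., with typos injected). The sketch below is a loose, hypothetical illustration of that idea, not HELM’s actual implementation; the function names and the swap-based typo model are assumptions for the example:

```python
import random

def typo_perturb(text, rate=0.1, seed=0):
    # Simulate typos by swapping random adjacent letter pairs.
    rng = random.Random(seed)
    chars = list(text)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def robustness_gap(model, metric, inputs, references, rate=0.1):
    # Average score drop when the same inputs are perturbed.
    clean = sum(metric(model(x), y) for x, y in zip(inputs, references))
    noisy = sum(metric(model(typo_perturb(x, rate)), y)
                for x, y in zip(inputs, references))
    return (clean - noisy) / len(inputs)
```

A small gap suggests the model degrades gracefully under noisy input; a large gap flags brittleness.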

2. Language Model (LM) Evaluation Harness by EleutherAI:

LM Evaluation Harness by EleutherAI standardizes accuracy and reliability measures of language models. LM Evaluation Harness allows developers to test language models with minimal effort by unifying the frameworks, standardizing codebases, and minimizing repetition. Currently, LM Evaluation Harness has more than 200 tasks, measuring the performance of language models through metrics such as F1, word perplexity, and ROUGE-2.

3. Model Gauntlet by MosaicML:

Model Gauntlet by MosaicML aggregates results from 34 test groups collected from different sources and grouped into six broad competency categories. Competency categories include world knowledge, reasoning, language understanding, problem-solving, reading comprehension, and programming. By aggregating results from various test groups, Model Gauntlet helps developers get more robust estimates for the overall performance of models rather than focusing on specific areas and tasks.

How to Approach Large Language Model Evaluations:

1. Define Primary Tasks and Success Metrics:

LLMs are generalist models used for different purposes. The frameworks mentioned above offer a generic comparison. However, each use case is unique, requiring a unique approach. To find the best language model for your use case, start working backward from the customer.

Determine the primary tasks, i.e., what matters for the customer, and pick metrics accordingly. For example, a product team building a translation app should weigh BLEU (Bilingual Evaluation Understudy) higher. Likewise, summarization applications should prioritize ROUGE (Recall-Oriented Understudy for Gisting Evaluation). World knowledge is not that crucial for domain-specific applications. However, it’s critical for general applications, such as ChatGPT.
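To make the metric choice concrete, here is a simplified ROUGE-2 recall computation, the bigram-overlap variant mentioned earlier. Production implementations (e.g., the official ROUGE toolkit) add stemming and other preprocessing; this pure-Python sketch only shows the core idea:

```python
from collections import Counter

def bigrams(tokens):
    return [tuple(tokens[i:i + 2]) for i in range(len(tokens) - 1)]

def rouge2_recall(candidate, reference):
    # Fraction of the reference's bigrams that the candidate recovers.
    cand = Counter(bigrams(candidate.lower().split()))
    ref = Counter(bigrams(reference.lower().split()))
    if not ref:
        return 0.0
    overlap = sum(min(count, ref[bg]) for bg, count in cand.items() if bg in ref)
    return overlap / sum(ref.values())

# The candidate recovers 3 of the reference's 5 bigrams.
print(rouge2_recall("the cat sat on the mat", "the cat sat on a mat"))  # 0.6
```

A summarization team would track this score (typically alongside ROUGE-1 and ROUGE-L) on its own reference summaries rather than on generic benchmark data.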

2. Understand Model Capability and Limitations

Vendors often do not disclose training data sets. However, enterprises should understand the training data. For example, OpenAI shares that ChatGPT can be biased and discriminatory, asking users to thumb down such responses to improve the model. OpenAI also explains how it developed ChatGPT, including the use of personal information. These can be deal-breakers for enterprises prioritizing ethical AI. Big Tech has a controversial history of improving its services with user data at the expense of user privacy. Understanding a language model's capabilities and limitations also enables users to craft well-defined prompts and achieve the desired outcomes.

3. Develop a Holistic Approach

Accuracy, fairness, robustness, explainability, adaptability, integrations, model size, platform support, scalability, and ease of use are all crucial factors in choosing the language model that addresses user needs.

  • Accuracy compares model output against the expected results.
  • Fairness evaluates whether the model is biased towards specific groups or produces prejudiced outcomes.
  • Robustness shows a model’s ability to perform effectively across diverse conditions.
  • Explainability focuses on fostering user trust and ensuring model accountability.
  • Adaptability refers to fine-tuning language models. Foundation models may perform well out of the box but may not fit domain-specific applications.
  • Integrations affect the development and iteration processes. Some vendors offer easy-to-use APIs or SDKs for seamless integration, while others do not.
  • Model size determines the computational requirements and, hence, the number of supported platforms and the cost of training and inference.
  • Scalability determines the volume of data the language model can process. It is especially critical for streaming applications, such as real-time question answering.
  • Ease of use impacts the time and effort required for development and maintenance.
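As a concrete example of scoring accuracy against expected results, question-answering evaluations often use token-overlap F1 (popularized by the SQuAD benchmark). This is a minimal sketch of that metric, without the answer normalization (punctuation and article stripping) that full implementations apply:

```python
from collections import Counter

def token_f1(prediction, reference):
    # Harmonic mean of token precision and recall between the
    # predicted answer and the reference answer.
    pred = prediction.lower().split()
    ref = reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

# "paris france" matches the reference "paris" on one of its
# two tokens: precision 0.5, recall 1.0, F1 ≈ 0.67.
print(token_f1("paris france", "paris"))
```

Unlike exact match, token-level F1 gives partial credit, which matters when a model’s answer is correct but phrased differently from the reference.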

Next Steps:

Language model evaluation frameworks show the overall performance of language models. However, the best approach for enterprises is to evaluate models against their own use cases, using their own test data and judgment. Some use cases, like summarization, require human evaluation more than others, as existing reference datasets may not be comprehensive. If you’re unsure how to tailor language model evaluation frameworks to your application, engage with Picovoice Consulting and work with experts.

Consult an Expert