Audio sampling, or sampling, refers to the process of converting a continuous analog audio signal into a discrete digital signal. This is achieved by taking “snapshots”, i.e., samples, of the audio signal at regular intervals.

The sampling rate is the number of samples per second, measured in hertz (Hz), taken from a continuous signal to make a discrete, or digital, signal. In simpler terms, it's how many times per second an audio signal is measured and its level recorded. The higher the sample rate, the more “snapshots” are taken and the more detailed the digital representation of the sound wave.
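
To make this concrete, here is a minimal Python sketch (assuming NumPy; the 440 Hz test tone and one-second duration are arbitrary choices) that samples the same sine wave at two different rates and shows that a higher rate simply means more snapshots per second.

```python
import numpy as np

def sample_sine(freq_hz, sample_rate_hz, duration_s=1.0):
    """Return `duration_s` seconds of a sine wave sampled at `sample_rate_hz`."""
    t = np.arange(int(duration_s * sample_rate_hz)) / sample_rate_hz
    return np.sin(2 * np.pi * freq_hz * t)

low = sample_sine(440, 8_000)    # 8 kHz -> 8,000 snapshots per second
high = sample_sine(440, 48_000)  # 48 kHz -> 48,000 snapshots per second
print(len(low), len(high))       # 8000 48000
```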

Common Audio Sample Rates

There are several common sample rates used in digital audio. The choice of sample rate often depends on the intended use:

  • 96 kHz and 192 kHz: These high-definition rates are reserved for professional music production and certain streaming services catering to audiophiles.
  • 48 kHz: Adopted by the film and TV industry for clear audio syncing with video.
  • 44.1 kHz: The de facto standard for music CDs and most digital audio players, balancing quality with file size.
  • 16 kHz: Strikes a balance between quality and file size; it is common in voice commands and speech recognition technologies. (Yes, Picovoice engines require 16 kHz; the sketch after this list shows how to check a file's rate.)
  • 8 kHz: This low rate is used when bandwidth is limited; it is typical of telecommunication systems such as old-school phone calls.
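
To check which of these rates a recording actually uses, for example before handing it to a 16 kHz speech engine, Python's built-in wave module can report the rate; this is just a sketch, and the file name below is a placeholder.

```python
import wave

# Replace "recording.wav" with the path to your own file.
with wave.open("recording.wav", "rb") as wav_file:
    rate = wav_file.getframerate()      # samples per second (Hz)
    channels = wav_file.getnchannels()  # 1 = mono, 2 = stereo
    print(f"{rate} Hz, {channels} channel(s)")
```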

Upsampling and Downsampling

Upsampling and downsampling are the processes of changing an audio file’s sample rate:

Upsampling: As the name suggests, upsampling is the process of increasing the sample rate. While it doesn't improve the original audio quality beyond its initial recording, it can make the audio compatible with systems that require a higher sample rate or improve the performance of certain digital audio processing effects.
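
As a rough illustration (assuming SciPy and NumPy are available), a polyphase resampler can raise 16 kHz audio to 48 kHz; the 3:1 ratio comes from 48,000 / 16,000, and the input here is just placeholder noise.

```python
import numpy as np
from scipy.signal import resample_poly

audio_16k = np.random.randn(16_000)  # placeholder: one second of 16 kHz audio

# 48,000 / 16,000 = 3 / 1: interpolate by 3, decimate by 1.
audio_48k = resample_poly(audio_16k, up=3, down=1)
print(len(audio_48k))  # 48000 samples for the same one second of audio
```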

Downsampling: Downsampling refers to the process of decreasing the sampling rate. This is typically done to reduce the file size of an audio signal or to make it compatible with another system. Downsampling can result in a loss of audio quality if not done correctly, as it involves discarding data. A low-pass filter is typically applied before downsampling to prevent aliasing, the distortion that occurs when frequency content above half the new sample rate folds back into the retained band, so that the signal reconstructed from the samples no longer matches the original.
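
The sketch below (again assuming SciPy) downsamples 44.1 kHz audio to 16 kHz with a polyphase resampler, which applies an anti-aliasing FIR low-pass filter before decimating; the 160/441 ratio is 16,000/44,100 reduced to lowest terms.

```python
import numpy as np
from scipy.signal import resample_poly

audio_44k = np.random.randn(44_100)  # placeholder: one second of 44.1 kHz audio

# 16,000 / 44,100 reduces to 160 / 441. resample_poly low-pass filters
# below the new Nyquist frequency (8 kHz) before decimating, preventing aliasing.
audio_16k = resample_poly(audio_44k, up=160, down=441)
print(len(audio_16k))  # 16000 samples for the same one second of audio
```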

16 kHz is a popular sampling rate for several reasons. First and foremost, it strikes a balance between audio quality and file size: at 16 kHz, audio files are small enough to store and transmit while offering reasonable audio quality. Secondly, the frequencies most critical to the human voice lie between 300 Hz and 3400 Hz. The Nyquist-Shannon sampling theorem states that a sampling rate of at least twice the highest frequency of interest is required to represent a signal accurately. 16 kHz is more than twice 3400 Hz and therefore sufficient for processing the human voice. That’s why 16 kHz has become a standard in applications built around human speech and voice.
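
A quick numerical check of this reasoning (a sketch assuming NumPy): a 3400 Hz tone, at the top of that voice band, sits comfortably below the Nyquist frequency of 16,000 / 2 = 8,000 Hz, so a 16 kHz recording preserves it without aliasing.

```python
import numpy as np

fs = 16_000     # sampling rate (Hz)
f_tone = 3_400  # top of the critical voice band (Hz)

t = np.arange(fs) / fs              # one second of sample times
x = np.sin(2 * np.pi * f_tone * t)  # the sampled tone

# The spectral peak lands exactly at 3,400 Hz because fs / 2 = 8,000 Hz > 3,400 Hz.
freqs = np.fft.rfftfreq(len(x), d=1 / fs)
peak_hz = freqs[np.argmax(np.abs(np.fft.rfft(x)))]
print(peak_hz)  # 3400.0
```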

Many wideband telephone systems and Voice over Internet Protocol (VoIP) services use 16 kHz because it captures the essential range of human speech while minimizing data usage. Voice AI applications, such as virtual assistants and dictation software, often use 16 kHz as it provides sufficient quality for accurate speech analysis. Even some audiobook and podcast platforms use 16 kHz to reduce file sizes and make content more accessible to users with limited bandwidth or storage.

Although 16 kHz is the accepted industry standard for voice AI, Picovoice Consulting works with enterprise customers to optimize the engines or audio inputs when custom solutions are needed.

Consult an Expert