The pandemic has overwhelmed the healthcare system. Even before the pandemic, physician burnout was a problem, and administrative tasks, i.e. paperwork, have been the primary cause. Medical Dictation addresses this problem and is one of the most widely adopted speech solutions. Yet recent advances make choosing and building Medical Dictation software challenging. Should you play it safe with Nuance, or go with one of the other Big Tech companies? How about innovative startups or open-source?

To help you answer this question, let’s look at what a Medical Dictation solution requires: HIPAA Compliance, High Accuracy and Fast Response Time.

HIPAA Compliance

First and foremost, Medical Dictation should be HIPAA-compliant. Enterprises using cloud speech APIs need to ensure:

  • Transmission & Infrastructure Security: The voice data should be encrypted, transferred securely and stored on secure servers.
  • Confidentiality: Anyone handling or accessing medical data should know how to work with PHI (Protected Health Information) and PII (Personally Identifiable Information).
  • Geo-Location And Geo-Fencing: Regulations may require voice data to be stored and processed in a specific geographic location.

There are several privacy-related questions enterprises should ask automated transcription API providers. Some cloud providers offer HIPAA compliance. For example, Google requires clients to work with their account managers to execute a Business Associate Agreement (BAA) and to opt out of sharing their data with Google for training purposes. Opting out costs enterprises 50% more than using the Google Speech-to-Text API and letting Google use their data. Google also leaves the responsibility of building and running HIPAA-compliant solutions to its clients. AWS offers a HIPAA-eligible transcription API as well, Amazon Transcribe Medical, and charges three times more for it than for the standard Amazon Transcribe models.
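To get a feel for what these multipliers mean for a budget, here is a minimal sketch. The base per-minute price and the monthly volume are made-up placeholders, not published vendor rates; only the two multipliers come from the figures above, with "three times more" read here as 3x the standard price.

```python
# Illustrative only: the base price and the volume are hypothetical
# placeholders, not vendor list prices.
BASE_PRICE_PER_MIN = 0.02  # assumed standard price, USD per audio minute

# Multipliers taken from the paragraph above.
OPT_OUT_MULTIPLIER = 1.5   # "50% more" when data is not shared for training
MEDICAL_MULTIPLIER = 3.0   # "three times more", read here as 3x standard


def monthly_cost(minutes: float, base_price: float, multiplier: float = 1.0) -> float:
    """Return the monthly bill in USD, rounded to cents."""
    return round(minutes * base_price * multiplier, 2)


if __name__ == "__main__":
    minutes = 50_000  # hypothetical monthly dictation volume
    print("standard model:      ", monthly_cost(minutes, BASE_PRICE_PER_MIN))
    print("data sharing opt-out:", monthly_cost(minutes, BASE_PRICE_PER_MIN, OPT_OUT_MULTIPLIER))
    print("medical model (3x):  ", monthly_cost(minutes, BASE_PRICE_PER_MIN, MEDICAL_MULTIPLIER))
```

Swapping in the current list prices and your own audio volume turns this into a quick comparison for your use case.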

On the other hand, on-device automated transcription solutions address privacy and compliance concerns by design. Voice data doesn’t leave the device, so it’s never transmitted to or stored on a third-party server. Nobody other than the owner accesses the data.

Picovoice and Nuance speech-to-text models are examples of enterprise-grade on-device automated transcription software; Kaldi and OpenAI’s Whisper are open-source examples.

Custom Models for High Accuracy

After addressing compliance needs, on-device automated transcription solutions and selected cloud transcription APIs remain as options for Medical Dictation. Open-source speech recognition models are not highly accurate for industry-specific jargon out of the box. To be fair, open-source models are generic by design and do not claim to understand industry-specific terminology. To customize standard open-source models with medical jargon, enterprises need to evaluate their:

  • Capabilities: Pharmaceutical names and diseases change and evolve. For example, Nuance, the best-known vendor with 20+ years of experience in healthcare, published a content package for COVID-19. Nobody can foresee the future. Hence, enterprises need in-house machine learning expertise to keep the models up-to-date.
  • Resources: Training, running and maintaining large speech models require significant computing power. Most enterprises don’t have access to large server farms. Hence, they should consider whether they can acquire the needed resources physically or virtually.

OpenAI’s Whisper excited many developers interested in medical transcription because it promises HIPAA compliance by processing voice data on-device. However, Whisper’s models range from 39 million to 1.6 billion parameters. This means that when enterprises need to add a new drug or disease name, just like the COVID-19 content, they need to re-train these large models.
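To put those parameter counts in perspective, the sketch below gives a back-of-the-envelope estimate of how much space the weights alone occupy. It assumes a fixed number of bytes per parameter and ignores everything else; re-training needs considerably more memory than this because gradients and optimizer state must be held as well.

```python
# Rough estimate of weight storage only. Real memory use during inference
# or re-training is higher (activations, gradients, optimizer state).
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weights_size_gb(num_params: int, dtype: str = "float32") -> float:
    """Return the approximate size of the model weights in gigabytes (1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    # Parameter counts quoted in the text above.
    for label, params in [("smallest", 39_000_000), ("largest", 1_600_000_000)]:
        for dtype in BYTES_PER_PARAM:
            print(f"{label:8s} {dtype:8s} ~{weights_size_gb(params, dtype):.3f} GB")
```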

Recruiting machine learning experts and acquiring large data sets and computing resources is neither easy nor affordable for every enterprise.


Fast Response Time

Humans are accustomed to real-time responses in human-to-human interactions. There is no perceptible latency in our conversations, and users expect the same from dictation software.

  • Real-time Transcription: Medical Dictation should use automated transcription models that can handle real-time transcription. Models that process data in batches are not a good fit for dictation use cases.
  • Predictable and minimal response time: Latency and unpredictable response time are inherent limitations of cloud computing. On-device speech models built on older technology have performance problems, too.
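One way to make "predictable" measurable is to compare the typical response time with the tail. The sketch below computes nearest-rank percentiles over a list of response times; the sample values are invented for illustration and are not measurements of any vendor.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile of `values` for an integer pct in 1..100."""
    if not values:
        raise ValueError("values must not be empty")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


if __name__ == "__main__":
    # Hypothetical response times in milliseconds; one request hit a slow path.
    latencies_ms = [120, 130, 125, 128, 900, 122, 127, 131, 126, 124]
    print("p50:", percentile(latencies_ms, 50), "ms")
    print("p95:", percentile(latencies_ms, 95), "ms")
```

A small gap between the median and the 95th percentile suggests predictable behavior; a large gap means some users will notice stalls even if the average looks fine.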

All reputable automated transcription model vendors offer streaming transcription models. Some open-source models, such as Whisper, do not. There are ways to make them work by passing in small audio snippets. However, it’s still an asynchronous process with a delay. Yet using a streaming model is not enough on its own. For example, Amazon Transcribe Medical runs on Amazon’s servers, making its response time unpredictable, since enterprises have no visibility into or control over Amazon’s or any other third party’s servers.
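The snippet-passing workaround can be sketched as follows. This is a simplified illustration, not how any particular library implements it: `fake_transcribe` is a hypothetical stand-in for a batch model, and the 16 kHz sample rate and 5-second chunk length are assumptions. The point is that each snippet has to be fully recorded before it can be transcribed, so the text always trails the speaker by at least one chunk length plus processing time.

```python
from typing import Callable, Iterator, List, Sequence

SAMPLE_RATE = 16_000  # assumed sample rate in Hz


def chunk_audio(samples: Sequence[int], chunk_seconds: float,
                sample_rate: int = SAMPLE_RATE) -> Iterator[Sequence[int]]:
    """Yield consecutive fixed-length snippets; the last one may be shorter."""
    chunk_len = int(chunk_seconds * sample_rate)
    if chunk_len <= 0:
        raise ValueError("chunk_seconds is too small for this sample rate")
    for start in range(0, len(samples), chunk_len):
        yield samples[start:start + chunk_len]


def pseudo_stream(samples: Sequence[int],
                  transcribe: Callable[[Sequence[int]], str],
                  chunk_seconds: float = 5.0,
                  sample_rate: int = SAMPLE_RATE) -> List[str]:
    """Run a batch transcriber on one snippet at a time."""
    return [transcribe(chunk) for chunk in chunk_audio(samples, chunk_seconds, sample_rate)]


def fake_transcribe(chunk: Sequence[int]) -> str:
    """Hypothetical stand-in for a batch speech-to-text model."""
    return f"<{len(chunk)} samples>"


if __name__ == "__main__":
    audio = [0] * (SAMPLE_RATE * 12)  # 12 seconds of silence as dummy input
    for text in pseudo_stream(audio, fake_transcribe):
        print(text)
```

Shorter snippets reduce the delay but give the model less context per call, which is why this approach remains a workaround rather than true streaming.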

The technology used for training speech models affects performance, and hence response time. For example, Picovoice’s automated transcription solutions run across platforms, including web browsers. Nuance’s mobile solution, however, cannot process voice data locally because it is too heavy to run on a mobile device.

Picovoice technology uses the latest deep learning techniques and offers a self-service web UI for training custom, accurate speech models that run on-device with zero network latency.

Are you interested in a HIPAA-compliant and accurate Medical Dictation solution powered by state-of-the-art on-device voice recognition technology? Customize Cheetah Streaming Speech-to-Text on Picovoice’s self-service Console for free! If you have a large volume of data, talk to enterprise sales!