Intent inference (detection) from spoken commands is at the core of any modern voice user interface (VUI). Typically, the spoken commands fall within a well-defined domain of interest, i.e., a context. For example:

  • Play Tom Sawyer album by Rush [Music]
  • Search for sandals that are under $40 [Retail]
  • Call John Smith on his cell phone [Conferencing]
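A detected intent is typically represented as an intent name plus extracted slot values. For the first example above, an illustrative result might look like the following; note that "playMusic", "albumName", and "artistName" are hypothetical names, not a fixed schema:

```python
# Illustrative inference result for "Play Tom Sawyer album by Rush".
# The actual intent and slot names are defined by the context used
# with the engine; these are made up for illustration.
inference = {
    "is_understood": True,
    "intent": "playMusic",
    "slots": {
        "albumName": "Tom Sawyer",
        "artistName": "Rush",
    },
}
```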

Why End-to-End?

Current solutions work in two steps. First, a speech-to-text (STT) engine transcribes speech to text. Second, a natural language understanding (NLU) engine detects the user's intent by analyzing the transcribed text. NLU implementations range from grammar-based regular expressions to complex probabilistic (statistical) models, or a mix of both. This approach requires significant compute resources because it runs an inherently large-vocabulary STT engine. It also yields suboptimal accuracy, since transcription errors introduced by the STT stage propagate to and impair the NLU stage.
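To make the two-step architecture concrete, here is a minimal sketch; `transcribe` is a stand-in for a real STT engine, and the single regular expression is a deliberately simple grammar-based NLU:

```python
import re

def transcribe(audio):
    # Stand-in for a large-vocabulary STT engine; a real engine returns
    # a (possibly erroneous) transcript of the audio.
    return "play tom sawyer album by rush"

def detect_intent(text):
    # Grammar-based NLU: one regular expression per intent.
    match = re.match(r"play (?P<albumName>.+) album by (?P<artistName>.+)", text)
    if match:
        return {"intent": "playMusic", "slots": match.groupdict()}
    return None  # intent not understood

# The NLU only sees the transcript, so any STT error breaks detection:
print(detect_intent(transcribe(None)))
print(detect_intent("play tom sawyer lp by rush"))  # STT error -> None
```

Because the NLU stage has no access to the audio, a single misrecognized word is enough to lose the intent entirely, which is the error compounding described above.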

The Picovoice Rhino Speech-to-Intent engine takes advantage of contextual information to create a bespoke, jointly-optimized STT and NLU engine for the domain of interest. The result is an end-to-end model that outperforms alternatives in accuracy and runtime efficiency. Additionally, it runs on-device and offline, improving privacy, latency, and cost-effectiveness.

Accuracy

We have benchmarked the accuracy of Picovoice Rhino Speech-to-Intent against Amazon Lex, Google Dialogflow, IBM Watson, and Microsoft LUIS. The test scenario is a VUI for a voice-enabled coffee maker; the details are available on the Rhino benchmark page. The significant accuracy improvement stems from the joint optimization performed when training the end-to-end intent inference model.

[Figure: Accuracy comparison of different conversational AI platforms: Amazon Lex, Google Dialogflow, IBM Watson, Microsoft LUIS, and Picovoice Rhino Speech-to-Intent]

Another open-source benchmark shows that Picovoice Rhino also beats Amazon Alexa. That study additionally evaluates other independent voice technology providers, such as Houndify by SoundHound and Snips, and discusses the limitations of each platform.

Cost Effectiveness

API-based pricing becomes prohibitively expensive as you gain user traction. For example, Google Dialogflow charges $0.0065 per API call. A voice-enabled application with 4 voice interactions per user per day costs $9.49 per user per year; 20 voice interactions a day cost $47.45, and 100 voice interactions a day push the cost up to $237.25. This unbounded operating cost can be untenable for device builders and app developers.
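The annual figures above follow directly from the per-call rate; a quick sanity check, assuming the $0.0065-per-request rate quoted above and 365 days of use:

```python
PRICE_PER_REQUEST = 0.0065  # USD, the Dialogflow rate quoted above

def annual_cost_per_user(interactions_per_day, days_per_year=365):
    # API cost grows linearly and without bound as usage grows.
    return round(interactions_per_day * days_per_year * PRICE_PER_REQUEST, 2)

print(annual_cost_per_user(4))    # 9.49
print(annual_cost_per_user(20))   # 47.45
print(annual_cost_per_user(100))  # 237.25
```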

Picovoice Rhino Speech-to-Intent offers unlimited voice interactions per user for a flat fee.

The figure below provides a more comprehensive comparison of Picovoice vs Amazon Lex, Microsoft LUIS, Google Dialogflow, and IBM Watson.

[Figure: Pricing comparison of different conversational AI platforms: Amazon Lex, Google Dialogflow, IBM Watson, Microsoft LUIS, and Picovoice Rhino Speech-to-Intent]

Latency

Latency is an inherent limitation of cloud-based solutions: the total delay is unpredictable and depends on the quality of the network connection and the server-side load. Because Rhino runs on-device, it offers a reliable response time for detecting users' intent.

Start Building

Start building with Rhino Speech-to-Intent for free.

rhino = pvrhino.create(
    access_key,
    context_path)
while not rhino.process(audio_frame()):
    pass
inference = rhino.get_inference()
Build with Python
let rhino = new Rhino(
  accessKey,
  contextPath);
while (!rhino.process(audioFrame())) { }
let inference = rhino.getInference();
Build with NodeJS
RhinoManager rhinoManager = new RhinoManager.Builder()
    .setAccessKey(accessKey)
    .setContextPath(contextPath)
    .build(
        appContext,
        new RhinoManagerCallback() {
            @Override
            public void invoke(RhinoInference inference) {
                // Inference callback
            }
        }
    );
rhinoManager.start();
Build with Android
let rhinoManager = RhinoManager(
    accessKey: accessKey,
    contextPath: contextPath,
    onInferenceCallback: { inference in
        // Inference callback
    })
try rhinoManager.start()
Build with iOS
const {
  inference,
  contextInfo,
  isLoaded,
  isListening,
  error,
  init,
  process,
  release,
} = useRhino();

useEffect(() => {
  // Inference callback
}, [inference]);

await init(
  accessKey,
  context,
  model
);
await process();
Build with React
RhinoManager rhinoManager = await RhinoManager.create(
    accessKey,
    contextPath,
    (inference) => {
      // Inference callback
    });
await rhinoManager.process();
Build with Flutter
let rhinoManager = await RhinoManager.create(
  accessKey,
  contextPath,
  (inference) => {
    // Inference callback
  });
await rhinoManager.process();
Build with React Native
RhinoManager rhinoManager = RhinoManager.Create(
    accessKey,
    contextPath,
    (inference) => {
        // Inference callback
    });
rhinoManager.Start();
Build with Unity
constructor(private rhinoService: RhinoService) {
  this.inferenceDetection = rhinoService.inference$.subscribe(
    inference => {
      // Inference callback
    }
  );
}

async ngOnInit() {
  await this.rhinoService.init(
    accessKey,
    context,
    model
  );
}
Build with Angular
{
  data() {
    const {
      state,
      init,
      process,
      release
    } = useRhino();
    init(
      accessKey,
      context,
      model
    );
    return {
      state,
      process,
      release
    };
  },
  watch: {
    "state.inference": function (inference) {
      if (inference !== null) {
        // Inference callback
      }
    }
  }
}
Build with Vue
Rhino rhino = Rhino.Create(
    accessKey,
    contextPath);
while (!rhino.Process(AudioFrame())) { }
Inference inference = rhino.GetInference();
Build with .NET
Rhino rhino = new Rhino.Builder()
    .setAccessKey(accessKey)
    .setContextPath(contextPath)
    .build();
while (!rhino.process(audioFrame())) { }
RhinoInference inference = rhino.getInference();
Build with Java
rhino := NewRhino(
    accessKey,
    contextPath)
err := rhino.Init()
for {
    isFinalized, err := rhino.Process(AudioFrame())
    if isFinalized {
        break
    }
}
inference, err := rhino.GetInference()
Build with Go
let rhino: Rhino = RhinoBuilder::new(
    access_key,
    context_path,
)
.init()
.expect("failed to create Rhino");
loop {
    if let Ok(is_finalized) = rhino.process(&audio_frame()) {
        if is_finalized {
            if let Ok(inference) = rhino.get_inference() {
                // Inference callback
            }
        }
    }
}
Build with Rust
pv_rhino_init(
    access_key,
    model_path,
    context_path,
    sensitivity,
    require_endpoint,
    &rhino);
while (true) {
    pv_rhino_process(
        rhino,
        audio_frame(),
        &is_finalized);
    if (is_finalized) {
        pv_rhino_get_intent(
            rhino,
            &intent,
            &num_slots,
            &slots,
            &values);
    }
}
Build with C