Links: Generative AI picture display

Models

Whisper

Whisper is an open-source automatic speech recognition (ASR) model developed by OpenAI that converts spoken language into written text.
The process works as follows:

  • Input audio is split into 30-second chunks and converted into log-Mel spectrograms.
  • The encoder processes these spectrograms to extract speech patterns and features.
  • The decoder predicts the corresponding text, incorporating special tokens for various tasks.
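The first step above can be sketched in plain Python. This is a minimal illustration, not Whisper's actual code: the function name split_into_chunks is hypothetical, audio is assumed to be a list of mono float samples at 16 kHz, and the log-Mel conversion is left out. Fixed back-to-back windows are also a simplification; real implementations may choose window boundaries differently.

```python
SAMPLE_RATE = 16_000                          # Hz; assumes 16 kHz mono input
CHUNK_SECONDS = 30                            # length of one input window
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS   # 480,000 samples per chunk


def split_into_chunks(samples):
    """Split audio samples into 30-second chunks, zero-padding the last one.

    Hypothetical helper for illustration only.
    """
    chunks = []
    for start in range(0, len(samples), CHUNK_SAMPLES):
        chunk = list(samples[start:start + CHUNK_SAMPLES])
        # Pad the final chunk with silence so every chunk has the same length.
        chunk += [0.0] * (CHUNK_SAMPLES - len(chunk))
        chunks.append(chunk)
    return chunks
```

With this sketch, 45 seconds of audio gives two chunks: one full chunk and one that is half audio, half padding.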

It supports transcription in many languages and can translate speech from other languages into English.
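The language and task are selected through the special tokens at the start of the decoder input. The sketch below only builds that token sequence as strings to show the idea; build_decoder_prompt is a hypothetical helper, and a real tokenizer works with token ids rather than text.

```python
def build_decoder_prompt(language, task="transcribe", timestamps=False):
    """Return the special tokens that tell the decoder what to do.

    Hypothetical helper for illustration; `language` is a code such as "en" or "fr".
    """
    if task not in ("transcribe", "translate"):
        raise ValueError("task must be 'transcribe' or 'translate'")
    tokens = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        # Ask the decoder to emit plain text without timestamp tokens.
        tokens.append("<|notimestamps|>")
    return tokens
```

For example, build_decoder_prompt("fr", "translate") describes French audio that should come out as English text.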

For automatic activation with a wake word, a bit more setup is needed; see this discussion on GitHub.
To detect when a command ends, Voice Activity Detection (VAD) is also needed.
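To give an idea of what end-of-command detection involves, here is a minimal energy-based sketch. It is a deliberately simple stand-in, not the VAD used by Whisper or any particular library: the function names, the threshold and the frame counts are all assumptions, with frames taken to be short lists of float samples (for instance 20 ms each).

```python
import math


def rms(frame):
    """Root-mean-square energy of one frame of samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def find_command_end(frames, threshold=0.01, max_silent_frames=25):
    """Return the index of the frame where the command is considered over.

    The command ends once speech has started and is followed by
    `max_silent_frames` consecutive quiet frames. Returns None if that
    never happens. All parameter values are illustrative assumptions.
    """
    speech_started = False
    silent_run = 0
    for i, frame in enumerate(frames):
        if rms(frame) >= threshold:
            speech_started = True
            silent_run = 0          # speech resets the silence counter
        elif speech_started:
            silent_run += 1
            if silent_run >= max_silent_frames:
                return i
    return None
```

With 20 ms frames, max_silent_frames=25 means the command is cut off after about half a second of silence. A fixed energy threshold is fragile in noisy rooms, which is why a dedicated VAD model is usually preferred.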

whisper.cpp

whisper.cpp is a port of OpenAI’s Whisper model in C/C++.
It allows for better performance and faster responses.
It runs pretty well on a Raspberry Pi with its smaller models.
The project provides a basic voice assistant example that implements a wake word.