Links: Generative AI picture display
Models
Whisper
Whisper is an open-source automatic speech recognition (ASR) model developed by OpenAI that converts spoken language into written text.
The process works as follows:
- Input audio is split into 30-second chunks and converted into log-Mel spectrograms.
- The encoder processes these spectrograms to extract speech patterns and features.
- The decoder predicts the corresponding text, incorporating special tokens for various tasks.
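The first step can be sketched in plain Python. This is only an illustration of the chunking, not Whisper's own code: the function names are made up for this note, and the constants (16 kHz sample rate, 160-sample hop between spectrogram frames) are the values used by the reference implementation as far as I know.

```python
SAMPLE_RATE = 16_000   # assumed: Whisper works on 16 kHz mono audio
CHUNK_SECONDS = 30     # fixed window length described above
HOP_LENGTH = 160       # assumed: samples between spectrogram frames (10 ms)


def split_into_chunks(samples, sample_rate=SAMPLE_RATE, chunk_seconds=CHUNK_SECONDS):
    """Split samples into fixed-length chunks, zero-padding the last one."""
    chunk_len = sample_rate * chunk_seconds
    chunks = []
    for start in range(0, len(samples), chunk_len):
        chunk = list(samples[start:start + chunk_len])
        # Pad a short final chunk with silence so every chunk has the same length.
        chunk.extend([0.0] * (chunk_len - len(chunk)))
        chunks.append(chunk)
    return chunks


def frames_per_chunk(sample_rate=SAMPLE_RATE, chunk_seconds=CHUNK_SECONDS,
                     hop_length=HOP_LENGTH):
    """Number of spectrogram frames (time steps) produced for one chunk."""
    return sample_rate * chunk_seconds // hop_length


audio = [0.1] * (SAMPLE_RATE * 70)  # 70 seconds of dummy audio
chunks = split_into_chunks(audio)
print(len(chunks), len(chunks[0]), frames_per_chunk())  # 3 480000 3000
```

Each padded chunk is then turned into a log-Mel spectrogram, which is what the encoder actually sees.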
It supports transcription in many languages and can translate speech into English.
Automatic activation with a wake word needs a bit more setup; see this discussion on GitHub.
Detecting when a command ends also requires Voice Activity Detection (VAD).
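To give an idea of what end-of-command detection involves, here is a minimal energy-threshold sketch. It is not what Whisper or whisper.cpp use, and dedicated VAD models are much more robust to background noise. The function name, the threshold and the 600 ms silence window are all assumptions chosen for illustration.

```python
import math


def rms(frame):
    """Root-mean-square energy of one frame of samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def find_end_of_speech(samples, sample_rate=16_000, frame_ms=30,
                       threshold=0.01, silence_ms=600):
    """Return the sample index where speech ended, or None if no end was found.

    The end is reported once speech has been heard and is followed by
    `silence_ms` of consecutive frames whose energy is below `threshold`.
    """
    frame_len = sample_rate * frame_ms // 1000
    needed = silence_ms // frame_ms  # consecutive silent frames required
    silent_run = 0
    heard_speech = False
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        if rms(frame) >= threshold:
            heard_speech = True
            silent_run = 0
        elif heard_speech:
            silent_run += 1
            if silent_run >= needed:
                # Report the start of the silent run, not the current frame.
                return start - (silent_run - 1) * frame_len
    return None


# 0.9 s of loud dummy "speech" followed by 0.9 s of silence
audio = [0.5] * 14_400 + [0.0] * 14_400
print(find_end_of_speech(audio))  # 14400
```

In a voice assistant, the audio up to the returned index would be handed to Whisper for transcription.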
whisper.cpp
whisper.cpp is a C/C++ port of OpenAI’s Whisper model.
It allows for better performance and faster responses.
It runs quite well on a Raspberry Pi with its smaller models.
The project provides a basic voice assistant example that implements a wake word.
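One simple way to handle a wake word once the audio has been transcribed is to check whether the text starts with it. The sketch below is a guess at the general idea, not the code from the whisper.cpp example; `extract_command` and the wake word "hey computer" are hypothetical.

```python
import re


def extract_command(transcript, wake_word="hey computer"):
    """Return the text following the wake word, or None if it is absent."""
    # Lowercase and drop punctuation, since transcripts vary in both.
    normalized = re.sub(r"[^\w\s]", "", transcript.lower())
    normalized = " ".join(normalized.split())
    if normalized == wake_word:
        return ""  # wake word heard, but no command yet
    prefix = wake_word + " "
    if normalized.startswith(prefix):
        return normalized[len(prefix):]
    return None


print(extract_command("Hey, computer! Turn on the lights."))  # turn on the lights
print(extract_command("Turn on the lights."))                 # None
```

Exact prefix matching is brittle when the wake word is transcribed slightly differently, so a real setup would probably want some fuzzy matching.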