TensorFlow Lite MicroSpeech

Phil Schatzmann edited this page Oct 5, 2024 · 43 revisions

The starting point for speech recognition on an Arduino-based board is TensorFlow Lite for Microcontrollers with the example sketch called micro_speech.

I have adapted the MicroSpeech example from TensorFlow Lite to follow the philosophy of this framework. The example uses a TensorFlow model which can recognise the words 'yes' and 'no'. The output stream class is TfLiteAudioStream. In the example I am using an ESP32 AudioKit board, but you can replace this with any type of processor that has a microphone.

The Arduino Sketch

Here is the complete Arduino Sketch:

#include "AudioTools.h"
#include "AudioTools/AudioLibs/AudioKit.h"
#include "AudioTools/AudioLibs/TfLiteAudioStream.h"
#include "model.h"  // tensorflow model

AudioKitStream in;  // Audio source
TfLiteAudioStream tfl;  // Audio sink
const char* kCategoryLabels[4] = {
    "silence",
    "unknown",
    "yes",
    "no",
};
StreamCopy copier(tfl, in);  // copy mic to tfl
int channels = 1;
int samples_per_second = 16000;

// Command callback handler
void respondToCommand(const char* found_command, uint8_t score,
                      bool is_new_command) {
  if (is_new_command) {
    char buffer[80];
    sprintf(buffer, "Result: %s, score: %d, is_new: %s", found_command, score,
            is_new_command ? "true" : "false");
    Serial.println(buffer);
  }
}

void setup() {
  Serial.begin(115200);
  AudioLogger::instance().begin(Serial, AudioLogger::Warning);

  // input from Audiokit microphone
  auto cfg = in.defaultConfig(RX_MODE);
  cfg.input_device = AUDIO_HAL_ADC_INPUT_LINE2;
  cfg.channels = channels;
  cfg.sample_rate = samples_per_second;
  cfg.use_apll = false;
  cfg.auto_clear = true;
  cfg.buffer_size = 512;
  cfg.buffer_count = 16;
  in.begin(cfg);

  // output to tensorflow
  auto tcfg = tfl.defaultConfig();
  tcfg.channels = channels;
  tcfg.sample_rate = samples_per_second;
  tcfg.kTensorArenaSize = 10 * 1024;
  tcfg.respondToCommand = respondToCommand;
  tcfg.model = g_model;
  tcfg.setCategories(kCategoryLabels);
  tfl.begin(tcfg);
}

void loop() { copier.copy(); }

The key information that must be provided to TensorFlow as configuration is:

  • number of channels
  • sample rate
  • kTensorArenaSize
  • a callback for handling the responses (respondToCommand)
  • the model
  • the labels

Like in any other audio sketch, we just need to copy the data from the input to the output class.
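
The callback is where the application reacts to a recognised word. As a minimal sketch, assuming you only want to act on confident 'yes' and 'no' results, a handler with the same signature as respondToCommand could look like the one below. The names handleCommand, Action, last_action and kMinScore are hypothetical and not part of the library, and the threshold value is an arbitrary choice for illustration.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical application state that the callback updates.
enum class Action { None, Confirm, Reject };
Action last_action = Action::None;

// Assumption: results with a lower score are too uncertain to act on.
// The value is illustrative only and should be tuned for your model.
constexpr uint8_t kMinScore = 200;

// Same signature as respondToCommand in the sketch above.
void handleCommand(const char* found_command, uint8_t score,
                   bool is_new_command) {
  // Ignore repeated reports and low-confidence results.
  if (!is_new_command || score < kMinScore) return;

  if (std::strcmp(found_command, "yes") == 0) {
    last_action = Action::Confirm;
  } else if (std::strcmp(found_command, "no") == 0) {
    last_action = Action::Reject;
  }
  // Any other label (e.g. silence or unknown) leaves the state unchanged.
}
```

To try it, assign the function to tcfg.respondToCommand in place of respondToCommand.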

Overall Processing Logic

The TfLiteAudioStream class splits the incoming audio into slices (defined by kFeatureSliceStrideMs and kFeatureSliceDurationMs) and runs a Fast Fourier Transform (FFT) on each slice. Each FFT result is an array of frequency values with a length of kFeatureSliceSize. These results are used to update a spectrogram of kFeatureSliceSize x kFeatureSliceCount values. After 2 (kSlicesToProcess) new FFT results have been added to the end, TensorFlow evaluates the updated spectrogram to calculate the classification result. The results are post-processed (in the TfLiteRecognizeCommands class) to make sure that the reported result is stable.
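
The rolling update of the spectrogram can be illustrated with a small stand-alone sketch. This is not the library's implementation: the class RollingSpectrogram is hypothetical, and the constants are tiny placeholder values that stand in for kFeatureSliceSize, kFeatureSliceCount and kSlicesToProcess so that the behaviour is easy to follow. Each new slice pushes out the oldest one, and the caller is told when enough new slices have arrived to run the model.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Placeholder values for illustration; the real ones come from the config.
constexpr std::size_t kSliceSize = 4;        // stands in for kFeatureSliceSize
constexpr std::size_t kSliceCount = 3;       // stands in for kFeatureSliceCount
constexpr std::size_t kSlicesToProcess = 2;  // new slices needed per inference

// Hypothetical helper that keeps the newest kSliceCount feature slices.
class RollingSpectrogram {
 public:
  RollingSpectrogram() : data_(kSliceSize * kSliceCount, 0) {}

  // Drops the oldest slice and appends the new one at the end.
  // Returns true when kSlicesToProcess new slices have been collected,
  // i.e. when the model should evaluate the spectrogram again.
  bool addSlice(const std::vector<int8_t>& slice) {
    assert(slice.size() == kSliceSize);
    data_.erase(data_.begin(), data_.begin() + kSliceSize);
    data_.insert(data_.end(), slice.begin(), slice.end());
    if (++pending_ < kSlicesToProcess) return false;
    pending_ = 0;
    return true;
  }

  // Flat spectrogram, oldest slice first.
  const std::vector<int8_t>& data() const { return data_; }

 private:
  std::vector<int8_t> data_;
  std::size_t pending_ = 0;  // slices added since the last inference
};
```

With these placeholder values, every second call to addSlice returns true, which mirrors the "evaluate after 2 new FFT results" rule described above.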


Building the Tensorflow Model

Here is the relevant Jupyter notebook. I am also providing the necessary files to run it in Docker: just execute docker-compose up and connect to http://localhost:8888.


The full example can be found on GitHub.