Dec 22

Web Speech API

Learn how to listen to and respond to speech in your application using the Web Speech API.

By Stephanie Eckles

Did you know your browser has speech recognition capabilities? (Well, except Firefox for now).

Similar to the MediaStreams API we learned on Day 6, this API will trigger a prompt asking the user's permission to access their microphone. Then we can set up logic that recognizes words and sends back a response within a chat-like interface.

To help us learn this API, we'll be building a "Chat with Santa" application.

Intro to the Web Speech API

The API has two kinds of functionality: recognition, and synthesis (aka text-to-speech, aka speaking text aloud). It's important to know that the current implementations in Chromium and Safari/WebKit do not allow offline access, so the service would not be available to progressive web apps (PWAs) when offline.

Additionally, the API is technically in experimental status, which you'll see reflected in the MDN browser compatibility table. But we're here to celebrate what the web is capable of, and this is certainly still an impressive bit of functionality!
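Recognition is the focus of this article, but as a quick taste of the synthesis half, speaking text aloud takes only a couple of lines. This is a minimal sketch; the guard simply lets it degrade gracefully in environments where the API is unavailable:

```javascript
// Minimal speech synthesis sketch: speak a phrase aloud where supported
const phrase = "Ho ho ho - Merry Christmas!";

if (typeof window !== "undefined" && "speechSynthesis" in window) {
  const utterance = new SpeechSynthesisUtterance(phrase);
  utterance.lang = "en-US";
  window.speechSynthesis.speak(utterance);
} else {
  console.log("Speech synthesis is not available in this environment.");
}
```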

"Chat with Santa" features

For our voice-enabled chat with Santa, we'll need the following functionality:

  • a button to start and stop listening for speech
  • speech recognition of what the user says, displayed as a chat message
  • a matching response from Santa based on recognized words and phrases
  • automatic re-listening so the chat can continue until the user says "goodbye"

Chat application HTML and CSS

As we've done throughout the series, we'll start with the markup and styles.

First up, our HTML, which is simply a container, a headline, a button to start and stop listening, and an unordered list to populate with the chat messages.

<div class="santa-chat-app">
  <h2>Chat with Santa</h2>
  <button class="santa-listen" data-state="inactive">Start listening</button>
  <ul class="santa-chat"></ul>
</div>

We'll be using mostly flexbox to align our chat elements, and just a touch of flourish provided by a pseudo-element and box-shadow to make the chat messages stand out.

There are two bits of state we'll style related to when the app is listening for speech. We'll update the color of the button, and add a duo-tone green box-shadow to the chat log as an extra visual cue that listening is active.

The style demo has added the active class to preview that style.

Santa chat styles
.santa-chat-app {
  --santa-red-hue: 10;
  --santa-green-hue: 120;
  --santa-saturation: 90%;
  --santa-red: hsl(var(--santa-red-hue), var(--santa-saturation), 45%);
  --santa-green: hsl(var(--santa-green-hue), var(--santa-saturation), 35%);
  --santa-green-light: hsl(var(--santa-green-hue), var(--santa-saturation), 85%);

  display: grid;
  gap: 2rem;
  justify-items: start;
  border: 2px solid var(--santa-red);
  border-radius: 0.5rem;
  padding: 2rem clamp(1.5rem, 5%, 2.5rem);
  font-family: system-ui, sans-serif;
}

.santa-chat-app h2 {
  margin: 0;
  color: var(--santa-red);
}

.santa-listen {
  all: unset;
  cursor: pointer;
  color: hsl(var(--santa-green-hue), var(--santa-saturation), 25%);
  border-radius: 0.5rem;
  border: 2px solid;
  padding: 0.25em 0.5em;
}

.santa-listen:is(:focus, :focus-visible) {
  outline: 2px dashed currentColor !important;
  outline-offset: 2px;
}

.santa-listen[data-state=listening] {
  color: var(--santa-red);
}

.santa-chat {
  list-style: none;
  margin: 0;
  padding: 1.25rem 1.5rem;
  justify-self: stretch;
  height: 20em;
  overflow-y: auto;
  border: 1px dashed grey;
  border-radius: 0.5rem;
  display: flex;
  flex-direction: column;
  justify-content: flex-end;
  gap: 0.75rem;
  color: #222;
  font-size: 1.15rem;
}

.santa-chat.active {
  border-color: green;
  box-shadow: 0 0 0 4px var(--santa-green-light), 0 0 0 6px var(--santa-green);
}

.message {
  position: relative;
  border-radius: 0.75rem;
  padding: 0.5rem 0.75rem;
}

.message::after {
  content: "";
  position: absolute;
  top: 50%;
  right: 0;
  transform: translate(100%, -50%);
  border: 0.5em solid transparent;
}

.message--user {
  align-self: flex-end;
  background-color: var(--santa-green-light);
  border: 1px solid var(--santa-green);
  box-shadow: 2px 2px 0 var(--santa-green);
}

.message--user::after {
  border-left-color: var(--santa-green);
}

.message--santa {
  align-self: flex-start;
  display: inline-flex;
  align-items: center;
  gap: 0.25em;
  background-color: hsl(var(--santa-red-hue), var(--santa-saturation), 85%);
  border: 1px solid var(--santa-red);
  box-shadow: -2px 2px 0 var(--santa-red);
}

.message--santa::after {
  border-right-color: var(--santa-red);
  left: 0;
  transform: translate(-100%, -50%);
}

.message--santa::before {
  content: "πŸŽ…";
  font-size: 1.15em;
  transform: translateY(-5%);
}

[Live style demo: the chat log previews "Hello, Santa!" from the user and "Ho ho ho - have you been good this year?" from Santa]

Adding Speech Recognition

To prepare our app for speech recognition, we'll initiate a connection as recommended by the MDN docs, including setting our language.

// Configure SpeechRecognition
const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
const recognition = new SpeechRecognition();
recognition.lang = "en-US";
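Since Firefox doesn't currently support recognition, it's reasonable to feature-detect before wiring anything up. This is a sketch; in a real app you might hide the listen button entirely when unsupported:

```javascript
// Chromium exposes the prefixed webkitSpeechRecognition; Firefox exposes neither.
const supportsRecognition =
  typeof window !== "undefined" &&
  ("SpeechRecognition" in window || "webkitSpeechRecognition" in window);

console.log(
  supportsRecognition
    ? "Speech recognition is available"
    : "Speech recognition is not available in this environment"
);
```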

Then we'll access the two elements we'll be manipulating - the listen button and the unordered list - as well as create our custom activate and deactivate functions.

// Access chat elements
const listenButton = document.querySelector(".santa-listen");
const speechLog = document.querySelector(".santa-chat");

const activateChat = () => {
  recognition.start();
  listenButton.dataset.state = "listening";
  listenButton.textContent = "Stop listening";
};

const deactivateChat = () => {
  recognition.stop();
  speechLog.classList.remove("active");
  listenButton.dataset.state = "inactive";
  listenButton.textContent = "Start listening";
};

The start() and stop() methods are part of the speech API. We're also toggling the state of the listening button by altering our custom data-state attribute and the actual button text.

You may have noticed we remove the active class for deactivation but don't set it as part of activation. That's because there's an API event we'll use for attaching that state instead - onstart. Functionally for our app, we could place it within activateChat(), but it's useful to know about the onstart event.

recognition.onstart = () => {
  speechLog.classList.add("active");
};

In addition to onstart, we have two events that can signal that the API has stopped listening. The first is onspeechend, which fires when the user actually stops talking. The other is onend, triggered when the API hits a timeout, which in Chrome, for example, seems to be 60 seconds. This timeout seems to be more of a "session" limit before the API requires the user to re-initiate the action to start recognition.

We'll set both of these events to use deactivateChat() so that the app doesn't falsely appear to be listening.

recognition.onspeechend = () => {
  deactivateChat();
};

recognition.onend = () => {
  deactivateChat();
};

Speaking of the user initiating recognition - it's time to finally hook up our listening button. We'll use a simple click handler and check for the current data-state to continue with either activating or deactivating recognition.

listenButton.addEventListener("click", () => {
  const state = listenButton.dataset.state;

  if (state === "inactive") {
    activateChat();
  } else {
    deactivateChat();
  }
});

Providing the user's speech result

At this stage, you could test the app and you would get the prompt for permission to access your microphone. But we still need to connect the pieces that output what you say and have Santa give a response.

This will be done via the onresult event of the speech API. This is triggered when the recognition determines that a user has stopped speaking and has evaluated the speech. The important part here is the returned transcript value.

recognition.onresult = function (event) {
  // get the results of the speech recognition
  const transcript = event.results[0][0].transcript;
};
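The double index is because event.results is a list of results, each holding one or more alternatives, and every alternative carries the transcript plus a confidence score from 0 to 1. Mocking that shape (hypothetical data, not a real event) makes the indexing clearer:

```javascript
// Mocked shape of a recognition event; real events come from the API
const mockEvent = {
  results: [[{ transcript: "hello santa", confidence: 0.92 }]],
};

const { transcript, confidence } = mockEvent.results[0][0];
console.log(transcript);  // "hello santa"
console.log(confidence);  // 0.92
```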

Once we have the transcript value, we'll output that back to the user so they can see what was "heard" by the recognition. We'll do this by creating a new list element with the user classes and appending it to the list.

// Create element for user message
const newMessage = document.createElement("li");
newMessage.classList.add("message", "message--user");
const newMessageTranscript = document.createTextNode(transcript);
newMessage.appendChild(newMessageTranscript);

// Add transcript to DOM
speechLog.append(newMessage);

Responding to recognized words and phrases

To create the illusion of Santa typing back, we'll set a small timeout then pass the transcript off to a function we'll create to handle the responses. This should be placed as the last thing in the onresult event function.

setTimeout(() => {
  handleSantaResponse(transcript);
}, 500);

We'll kickstart the response function by adding a new list element populated with ellipses to allude to Santa "typing" a new message.

const handleSantaResponse = (transcript) => {
  const newResponse = document.createElement("li");
  newResponse.classList.add("message", "message--santa");
  const typing = document.createTextNode("...");
  newResponse.appendChild(typing);
  speechLog.append(newResponse);
};

Then, to better be able to match against the transcript, we'll transform it to lowercase, remove punctuation, and trim extra whitespace.

transcript = transcript
  .toLowerCase()
  .replace(/[.?,!]/g, "")
  .trim();
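Pulled out as a standalone helper (hypothetical name, for illustration only), the clean-up behaves like this:

```javascript
// Lowercase, strip common punctuation, and trim surrounding whitespace
const normalizeTranscript = (transcript) =>
  transcript.toLowerCase().replace(/[.?,!]/g, "").trim();

console.log(normalizeTranscript("  Hello, Santa!  ")); // "hello santa"
```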

To prepare for responding, we'll set up a default response for when there are no matches. We'll also have a default delay and a boolean to indicate whether we should continue listening. The delay will be modified to add a touch of "realness", as if Santa is thinking about his response.

let santaResponse = "Happy Holidays!";
let delay = 800;
let continueChat = true;

Then we'll run a series of simplistic searches for words and phrases that alter santaResponse and the other variables. Here's the initial one, anticipating a greeting.

if (transcript.includes("hello") || transcript.includes("hi")) {
  santaResponse = "Ho ho ho - have you been good this year?";
  delay = 1200;
}
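One caveat: includes() does substring matching, so checking for "hi" will also match words like "this". If that causes false positives in your own matches, a word-boundary regex is a stricter alternative (hypothetical helper, not part of the demo):

```javascript
// Match a whole word rather than any substring
const hasWord = (transcript, word) =>
  new RegExp(`\\b${word}\\b`).test(transcript);

console.log(hasWord("this is fun", "hi")); // false
console.log(hasWord("hi santa", "hi"));    // true
```

Later on, though, the demo leans on the substring behavior deliberately so that "bye" also matches "goodbye".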

In the full demo you can see the other types of prepared responses. But I recommend keeping it hidden until you've tried out the chat first πŸ˜‰

After our prepared responses, we update the textContent of the list item we created and populated with ellipses earlier.

setTimeout(() => {
  newResponse.textContent = santaResponse;

  if (continueChat) {
    activateChat();
  } else {
    deactivateChat();
  }
}, delay);

We also automatically begin speech recognition as long as continueChat is true. The one instance where continueChat would not be true is when the user says "goodbye" or "bye". However, the API may experience a session timeout before that point which would trigger the onend event and the user would have to re-initiate the chat via the listen button.

That completes the setup for our Santa chat app! Try it out in the demo.

In my testing, you'll get the best results in Chromium. Safari seemed to take longer to respond and sometimes provided duplications in the transcript.

Demo of Santa Chat speech recognition application
// Configure SpeechRecognition
const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
const recognition = new SpeechRecognition();
recognition.lang = "en-US";

// Access chat elements
const listenButton = document.querySelector("#demo .santa-listen");
const speechLog = document.querySelector("#demo .santa-chat");

const activateChat = () => {
  recognition.start();
  listenButton.dataset.state = "listening";
  listenButton.textContent = "Stop listening";
};

const deactivateChat = () => {
  recognition.stop();
  speechLog.classList.remove("active");
  listenButton.dataset.state = "inactive";
  listenButton.textContent = "Start listening";
};

recognition.onstart = () => {
  speechLog.classList.add("active");
};

recognition.onspeechend = () => {
  deactivateChat();
};

recognition.onend = () => {
  deactivateChat();
};

listenButton.addEventListener("click", () => {
  const state = listenButton.dataset.state;

  if (state === "inactive") {
    activateChat();
  } else {
    deactivateChat();
  }
});

recognition.onresult = function (event) {
  // get the results of the speech recognition
  const transcript = event.results[0][0].transcript;

  // Create element for user message
  const newMessage = document.createElement("li");
  newMessage.classList.add("message", "message--user");
  const newMessageTranscript = document.createTextNode(transcript);
  newMessage.appendChild(newMessageTranscript);

  // Add transcript to DOM
  speechLog.append(newMessage);

  setTimeout(() => {
    handleSantaResponse(transcript);
  }, 500);
};

const handleSantaResponse = (transcript) => {
  const newResponse = document.createElement("li");
  newResponse.classList.add("message", "message--santa");
  const typing = document.createTextNode("...");
  newResponse.appendChild(typing);
  speechLog.append(newResponse);

  transcript = transcript
    .toLowerCase()
    .replace(/[.?,!]/g, "")
    .trim();

  let santaResponse = "Happy Holidays!";
  let delay = 800;
  let continueChat = true;

  if (transcript.includes("hello") || transcript.includes("hi")) {
    santaResponse = "Ho ho ho - have you been good this year?";
    delay = 1200;
  }

  if (
    transcript === "yes" ||
    transcript.includes("been good") ||
    transcript.includes("yes i have") ||
    transcript.includes("i guess so") ||
    transcript.includes("maybe")
  ) {
    santaResponse = "Oh, jolly good! You'll have a full stocking!";
    delay = 1000;
  }

  if (
    transcript === "no" ||
    transcript.includes("been bad") ||
    transcript.includes("no i") ||
    transcript.includes("naughty")
  ) {
    santaResponse = "Oh dear! Rudolph will be sorry to hear that.";
    delay = 1000;
  }

  if (transcript.includes("thank you")) {
    santaResponse = "Of course! Anything else you want?";
  }

  if (
    transcript.includes("how are you") ||
    transcript.includes("you doing") ||
    transcript.includes("you busy")
  ) {
    santaResponse = "Feeling jolly! What do you want this year?";
    delay = 1200;
  }

  if (
    transcript.includes("i want") ||
    transcript.includes("i'd like") ||
    transcript.includes("i would like") ||
    transcript.includes("can i have")
  ) {
    santaResponse = "That's quite a list, the elves and I will do our best!";
    delay = 1200;
  }

  // includes does partial matching so this also works for "goodbye"
  if (transcript.includes("bye")) {
    santaResponse = "Ho ho ho - See ya real soon! Don't forget my cookies!";
    continueChat = false;
  }

  setTimeout(() => {
    newResponse.textContent = santaResponse;

    if (continueChat) {
      activateChat();
    } else {
      deactivateChat();
    }
  }, delay);
};


Additional resources for the Web Speech API

Given its experimental status and inability to work offline, the Speech API isn't really ready to perform critical functions. But it can be fun for personal projects or for a limited, well-known audience.