Building Voice-Controlled UIs with AI Parsing

When your hands are wrapped around a barbell, the last thing you want is to fumble with your phone. This is the story of how I built voice-controlled workout logging using the Web Speech API and Google's Gemini AI.

The Problem

Logging workouts while exercising is awkward:

Sweaty hands make touchscreens unreliable
Mid-set, you can't stop to type
Small input fields are frustrating when you're fatigued
You just want to say "185 for 8" and move on

Traditional form inputs fail in physical contexts. Voice is the natural solution, but natural language is messy.

The Architecture

The system has three layers:

Voice Capture - Web Speech API captures spoken input
AI Parsing - Gemini converts natural language to structured data
Fallback Parser - Regex patterns catch common phrases if AI fails

This layered approach keeps logging reliable: if the AI call is slow, fails, or returns something unparseable, the regex fallback takes over.
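One way to wire the layers together is to race the AI call against a timer and drop to the local parser on any failure. The sketch below is an assumption about how that could look, not the code from the app: parseWithFallback, the AiParser and FallbackParser types, and the 2000 ms default are all hypothetical names and values.

```typescript
// Same shape as the ParsedVoiceCommand interface used in this article.
interface ParsedVoiceCommand {
  action: 'log' | 'skip' | 'note' | 'pr' | 'unknown';
  sets?: { weight: number | null; reps: number | null }[];
  note?: string;
  confidence: 'high' | 'medium' | 'low';
  rawText: string;
}

// Hypothetical signatures: any async AI parser and any sync fallback fit here.
type AiParser = (transcript: string) => Promise<ParsedVoiceCommand>;
type FallbackParser = (transcript: string) => ParsedVoiceCommand;

async function parseWithFallback(
  transcript: string,
  aiParse: AiParser,
  fallbackParse: FallbackParser,
  timeoutMs = 2000 // assumed latency budget; tune for your app
): Promise<ParsedVoiceCommand> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error('AI parse timed out')), timeoutMs);
  });

  try {
    // Whichever settles first wins: the AI result or the timeout rejection.
    return await Promise.race([aiParse(transcript), timeout]);
  } catch {
    // Network error, malformed response, or timeout: use the local parser.
    return fallbackParse(transcript);
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

Passing the two parsers in as arguments keeps the function independent of any particular AI SDK and makes the timeout path easy to exercise with a deliberately slow stub.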

Key Implementation Details

1. The Command Interface

First, define what a parsed command looks like:

interface ParsedVoiceCommand {
  action: 'log' | 'skip' | 'note' | 'pr' | 'unknown';
  sets?: { weight: number | null; reps: number | null }[];
  note?: string;
  confidence: 'high' | 'medium' | 'low';
  rawText: string;
}

The confidence field is crucial; it lets the UI handle uncertainty gracefully.

2. Web Speech API Setup

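// Chrome and Safari expose the constructor under a webkit prefix.
const SpeechRecognitionImpl =
  (window as any).SpeechRecognition || (window as any).webkitSpeechRecognition;
if (!SpeechRecognitionImpl) {
  throw new Error('Web Speech API is not supported in this browser');
}

const recognition = new SpeechRecognitionImpl();
recognition.continuous = false;
recognition.interimResults = true; // Show text as user speaks
recognition.lang = 'en-US';

recognition.onresult = (event) => {
  // With interimResults on, this fires repeatedly; only the final result is parsed.
  const result = event.results[event.resultIndex];
  const text = result[0].transcript;

  if (result.isFinal) {
    parseTranscript(text); // Send to AI
  }
};

// Surface failures such as a denied microphone permission instead of failing silently.
recognition.onerror = (event) => {
  console.warn('Speech recognition error:', event.error);
};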

Setting interimResults: true provides real-time feedback while the user speaks.
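To actually display that feedback, the handler needs to separate text that is still changing from text the recognizer has committed. Here is a minimal sketch of that split. The splitTranscript helper and the *Like interfaces are hypothetical; they mirror only the parts of the browser's result list that the helper reads, so it can run without a microphone.

```typescript
// Structural stand-ins for SpeechRecognitionAlternative, SpeechRecognitionResult,
// and SpeechRecognitionResultList (only the members used below).
interface AlternativeLike {
  transcript: string;
}
interface ResultLike {
  isFinal: boolean;
  [index: number]: AlternativeLike;
}
interface ResultListLike {
  length: number;
  [index: number]: ResultLike;
}

function splitTranscript(results: ResultListLike): { finalText: string; interimText: string } {
  let finalText = '';
  let interimText = '';
  for (let i = 0; i < results.length; i++) {
    const result = results[i];
    // result[0] is the top-ranked alternative for this chunk of speech.
    if (result.isFinal) {
      finalText += result[0].transcript;
    } else {
      interimText += result[0].transcript;
    }
  }
  return { finalText: finalText.trim(), interimText: interimText.trim() };
}
```

Inside onresult you could call splitTranscript(event.results), render interimText in a muted style, and send finalText to the parser once it is non-empty.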

3. The AI Prompt

The prompt engineering is critical for reliable JSON output:

Parse the user's spoken workout input into structured data.

## Common Patterns to Recognize
- "185 for 8" → weight: 185, reps: 8
- "Just did 3 sets of 10 at 135" → 3 sets, weight: 135, reps: 10
- "Did 8, 8, and 6" → 3 sets with reps: [8, 8, 6]
- "Skip this set" → action: skip
- "Add a note: [text]" → action: note

## Response Format
Return JSON only:
{
  "action": "log" | "skip" | "note" | "pr" | "unknown",
  "sets": [{ "weight": number | null, "reps": number | null }],
  "note": string (only when action is "note"),
  "confidence": "high" | "medium" | "low"
}

Key prompt techniques:

Examples first - Show the AI exactly what patterns to expect
Strict output format - Request JSON only, no prose
Confidence field - Let the AI express uncertainty
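Asking for JSON only does not guarantee you get it, so it helps to validate the reply before trusting it. The following is a sketch under assumptions: parseModelResponse is a hypothetical helper, and its strictness (rejecting the whole reply if any set is malformed) is one possible policy rather than the one the app uses.

```typescript
// Same shape as the ParsedVoiceCommand interface used in this article.
interface ParsedVoiceCommand {
  action: 'log' | 'skip' | 'note' | 'pr' | 'unknown';
  sets?: { weight: number | null; reps: number | null }[];
  note?: string;
  confidence: 'high' | 'medium' | 'low';
  rawText: string;
}

const ACTIONS = ['log', 'skip', 'note', 'pr', 'unknown'] as const;
const CONFIDENCES = ['high', 'medium', 'low'] as const;

function isNumberOrNull(v: unknown): v is number | null {
  return v === null || (typeof v === 'number' && Number.isFinite(v));
}

// Returns null when the reply is unusable so the caller can fall back.
function parseModelResponse(raw: string, rawText: string): ParsedVoiceCommand | null {
  // Models sometimes wrap JSON in prose or code fences; keep the outermost {...}.
  const start = raw.indexOf('{');
  const end = raw.lastIndexOf('}');
  if (start === -1 || end <= start) return null;

  let data: unknown;
  try {
    data = JSON.parse(raw.slice(start, end + 1));
  } catch {
    return null;
  }
  if (typeof data !== 'object' || data === null) return null;
  const obj = data as Record<string, unknown>;

  // Accept only the literal values the interface allows.
  const action = ACTIONS.find((a) => a === obj.action);
  const confidence = CONFIDENCES.find((c) => c === obj.confidence);
  if (!action || !confidence) return null;

  // rawText comes from the client, not from the model.
  const command: ParsedVoiceCommand = { action, confidence, rawText };

  if (obj.sets != null) {
    if (!Array.isArray(obj.sets)) return null;
    const sets: { weight: number | null; reps: number | null }[] = [];
    for (const s of obj.sets) {
      if (typeof s !== 'object' || s === null) return null;
      const { weight, reps } = s as Record<string, unknown>;
      if (!isNumberOrNull(weight) || !isNumberOrNull(reps)) return null;
      sets.push({ weight, reps });
    }
    command.sets = sets;
  }
  if (typeof obj.note === 'string') command.note = obj.note;
  return command;
}
```

A null return is the signal to run the regex fallback, which keeps "the model replied with something odd" on the same code path as "the model did not reply".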

4. The Fallback Parser

AI isn't 100% reliable. The fallback uses regex for common patterns:

function fallbackParse(transcript: string): ParsedVoiceCommand {
  const text = transcript.toLowerCase();

  // "185 for 8" or "185 x 8"
  const weightRepsMatch = text.match(/(\d+(?:\.\d+)?)\s*(?:for|x|times|@)\s*(\d+)/);
  if (weightRepsMatch) {
    return {
      action: 'log',
      sets: [{
        weight: parseFloat(weightRepsMatch[1]), // decimals allowed, e.g. 72.5
        reps: parseInt(weightRepsMatch[2], 10)
      }],
      confidence: 'high',
      rawText: transcript
    };
  }

  // "8 reps at 185"
  const repsWeightMatch = text.match(/(\d+)\s*reps?\s*(?:at|with)\s*(\d+(?:\.\d+)?)/);
  if (repsWeightMatch) {
    return {
      action: 'log',
      sets: [{
        weight: parseFloat(repsWeightMatch[2]),
        reps: parseInt(repsWeightMatch[1], 10)
      }],
      confidence: 'high',
      rawText: transcript
    };
  }

  // Skip patterns
  if (text.includes('skip')) {
    return { action: 'skip', confidence: 'high', rawText: transcript };
  }

  return { action: 'unknown', confidence: 'low', rawText: transcript };
}

Lessons Learned

1. Always Have a Fallback

AI APIs can be slow, fail, or return unexpected formats. The regex fallback handles 80% of cases instantly.

2. Natural Language is Messy

People say the same thing many ways:

"185 for 8"
"8 reps at 185"
"185 pounds, 8 reps"
"Just did 8 at 185"

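Your parser needs to handle all of these. Note that the regex fallback shown earlier matches only the first two; "185 pounds, 8 reps" and "Just did 8 at 185" depend on the LLM.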
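If you want a local parser to cover all four phrasings as well, an ordered rule table is one option. This is a sketch, not the app's parser: the RULES table, the parseSet name, and the unit words it accepts are assumptions chosen for illustration.

```typescript
type ParsedSet = { weight: number; reps: number };

// Each rule says which capture group holds the weight and which holds the reps.
// Rules are tried in order, so the more specific phrasings come first.
const RULES: { pattern: RegExp; weight: number; reps: number }[] = [
  // "8 reps at 185"
  { pattern: /(\d+)\s*reps?\s*(?:at|with)\s*(\d+(?:\.\d+)?)/, weight: 2, reps: 1 },
  // "185 pounds, 8 reps"
  { pattern: /(\d+(?:\.\d+)?)\s*(?:pounds?|lbs?|kilos?|kg)\b[\s,]*(?:for\s*)?(\d+)/, weight: 1, reps: 2 },
  // "185 for 8" or "185 x 8"
  { pattern: /(\d+(?:\.\d+)?)\s*(?:for|x|times)\s*(\d+)/, weight: 1, reps: 2 },
  // "just did 8 at 185"
  { pattern: /(\d+)\s*at\s*(\d+(?:\.\d+)?)/, weight: 2, reps: 1 },
];

function parseSet(transcript: string): ParsedSet | null {
  const text = transcript.toLowerCase();
  for (const rule of RULES) {
    const m = text.match(rule.pattern);
    if (m) {
      return {
        weight: parseFloat(m[rule.weight]),
        reps: parseInt(m[rule.reps], 10),
      };
    }
  }
  return null; // No rule matched; leave it to the LLM or ask the user.
}
```

Adding a phrasing then means adding one row rather than another if block, though each new rule should be checked against the existing ones for ordering conflicts.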

3. Confidence Enables Better UX

When confidence is "low", show the user what was parsed and let them correct it. Don't just assume.
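That rule can live in one small function the UI consults before committing anything. The sketch below is an assumption about how to structure it: nextUiStep and the three step names are hypothetical, and treating "medium" as commit-with-undo is my own illustrative policy, since the rule above only covers "low".

```typescript
type Confidence = 'high' | 'medium' | 'low';
type UiStep = 'commit' | 'commit-with-undo' | 'confirm';

function nextUiStep(action: string, confidence: Confidence): UiStep {
  // Anything unrecognized or low-confidence is shown to the user for correction.
  if (action === 'unknown' || confidence === 'low') return 'confirm';
  // Assumed policy: medium confidence logs the set but offers a quick undo.
  return confidence === 'high' ? 'commit' : 'commit-with-undo';
}
```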

4. Interim Results Feel Responsive

Showing the transcript as the user speaks (via interimResults: true) makes the UI feel instant, even if parsing takes a moment.

The Result

Users can now log sets by voice:

Tap the mic button
Say "225 for 6"
See the set logged instantly

No fumbling, no typing, no friction.

Try It Yourself

The core pattern is reusable for any voice-controlled feature:

Capture with Web Speech API
Parse with an LLM (structured output prompts)
Fallback to regex for reliability
Use confidence levels for graceful degradation

Voice interfaces are underutilized in web apps. With modern AI, they're surprisingly easy to build well.

