When your hands are wrapped around a barbell, the last thing you want is to fumble with your phone. This is the story of how I built voice-controlled workout logging using the Web Speech API and Google's Gemini AI.
The Problem
Logging workouts while exercising is awkward:
- Sweaty hands make touchscreens unreliable
- Mid-set, you can't stop to type
- Small input fields are frustrating when you're fatigued
- You just want to say "185 for 8" and move on
Traditional form inputs fail in physical contexts. Voice is the natural solution, but natural language is messy.
The Architecture
The system has three layers:
- Voice Capture - Web Speech API captures spoken input
- AI Parsing - Gemini converts natural language to structured data
- Fallback Parser - Regex patterns catch common phrases if AI fails
This layered approach ensures reliability: if the AI is slow or fails, the fallback kicks in.
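One way to wire the layers together is to give the AI call a time budget and drop to the regex parser on any error or timeout. The sketch below is an illustration under assumptions, not the app's actual code: `parseWithFallback`, `withTimeout`, and the 2-second budget are made-up names and values, and `aiParse` stands in for whatever function wraps the Gemini request.

```typescript
// Shape of a parsed command; the same interface is introduced in the next section.
interface ParsedVoiceCommand {
  action: 'log' | 'skip' | 'note' | 'pr' | 'unknown';
  sets?: { weight: number | null; reps: number | null }[];
  note?: string;
  confidence: 'high' | 'medium' | 'low';
  rawText: string;
}

// Rejects if `promise` has not settled within `ms` milliseconds.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
    promise.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

// Tries the AI parser first; any error or timeout falls through to the regex parser.
async function parseWithFallback(
  transcript: string,
  aiParse: (t: string) => Promise<ParsedVoiceCommand>, // hypothetical Gemini wrapper
  fallbackParse: (t: string) => ParsedVoiceCommand,
  timeoutMs = 2000, // assumed budget, tune for your app
): Promise<ParsedVoiceCommand> {
  try {
    return await withTimeout(aiParse(transcript), timeoutMs);
  } catch {
    return fallbackParse(transcript);
  }
}
```

Because the fallback is synchronous and local, the worst-case wait for the user is bounded by `timeoutMs` rather than by the network.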
Key Implementation Details
1. The Command Interface
First, define what a parsed command looks like:
interface ParsedVoiceCommand {
  action: 'log' | 'skip' | 'note' | 'pr' | 'unknown';
  sets?: { weight: number | null; reps: number | null }[];
  note?: string;
  confidence: 'high' | 'medium' | 'low';
  rawText: string;
}
The confidence field is crucial; it lets the UI handle uncertainty gracefully.
2. Web Speech API Setup
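// Chromium-based browsers and Safari have historically exposed only the
// webkit-prefixed constructor, so check for both.
const SpeechRecognition =
  (window as any).SpeechRecognition || (window as any).webkitSpeechRecognition;
const recognition = new SpeechRecognition();
recognition.continuous = false;     // stop listening after one utterance
recognition.interimResults = true;  // emit partial transcripts while speaking
recognition.lang = 'en-US';
recognition.onresult = (event) => {
  const result = event.results[event.resultIndex];
  const text = result[0].transcript;
  if (result.isFinal) {
    parseTranscript(text);  // only the final transcript goes to the parser
  }
};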
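Setting interimResults: true makes onresult fire with partial transcripts while the user is still speaking, which the UI can display as real-time feedback. Only the final result is passed on to the parser.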
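To actually show the partial text, the handler needs to separate interim results from final ones. Here is a minimal sketch of that split as a pure function. `splitTranscript` and the `ResultLike` types are hypothetical names; the types are structural stand-ins for the parts of the browser's result list the function reads, so it can be exercised without a microphone.

```typescript
// Stand-ins for the fields of SpeechRecognitionResult that are read here.
interface AlternativeLike { transcript: string }
interface ResultLike {
  isFinal: boolean;
  [index: number]: AlternativeLike;
}

// Walks the results from `startIndex` and separates finalized text
// from text the recognizer may still revise.
function splitTranscript(
  results: ArrayLike<ResultLike>,
  startIndex = 0,
): { finalText: string; interimText: string } {
  let finalText = '';
  let interimText = '';
  for (let i = startIndex; i < results.length; i++) {
    const result = results[i];
    const best = result[0].transcript; // first alternative is the top guess
    if (result.isFinal) {
      finalText += best;
    } else {
      interimText += best;
    }
  }
  return { finalText, interimText };
}
```

Inside onresult this might be called as `splitTranscript(event.results, event.resultIndex)`, with `interimText` written to a live caption and a non-empty `finalText` handed to `parseTranscript`.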
3. The AI Prompt
The prompt engineering is critical for reliable JSON output:
Parse the user's spoken workout input into structured data.
## Common Patterns to Recognize
- "185 for 8" → weight: 185, reps: 8
- "Just did 3 sets of 10 at 135" → 3 sets, weight: 135, reps: 10
- "Did 8, 8, and 6" → 3 sets with reps: [8, 8, 6]
- "Skip this set" → action: skip
- "Add a note: [text]" → action: note
## Response Format
Return JSON only:
{
  "action": "log" | "skip" | "note" | "pr" | "unknown",
  "sets": [{ "weight": number | null, "reps": number | null }],
  "note": string | null,
  "confidence": "high" | "medium" | "low"
}
Key prompt techniques:
- Examples first - Show the AI exactly what patterns to expect
- Strict output format - Request JSON only, no prose
- Confidence field - Let the AI express uncertainty
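Even with a strict prompt, the model's reply is still untrusted text, so it is worth validating before it reaches the UI. The following is a sketch of one way to do that, assuming the reply arrives as a plain string. `parseModelResponse` is a hypothetical helper, and anything it cannot validate is downgraded to an unknown, low-confidence command.

```typescript
type Action = 'log' | 'skip' | 'note' | 'pr' | 'unknown';
type Confidence = 'high' | 'medium' | 'low';
interface ParsedVoiceCommand {
  action: Action;
  sets?: { weight: number | null; reps: number | null }[];
  note?: string;
  confidence: Confidence;
  rawText: string;
}

const ACTIONS: string[] = ['log', 'skip', 'note', 'pr', 'unknown'];
const CONFIDENCES: string[] = ['high', 'medium', 'low'];

function toNumberOrNull(v: unknown): number | null {
  return typeof v === 'number' && isFinite(v) ? v : null;
}

function parseModelResponse(raw: string, rawText: string): ParsedVoiceCommand {
  const unknown: ParsedVoiceCommand = { action: 'unknown', confidence: 'low', rawText };

  // Replies sometimes carry extra prose or a code fence; keep only the outermost {...}.
  const start = raw.indexOf('{');
  const end = raw.lastIndexOf('}');
  if (start === -1 || end <= start) return unknown;

  let data: unknown;
  try {
    data = JSON.parse(raw.slice(start, end + 1));
  } catch {
    return unknown;
  }
  if (typeof data !== 'object' || data === null) return unknown;
  const obj = data as Record<string, unknown>;

  if (typeof obj.action !== 'string' || ACTIONS.indexOf(obj.action) === -1) return unknown;
  // A missing or invalid confidence value is treated as low.
  const confidence: Confidence =
    typeof obj.confidence === 'string' && CONFIDENCES.indexOf(obj.confidence) !== -1
      ? (obj.confidence as Confidence)
      : 'low';

  const result: ParsedVoiceCommand = { action: obj.action as Action, confidence, rawText };
  if (Array.isArray(obj.sets)) {
    result.sets = obj.sets
      .filter((s): s is Record<string, unknown> => typeof s === 'object' && s !== null)
      .map((s) => ({ weight: toNumberOrNull(s.weight), reps: toNumberOrNull(s.reps) }));
  }
  if (typeof obj.note === 'string') result.note = obj.note;
  return result;
}
```

Returning a well-formed unknown command instead of throwing keeps the caller's control flow simple: every path yields a `ParsedVoiceCommand`.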
4. The Fallback Parser
AI isn't 100% reliable. The fallback uses regex for common patterns:
function fallbackParse(transcript: string): ParsedVoiceCommand {
  const text = transcript.toLowerCase();
  // "185 for 8", "185 x 8", "182.5 @ 5": weight first, then reps
  const weightRepsMatch = text.match(/(\d+(?:\.\d+)?)\s*(?:for|x|times|@)\s*(\d+)/);
  if (weightRepsMatch) {
    return {
      action: 'log',
      sets: [{
        weight: parseFloat(weightRepsMatch[1]),
        reps: parseInt(weightRepsMatch[2], 10)
      }],
      confidence: 'high',
      rawText: transcript
    };
  }
  // "8 reps at 185", "8 reps with 185": reps first, then weight
  const repsWeightMatch = text.match(/(\d+)\s*reps?\s*(?:at|with)\s*(\d+(?:\.\d+)?)/);
  if (repsWeightMatch) {
    return {
      action: 'log',
      sets: [{
        weight: parseFloat(repsWeightMatch[2]),
        reps: parseInt(repsWeightMatch[1], 10)
      }],
      confidence: 'high',
      rawText: transcript
    };
  }
  if (text.includes('skip')) {
    return { action: 'skip', confidence: 'high', rawText: transcript };
  }
  // Nothing matched: report low confidence so the UI can ask again
  return { action: 'unknown', confidence: 'low', rawText: transcript };
}
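The fallback above returns a single set. The multi-set phrases from the prompt examples ("3 sets of 10 at 135", "Did 8, 8, and 6") could be covered by the same regex approach. What follows is a sketch, not part of the original parser: `expandSets` and the `MAX_SETS` cap are assumptions added for illustration.

```typescript
interface SetEntry { weight: number | null; reps: number | null }

const MAX_SETS = 20; // assumed sanity cap so a misheard "300 sets" is rejected

// Returns one entry per set, or null if no multi-set phrase is found.
function expandSets(transcript: string): SetEntry[] | null {
  const text = transcript.toLowerCase();

  // "3 sets of 10 at 135" (the weight part is optional)
  const uniform = text.match(
    /(\d+)\s*sets?\s*of\s*(\d+)(?:\s*(?:at|with|@)\s*(\d+(?:\.\d+)?))?/,
  );
  if (uniform) {
    const count = parseInt(uniform[1], 10);
    if (count < 1 || count > MAX_SETS) return null;
    const reps = parseInt(uniform[2], 10);
    const weight = uniform[3] !== undefined ? parseFloat(uniform[3]) : null;
    const sets: SetEntry[] = [];
    for (let i = 0; i < count; i++) sets.push({ weight, reps });
    return sets;
  }

  // "did 8, 8, and 6": a list of rep counts with no weight mentioned
  const list = text.match(/\d+(?:\s*,\s*(?:and\s+)?\d+|\s+and\s+\d+)+/);
  if (list) {
    const reps: string[] = list[0].match(/\d+/g) || [];
    return reps.map((r) => ({ weight: null, reps: parseInt(r, 10) }));
  }
  return null;
}
```

A rep list leaves `weight` as null, which matches the `number | null` fields in the command interface and lets the UI fill in the weight from the previous set or ask for it.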
Lessons Learned
1. Always Have a Fallback
AI APIs can be slow, fail, or return unexpected formats. The regex fallback covers the most common phrasings, roughly 80% of cases, and does so instantly because it needs no network round trip.
2. Natural Language is Messy
People say the same thing many ways:
- "185 for 8"
- "8 reps at 185"
- "185 pounds, 8 reps"
- "Just did 8 at 185"
Your parser needs to handle all of these. The regex fallback shown earlier matches only the first two; the last two depend on the AI layer.
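If you want regex coverage for all four phrasings, an ordered rule table is one option. The sketch below is an illustration with assumed names (`RULES`, `extractSet`). It also bakes in one guess that deserves scrutiny: a bare "8 at 185" is read as reps first, weight second.

```typescript
interface SetEntry { weight: number; reps: number }

// Each rule names which capture group holds the weight and which holds the reps.
// Order matters: more specific phrasings come before looser ones.
const RULES: { re: RegExp; weight: number; reps: number }[] = [
  // "8 reps at 185", "8 reps with 185"
  { re: /(\d+)\s*reps?\s*(?:at|with)\s*(\d+(?:\.\d+)?)/, weight: 2, reps: 1 },
  // "185 pounds, 8 reps", "185 lbs 8 reps"
  { re: /(\d+(?:\.\d+)?)\s*(?:pounds?|lbs?)\s*,?\s*(\d+)\s*reps?/, weight: 1, reps: 2 },
  // "185 for 8", "185 x 8"
  { re: /(\d+(?:\.\d+)?)\s*(?:for|x|times|@)\s*(\d+)/, weight: 1, reps: 2 },
  // "just did 8 at 185" (assumes reps are spoken before the weight)
  { re: /(\d+)\s*(?:at|with)\s*(\d+(?:\.\d+)?)/, weight: 2, reps: 1 },
];

// Returns the first matching rule's set, or null if nothing matches.
function extractSet(transcript: string): SetEntry | null {
  const text = transcript.toLowerCase();
  for (const rule of RULES) {
    const m = text.match(rule.re);
    if (m) {
      return {
        weight: parseFloat(m[rule.weight]),
        reps: parseInt(m[rule.reps], 10),
      };
    }
  }
  return null;
}
```

Keeping the patterns in a table makes it cheap to add a phrasing when real transcripts reveal one, without touching the matching loop.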
3. Confidence Enables Better UX
When confidence is "low", show the user what was parsed and let them correct it. Don't just assume.
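In code, that decision can be a small function that maps a parsed command to the next UI step. This is a sketch: `nextUiStep` and the three step names are hypothetical, and the rule that a logged set with a missing number also needs confirmation is an assumption on top of the confidence check described above.

```typescript
interface ParsedVoiceCommand {
  action: 'log' | 'skip' | 'note' | 'pr' | 'unknown';
  sets?: { weight: number | null; reps: number | null }[];
  note?: string;
  confidence: 'high' | 'medium' | 'low';
  rawText: string;
}

// 'commit'   -> save immediately
// 'confirm'  -> show what was parsed and let the user correct it
// 'reprompt' -> nothing usable was understood, ask the user to repeat
type UiStep = 'commit' | 'confirm' | 'reprompt';

function nextUiStep(cmd: ParsedVoiceCommand): UiStep {
  if (cmd.action === 'unknown') return 'reprompt';
  if (cmd.confidence === 'low') return 'confirm';

  if (cmd.action === 'log') {
    const sets = cmd.sets ?? [];
    // Assumed rule: an empty or partially filled set is never saved silently.
    const incomplete =
      sets.length === 0 || sets.some((s) => s.weight === null || s.reps === null);
    if (incomplete) return 'confirm';
  }
  return 'commit';
}
```

Medium confidence commits here; an app that pairs every save with an undo action can afford that, while a stricter app might route medium to 'confirm' as well.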
4. Interim Results Feel Responsive
Showing the transcript as the user speaks (via interimResults: true) makes the UI feel instant, even if parsing takes a moment.
The Result
Users can now log sets by voice:
- Tap the mic button
- Say "225 for 6"
- See the set logged instantly
No fumbling, no typing, no friction.
Try It Yourself
The core pattern is reusable for any voice-controlled feature:
- Capture with Web Speech API
- Parse with an LLM (structured output prompts)
- Fallback to regex for reliability
- Use confidence levels for graceful degradation
Voice interfaces are underutilized in web apps. With modern AI, they're surprisingly easy to build well.