v1.1.0

Gemini STT

araa47 araa47 ← All skills

Transcribe audio files using Google's Gemini API or Vertex AI

Downloads
1.2k
Stars
1
Versions
2
Updated
2026-02-24

Install

npx clawhub@latest install gemini-stt

Documentation

Gemini Speech-to-Text Skill

Transcribe audio files using Google's Gemini API or Vertex AI. Default model is gemini-2.0-flash-lite for fastest transcription.

Authentication (choose one)

Option 1: Vertex AI with Application Default Credentials (Recommended)

gcloud auth application-default login

gcloud config set project YOUR_PROJECT_ID

The script will automatically detect and use ADC when available.

Option 2: Direct Gemini API Key

Set GEMINI_API_KEY in environment (e.g., ~/.env or ~/.clawdbot/.env)

Requirements

  • -Python 3.10+ (no external dependencies)
  • -Either GEMINI_API_KEY or gcloud CLI with ADC configured

Supported Formats

  • -.ogg / .opus (Telegram voice messages)
  • -.mp3
  • -.wav
  • -.m4a

Usage

Auto-detect auth (tries ADC first, then GEMINI_API_KEY)

python ~/.claude/skills/gemini-stt/transcribe.py /path/to/audio.ogg

Force Vertex AI

python ~/.claude/skills/gemini-stt/transcribe.py /path/to/audio.ogg --vertex

With a specific model

python ~/.claude/skills/gemini-stt/transcribe.py /path/to/audio.ogg --model gemini-2.5-pro

Vertex AI with specific project and region

python ~/.claude/skills/gemini-stt/transcribe.py /path/to/audio.ogg --vertex --project my-project --region us-central1

With Clawdbot media

python ~/.claude/skills/gemini-stt/transcribe.py ~/.clawdbot/media/inbound/voice-message.ogg

Options

| Option | Description |

|--------|-------------|

| <audio_file> | Path to the audio file (required) |

| --model, -m | Gemini model to use (default: gemini-2.0-flash-lite) |

| --vertex, -v | Force use of Vertex AI with ADC |

| --project, -p | GCP project ID (for Vertex, defaults to gcloud config) |

| --region, -r | GCP region (for Vertex, default: us-central1) |

Supported Models

Any Gemini model that supports audio input can be used. Recommended models:

| Model | Notes |

|-------|-------|

| gemini-2.0-flash-lite | Default. Fastest transcription speed. |

| gemini-2.0-flash | Fast and cost-effective. |

| gemini-2.5-flash-lite | Lightweight 2.5 model. |

| gemini-2.5-flash | Balanced speed and quality. |

| gemini-2.5-pro | Higher quality, slower. |

| gemini-3-flash-preview | Latest flash model. |

| gemini-3-pro-preview | Latest pro model, best quality. |

See [Gemini API Models](https://ai.google.dev/gemini-api/docs/models) for the latest list.

How It Works

1. Reads the audio file and base64 encodes it

2. Auto-detects authentication:

- If ADC is available (gcloud), uses Vertex AI endpoint

- Otherwise, uses GEMINI_API_KEY with direct Gemini API

3. Sends to the selected Gemini model with transcription prompt

4. Returns the transcribed text

Example Integration

For Clawdbot voice message handling:

Transcribe incoming voice message

TRANSCRIPT=$(python ~/.claude/skills/gemini-stt/transcribe.py "$AUDIO_PATH")

echo "User said: $TRANSCRIPT"

Error Handling

The script exits with code 1 and prints to stderr on:

  • -No authentication available (neither ADC nor GEMINI_API_KEY)
  • -File not found
  • -API errors
  • -Missing GCP project (when using Vertex)

Notes

  • -Uses Gemini 2.0 Flash Lite by default for fastest transcription
  • -No external Python dependencies (uses stdlib only)
  • -Automatically detects MIME type from file extension
  • -Prefers Vertex AI with ADC when available (no API key management needed)

Launch an agent with Gemini STT on Termo.