A Discord bot that transcribes voice channel conversations using Google's Gemini AI API.
- Joins voice channels and listens to conversations
- Transcribes spoken content in real time using Gemini AI
- Uses Silero voice activity detection (VAD) to filter out noise and background sounds
- Displays placeholder messages immediately when speech is detected
- Updates transcriptions in-place with real-time editing
- Simple commands to start and stop transcription
- Node.js v22.14.0 or later (with built-in TypeScript support)
- Discord Bot Token
- Google Gemini API Key
- Clone the repository
- Install dependencies: `pnpm install`
- Create a `.env` file based on the example: `cp .env.example .env`
- Add your Discord and Gemini API credentials to the `.env` file:
  - `DISCORD_TOKEN=your_discord_bot_token`
  - `GEMINI_API_KEY=your_gemini_api_key`
  - `LOG_LEVEL=4` (optional: 1=error, 2=warn, 3=log, 4=info, 5=debug)
- Configure Privileged Intents in the Discord Developer Portal:
  - Go to https://discord.com/developers/applications
  - Select your bot application
  - Go to the "Bot" section
  - Under "Privileged Gateway Intents", enable:
    - MESSAGE CONTENT INTENT
  - Save changes
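
The environment variables from the setup steps above could be checked once at startup so that a missing credential fails fast. The following is a minimal sketch, not the bot's actual code: the `BotConfig` shape, the `loadConfig` name, and the default log level of 4 are assumptions made for illustration.

```typescript
// Hypothetical config shape; field names are illustrative, not taken from the bot's source.
interface BotConfig {
  discordToken: string;
  geminiApiKey: string;
  logLevel: number;
}

// Reads the three variables described above from an env-like object.
// Throws if a required credential is missing or LOG_LEVEL is outside 1..5.
function loadConfig(env: Record<string, string | undefined>): BotConfig {
  const discordToken = (env.DISCORD_TOKEN || "").trim();
  const geminiApiKey = (env.GEMINI_API_KEY || "").trim();
  if (!discordToken) throw new Error("DISCORD_TOKEN is not set");
  if (!geminiApiKey) throw new Error("GEMINI_API_KEY is not set");

  // Assumption: fall back to 4 (info) when LOG_LEVEL is omitted.
  const raw = (env.LOG_LEVEL || "").trim();
  const logLevel = raw === "" ? 4 : Number(raw);
  if (!Number.isInteger(logLevel) || logLevel < 1 || logLevel > 5) {
    throw new Error('LOG_LEVEL must be an integer from 1 to 5, got "' + raw + '"');
  }
  return { discordToken, geminiApiKey, logLevel };
}
```

In a Node.js process this would typically be called as `loadConfig(process.env)` after the `.env` file has been loaded.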
Start the bot:
pnpm dev
In Discord, use the following commands:
- `!transcribe` - Start transcribing the voice channel you're in
- `!stop` - Stop transcription
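
As a rough illustration of how the two commands might be recognized in message text, here is a small sketch. The `parseCommand` helper is hypothetical, and treating commands as case-insensitive is an assumption; the bot's real handler may differ.

```typescript
type Command = "transcribe" | "stop";

// Maps raw message text to a command, or null when the message is not a command.
// Assumption: commands are matched case-insensitively on the first word only.
function parseCommand(content: string): Command | null {
  const first = content.trim().split(/\s+/)[0].toLowerCase();
  if (first === "!transcribe") return "transcribe";
  if (first === "!stop") return "stop";
  return null;
}
```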
Run type checking:
pnpm typecheck
- The bot connects to a Discord voice channel
- It captures audio streams from users as they speak
- Voice activity detection (VAD) determines if actual speech is present
- When speech is detected, a placeholder message is immediately created
- Audio is processed and converted from 48kHz stereo to 16kHz mono PCM
- The processed audio is sent to Gemini API for transcription
- The placeholder message is updated in-place with the transcribed text
- If no speech is detected, the message is deleted to keep the channel clean
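
The conversion step from 48kHz stereo to 16kHz mono can be sketched as below. This is a deliberately naive version that assumes interleaved 16-bit samples and averages each group of three stereo frames; the function name is hypothetical, and a production pipeline would likely use a proper low-pass filter or resampling library instead.

```typescript
// Converts interleaved 48 kHz stereo 16-bit PCM (L, R, L, R, ...) to 16 kHz mono 16-bit PCM.
// Each output sample is the mean of three consecutive stereo frames (six input samples).
function downsampleToMono16k(stereo48k: Int16Array): Int16Array {
  const FRAMES_PER_OUTPUT = 3; // 48000 / 16000
  const inputFrames = Math.floor(stereo48k.length / 2);
  const outputLength = Math.floor(inputFrames / FRAMES_PER_OUTPUT);
  const out = new Int16Array(outputLength);

  for (let i = 0; i < outputLength; i++) {
    let sum = 0;
    for (let f = 0; f < FRAMES_PER_OUTPUT; f++) {
      const frame = i * FRAMES_PER_OUTPUT + f;
      sum += stereo48k[2 * frame] + stereo48k[2 * frame + 1]; // left + right
    }
    // The mean of six 16-bit values always fits back into 16 bits.
    out[i] = Math.round(sum / (2 * FRAMES_PER_OUTPUT));
  }
  return out; // trailing frames that do not fill a group of three are dropped
}
```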
- Uses the Silero VAD model to differentiate speech from noise
- Implements a hysteresis pattern with different activation/deactivation thresholds
- Processes audio in small chunks (~120ms) for real-time detection
- Leverages async iterators for modern stream processing
- Uses consola for structured logging with configurable verbosity levels
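
To make the hysteresis idea concrete, the sketch below turns a stream of per-chunk speech probabilities (such as those a VAD model produces) into a stable speaking/not-speaking state. The names and the thresholds of 0.5 and 0.35 are assumptions for illustration, not values taken from the bot. The helper at the end shows the chunk-size arithmetic, assuming the ~120ms chunks are measured on the 16kHz mono signal.

```typescript
// Returns a stateful gate: call it once per chunk with the VAD speech probability.
// Assumed thresholds: speech starts at >= 0.5 and ends only when it drops below 0.35,
// so probabilities hovering between the two do not cause rapid toggling.
function createSpeechGate(activateAt = 0.5, deactivateAt = 0.35) {
  let speaking = false;
  return function update(probability: number): boolean {
    if (!speaking && probability >= activateAt) {
      speaking = true;
    } else if (speaking && probability < deactivateAt) {
      speaking = false;
    }
    return speaking;
  };
}

// Number of samples in one chunk of the given duration.
function samplesPerChunk(sampleRateHz: number, chunkMs: number): number {
  return Math.round((sampleRateHz * chunkMs) / 1000);
}
```

With these assumed thresholds, a probability of 0.4 keeps an active segment open but is not enough to start a new one, which is the point of using two thresholds rather than one.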
ISC