Discord Voice Transcriber Bot

A Discord bot that transcribes voice channel conversations using Google's Gemini AI API.

Features

Joins voice channels and listens to conversations
Transcribes spoken content in real time using Gemini AI
Uses Silero voice activity detection (VAD) to filter out noise and background sounds
Displays placeholder messages immediately when speech is detected
Updates each transcription in place by editing the placeholder message
Simple commands to start and stop transcription

Prerequisites

Node.js v22.14.0 or later (with built-in TypeScript support)
Discord Bot Token
Google Gemini API Key

Setup

Clone the repository
Install dependencies:
```
pnpm install
```
Create a .env file based on the example:
```
cp .env.example .env
```

Add your Discord and Gemini API credentials to the .env file:

DISCORD_TOKEN=your_discord_bot_token
GEMINI_API_KEY=your_gemini_api_key
LOG_LEVEL=4  # Optional: 1=error, 2=warn, 3=log, 4=info, 5=debug

Configure Privileged Intents in the Discord Developer Portal:
- Go to https://discord.com/developers/applications
- Select your bot application
- Go to the "Bot" section
- Under "Privileged Gateway Intents", enable:
  - MESSAGE CONTENT INTENT
- Save changes
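
As a rough illustration of how the .env values above might be validated at startup, here is a minimal sketch. The `BotConfig` shape, the `loadConfig` name, and the default log level of 3 are assumptions made for this example and are not taken from the bot's source.

```typescript
// Hypothetical config shape; field names are illustrative only.
interface BotConfig {
  discordToken: string;
  geminiApiKey: string;
  logLevel: number;
}

// Assumed fallback when LOG_LEVEL is unset; the bot's real default may differ.
const DEFAULT_LOG_LEVEL = 3;

function loadConfig(env: Record<string, string | undefined>): BotConfig {
  const discordToken = env.DISCORD_TOKEN?.trim();
  const geminiApiKey = env.GEMINI_API_KEY?.trim();

  // Both credentials are required, so fail early with a readable message.
  if (!discordToken) throw new Error("DISCORD_TOKEN is missing from .env");
  if (!geminiApiKey) throw new Error("GEMINI_API_KEY is missing from .env");

  // LOG_LEVEL is optional; parseInt tolerates trailing text after the number.
  const raw = env.LOG_LEVEL?.trim();
  const logLevel = raw ? Number.parseInt(raw, 10) : DEFAULT_LOG_LEVEL;
  if (!Number.isInteger(logLevel) || logLevel < 1 || logLevel > 5) {
    throw new Error(`LOG_LEVEL must be an integer from 1 to 5, got "${raw}"`);
  }

  return { discordToken, geminiApiKey, logLevel };
}
```

Calling `loadConfig(process.env)` once at startup would surface a missing token before the bot tries to log in.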

Usage

Start the bot:

pnpm dev

In Discord, use the following commands:

!transcribe - Start transcribing the voice channel you're in
!stop - Stop transcription
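
Because these are plain text commands, the bot has to read message text, which is why the MESSAGE CONTENT INTENT is required. A minimal sketch of how such commands could be recognized follows; `parseCommand` is a hypothetical helper, and exact, case-sensitive matching is an assumption rather than the bot's confirmed behavior.

```typescript
type Command = "transcribe" | "stop";

// The prefix and command names come from this README; everything else is illustrative.
const PREFIX = "!";

function parseCommand(content: string): Command | null {
  const trimmed = content.trim();
  if (!trimmed.startsWith(PREFIX)) return null;

  // Take the first whitespace-delimited word after the prefix.
  const name = trimmed.slice(PREFIX.length).split(/\s+/, 1)[0];
  return name === "transcribe" || name === "stop" ? name : null;
}
```

Any message that does not start with `!` or names an unknown command yields `null`, so ordinary chat is ignored.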

Development

Run type checking:

pnpm typecheck

How It Works

The bot connects to a Discord voice channel
It captures audio streams from users as they speak
Voice activity detection (VAD) determines if actual speech is present
When speech is detected, a placeholder message is immediately created
Audio is processed and converted from 48kHz stereo to 16kHz mono PCM
The processed audio is sent to the Gemini API for transcription
The placeholder message is updated in place with the transcribed text
If no speech is detected, the message is deleted to keep the channel clean
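
The conversion step above (48kHz stereo to 16kHz mono) can be sketched as follows. This is a deliberately naive version and not necessarily how the bot does it: it assumes interleaved 16-bit signed PCM input, averages the left and right channels, then averages every three frames as a crude low-pass filter before decimating by 3. The function name is hypothetical.

```typescript
const INPUT_CHANNELS = 2; // interleaved stereo: L, R, L, R, ...
const DECIMATION = 3;     // 48000 Hz / 16000 Hz

function downsampleToMono16k(input: Int16Array): Int16Array {
  const frames = Math.floor(input.length / INPUT_CHANNELS);
  const outLength = Math.floor(frames / DECIMATION);
  const output = new Int16Array(outLength);

  for (let i = 0; i < outLength; i++) {
    let sum = 0;
    for (let k = 0; k < DECIMATION; k++) {
      const frame = i * DECIMATION + k;
      // Downmix: average the left and right samples of this frame.
      sum += (input[frame * 2] + input[frame * 2 + 1]) / 2;
    }
    // Average three mono frames into one 16 kHz output sample.
    output[i] = Math.round(sum / DECIMATION);
  }
  return output;
}
```

A production resampler would normally apply a proper anti-aliasing filter; the box average here only keeps the example short.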

Technical Details

Uses the Silero VAD model to differentiate speech from noise
Implements a hysteresis pattern with different activation/deactivation thresholds
Processes audio in small chunks (~120ms) for real-time detection
Leverages async iterators for modern stream processing
Uses consola for structured logging with configurable verbosity levels
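
To illustrate the hysteresis pattern mentioned above, here is a minimal sketch of a gate driven by per-chunk speech probabilities between 0 and 1, such as those a VAD model produces. The `createSpeechGate` name and the threshold values in the usage note are assumptions for illustration, not the bot's actual settings.

```typescript
interface HysteresisOptions {
  activateAt: number;   // probability at or above which speech starts
  deactivateAt: number; // probability below which speech ends
}

function createSpeechGate(
  options: HysteresisOptions,
): (probability: number) => boolean {
  if (options.deactivateAt >= options.activateAt) {
    throw new Error("deactivateAt must be lower than activateAt");
  }
  let speaking = false;

  // Call once per audio chunk; returns whether speech is currently active.
  return (probability: number): boolean => {
    if (!speaking && probability >= options.activateAt) {
      speaking = true;  // only a confident chunk opens the gate
    } else if (speaking && probability < options.deactivateAt) {
      speaking = false; // only a clearly quiet chunk closes it
    }
    return speaking;
  };
}
```

With `createSpeechGate({ activateAt: 0.6, deactivateAt: 0.3 })`, a probability of 0.5 keeps the gate open once speech has started but is not enough to open it, which avoids flickering when the probability hovers near a single threshold. If chunking happens after conversion to 16kHz mono, a 120ms chunk is 16000 × 0.12 = 1920 samples.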

License

ISC
