Wav2TextGrid

This package accompanies the following paper: https://doi.org/10.1044/2024_JSLHR-24-00347

Version 0.1.2 (Beta) — A phonetic forced aligner

Wav2TextGrid has been tested on Ubuntu 18.04, Ubuntu 22.04, Windows 11, and macOS 15.6.1

⚠️ This aligner is currently in development. If you encounter issues, please open an issue or email me directly: pkadambi@asu.edu

🚀 Quick Start (User Installation)

For Users - Install from PyPI

pip install Wav2TextGrid==0.1.2

⚠️ Note: By default, the package installs the CPU-only build of PyTorch. If you want GPU inference, install a GPU-enabled PyTorch (v2.5+) BEFORE installing Wav2TextGrid.

Requirements: Python 3.10+

Basic Usage

  1. Align a single file:

    w2tg /path/to/audio.wav /path/to/transcript.lab ./output.TextGrid
  2. Align an entire directory:

    w2tg /path/to/audio_dir/ /path/to/transcript_dir/ ./outputs/

🛠️ Development Setup

Prerequisites

  • Python 3.10+
  • uv - Modern Python package manager

Install uv

# macOS/Linux
curl -LsSf https://astral.sh/uv/install.sh | sh

# Windows
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"

# Or via pip
pip install uv

Clone and Setup Development Environment

# Clone the repository
git clone https://github.com/pkadambi/Wav2TextGrid.git
cd Wav2TextGrid

# Create virtual environment and install dependencies
uv sync

# Install pre-commit hooks
uv run --only-group dev pre-commit install

Development Commands (Makefile)

The project includes a Makefile with common development tasks:

# Code formatting
make format          # Format code with Ruff
make format-check    # Check formatting without changes

# Linting
make lint            # Fix linting issues with Ruff  
make lint-check      # Check linting without fixes

# Type checking
make mypy-check      # Run mypy type checking

# Cleanup
make fresh-slate     # Clean the Python environment (deletes .venv and uv.lock)

Dependency Groups

The project uses uv dependency groups defined in pyproject.toml:

  • dev: Development tools (pre-commit, ruff, mypy)
  • security: Security scanning tools (safety)
# Install only development dependencies
uv sync --only-group dev

# Install specific groups
uv sync --group dev --group security

# Run commands with specific groups
uv run --only-group dev ruff check .

Pre-commit Hooks

Set up pre-commit to ensure code quality:

# Install pre-commit hooks (run once)
uv run --only-group dev pre-commit install

# Run pre-commit on all files
uv run --only-group dev pre-commit run --all-files

# Pre-commit will now run automatically on git commits

📁 Data Format and Best Practices

Recommended Directory Structure

Follow the Kaldi-style organization for best results:

/Dataset/folder/
├── Speaker1/
│   ├── file1.lab
│   ├── file1.wav
│   ├── file2.lab
│   ├── file2.wav
├── Speaker2/
│   ├── file3.lab
│   ├── file3.wav
│   ├── file4.lab
│   ├── file4.wav

File Formats

  • Audio files: .wav format (16kHz recommended)
  • Transcript files: .lab format containing plain text transcriptions
  • Output: .TextGrid format compatible with Praat
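
Before aligning or training on a directory, it can help to confirm that every audio file has a matching transcript. A minimal sketch (the dataset path is a placeholder for your own speaker folders):

from pathlib import Path

dataset = Path("/Dataset/folder")  # placeholder; point at the directory shown above

# Report any .wav without a sibling .lab, and vice versa
for wav in sorted(dataset.rglob("*.wav")):
    if not wav.with_suffix(".lab").exists():
        print(f"Missing transcript for {wav}")
for lab in sorted(dataset.rglob("*.lab")):
    if not lab.with_suffix(".wav").exists():
        print(f"Missing audio for {lab}")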

⚙️ Training Custom Models

1. Prepare Your Training Data

Organize your data as in Data Format and Best Practices above: one .wav file and one .lab transcript per utterance, plus a .TextGrid with phone-level alignments on an IntervalTier named "phones" for each training file (see TextGrid Format below).

2. Train Your Model

w2tg_train /path/to/training_data/ /path/to/output_model/

3. Use Trained Model for Alignment

w2tg /path/to/audio.wav /path/to/transcript.lab ./output.TextGrid --aligner_model /path/to/your_model/

🏗️ Project Structure

Wav2TextGrid/
├── src/Wav2TextGrid/           # Main package source
│   ├── aligner_core/           # Core alignment algorithms
│   ├── utils/                  # Utility functions
│   ├── wav2textgrid.py         # Inference interface
│   └── wav2textgrid_train.py   # Training interface
├── scripts/                    # Development scripts
│   ├── run_inference_workflow.py  # CI/CD testing
│   └── test_local.py           # Local testing
├── examples/                   # Example audio/transcript pairs
├── .github/workflows/          # CI/CD pipelines
├── Makefile                    # Development commands
├── pyproject.toml              # Project configuration
└── uv.lock                     # Dependency lock file

🔧 Development Workflow

Making Changes

  1. Create a feature branch:

    git checkout -b feature/your-feature-name
  2. Set up the Safety dependency firewall (see https://docs.safetycli.com/safety-docs/firewall/introduction-to-safety-firewall)

  3. Make your changes and ensure code quality:

    make format      # Format code
    make lint        # Fix linting issues
    make mypy-check  # Type checking
  4. Commit with pre-commit checks:

    git add .
    git commit -m "Your commit message"
    # Pre-commit hooks will run automatically

Code Quality Standards

  • Formatting: Ruff formatter
  • Linting: Ruff linter with security checks
  • Type checking: mypy
  • Pre-commit hooks: Automatic checks on commit
  • CI/CD: GitHub Actions for multi-platform testing

Local Testing
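
The repository's local test script lives at scripts/test_local.py (see Project Structure above). Running it inside the project environment, e.g. uv run python scripts/test_local.py, is a quick way to exercise the inference path locally before pushing changes.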

🎵 Supported Formats

Audio Formats

  • Primary: .wav files (recommended: 16kHz sampling rate)
  • Alternative: .mp3 files (specify --filetype=mp3 for w2tg)
  • Training: Only .wav files supported for w2tg_train
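
If your source audio is not 16kHz .wav, it can be resampled ahead of time. A minimal sketch using the third-party librosa and soundfile packages (not installed by Wav2TextGrid):

import librosa
import soundfile as sf

# Decode and resample to the recommended 16 kHz in one step
audio, sr = librosa.load("input.mp3", sr=16000)
sf.write("output.wav", audio, sr)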

Transcript Format

  • File extension: .lab
  • Content: Single line of text transcript
  • Example: SHE HAD YOUR DARK SUIT IN GREASY WASH WATER ALL YEAR
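
Because a .lab file is a single line of plain text, transcripts can also be generated programmatically. A trivial sketch with placeholder file names:

# Write one transcript per utterance, named to match its .wav file
with open("file1.lab", "w") as f:
    f.write("SHE HAD YOUR DARK SUIT IN GREASY WASH WATER ALL YEAR")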

TextGrid Format

  • Output: Praat-compatible .TextGrid files
  • Structure: Contains IntervalTier named "phones"
  • Training: Must include phone-level alignments for training data
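
To inspect aligner output, or to check that training TextGrids carry the required tier, the third-party textgrid package works well. A sketch, assuming pip install textgrid:

import textgrid

tg = textgrid.TextGrid.fromFile("output.TextGrid")
phones = tg.getFirst("phones")  # the IntervalTier described above
for interval in phones:
    print(f"{interval.minTime:.2f}-{interval.maxTime:.2f}\t{interval.mark}")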

⚠️ Model Scope and Limitations

Training Data

Wav2TextGrid was trained on:

  • Demographics: Children ages 3–7 years
  • Dataset: 3,700 utterances from Test of Childhood Stuttering (TOCS)
  • Duration: Short utterances (~2–5 seconds)
  • Quality: Clean, non-conversational audio

Best Performance

  • ✅ Child speech (ages 3–7)
  • ✅ Short, clean utterances
  • ✅ Read speech (non-conversational)
  • ✅ High-quality audio recordings

Use With Caution

  • ⚠️ Adult speech (not validated for the fine-tuned model)
  • ⚠️ Long utterances (>5–10 seconds)
  • ⚠️ Conversational or spontaneous speech
  • ⚠️ Noisy audio recordings
  • ⚠️ Mixed adult/child conversations

Recommendation: Always validate alignments when using outside the intended scope.


📄 Citation

If you use Wav2TextGrid in your research, please cite:

@software{kadambi2024wav2textgrid,
  title={Wav2TextGrid: A Phonetic Forced Aligner},
  author={Kadambi, Prad},
  year={2024},
  url={https://github.com/pkadambi/Wav2TextGrid},
  version={0.1.2}
}

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.


🙏 Acknowledgments

  • Based on techniques from Charsiu by lingjzhu and henrynomeland
  • Initialized with models trained on the CommonVoice and LibriSpeech datasets
  • Fine-tuned on the Test of Childhood Stuttering (TOCS) corpus

🗺️ Roadmap

  • GUI application for Windows (ongoing, expected by 05/31)
  • Add training functionality to the GUI (longer term, likely by end of June)

Attribution and Citation


BibTeX

@article{kadambi2025tunable,
  title={A Tunable Forced Alignment System Based on Deep Learning: Applications to Child Speech},
  author={Kadambi, Prad and Mahr, Tristan J and Hustad, Katherine C and Berisha, Visar},
  journal={Journal of Speech, Language, and Hearing Research},
  pages={1--19},
  year={2025},
  publisher={American Speech-Language-Hearing Association}
}

MLA

Kadambi, Prad, et al. "A Tunable Forced Alignment System Based on Deep Learning: Applications to Child Speech." Journal of Speech, Language, and Hearing Research (2025): 1-19.

APA

Kadambi, P., Mahr, T. J., Hustad, K. C., & Berisha, V. (2025). A Tunable Forced Alignment System Based on Deep Learning: Applications to Child Speech. Journal of Speech, Language, and Hearing Research, 1-19.

Algorithm Details


Pronunciation Dictionary

Uses g2p_en (CMUdict) for text-to-phoneme conversion.
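
g2p_en exposes a single callable; for reference, converting a transcript to ARPAbet phones looks like this:

from g2p_en import G2p

g2p = G2p()
print(g2p("greasy wash water"))
# ARPAbet phones with stress digits; word boundaries appear as ' ' entries,
# e.g. ['G', 'R', 'IY1', 'S', 'IY0', ' ', 'W', 'AA1', 'SH', ...]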

x-Vector Extraction

  1. Voice Activity Detection (VAD): speechbrain/vad-crdnn-libriparty (Hugging Face)
  2. xVector speaker embeddings: speechbrain/spkrec-ecapa-voxceleb (Hugging Face)
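
For reference, extracting an ECAPA speaker embedding with SpeechBrain looks roughly like the sketch below (the import path shown is for SpeechBrain < 1.0; newer releases move it to speechbrain.inference):

import torchaudio
from speechbrain.pretrained import EncoderClassifier

classifier = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")
signal, fs = torchaudio.load("file1.wav")
embedding = classifier.encode_batch(signal)  # one 192-dim vector per utterance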

Forced Alignment

  • Frame-level phoneme posterior distributions predicted by Wav2Vec2

  • Alignment via Viterbi decoding (10 ms granularity)
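
Conceptually, the decoder takes a (frames × phonemes) log-posterior matrix and the transcript's phoneme sequence, then finds the monotonic stay-or-advance path with maximal total log-probability. A toy illustration of that recursion (illustrative only, not the package's implementation):

import numpy as np

def viterbi_align(log_probs: np.ndarray, phone_ids: list[int]) -> list[int]:
    """log_probs: (T frames, V phonemes); returns a phone index per 10 ms frame."""
    T, S = log_probs.shape[0], len(phone_ids)
    score = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    score[0, 0] = log_probs[0, phone_ids[0]]
    for t in range(1, T):
        for s in range(S):
            stay = score[t - 1, s]                              # repeat current phone
            advance = score[t - 1, s - 1] if s > 0 else -np.inf  # move to next phone
            back[t, s] = 0 if stay >= advance else 1
            score[t, s] = max(stay, advance) + log_probs[t, phone_ids[s]]
    # Backtrace from the final phone at the final frame
    path, s = [], S - 1
    for t in range(T - 1, -1, -1):
        path.append(s)
        if t > 0:
            s -= back[t, s]
    return path[::-1]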

Acknowledgements


This work was supported by the National Institute on Deafness and Other Communication Disorders of the National Institutes of Health under award number R01 DC019645.


Portions of this project adapt code from the excellent Charsiu aligner: https://github.com/lingjzhu/charsiu
