This package accompanies the following paper: https://doi.org/10.1044/2024_JSLHR-24-00347
Version 0.1.2 (Beta) — A phonetic forced aligner
Wav2TextGrid has been tested on Ubuntu 18.04, Ubuntu 22.04, Windows 11, and macOS 15.6.1
```bash
pip install Wav2TextGrid==0.1.2
```

Requirements: Python 3.10+
- Align a single file:

  ```bash
  w2tg /path/to/audio.wav /path/to/transcript.lab ./output.TextGrid
  ```

- Align an entire directory:

  ```bash
  w2tg /path/to/audio_dir/ /path/to/transcript_dir/ ./outputs/
  ```
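If you prefer to drive the aligner from Python, the CLI above can be wrapped with `subprocess`. This is a minimal sketch, assuming `w2tg` is installed and on your PATH; `build_w2tg_cmd` and `align` are illustrative helper names, not part of the package API.

```python
import subprocess


def build_w2tg_cmd(audio, transcript, output, aligner_model=None):
    """Build the w2tg command line shown above (custom model optional)."""
    cmd = ["w2tg", str(audio), str(transcript), str(output)]
    if aligner_model is not None:
        cmd += ["--aligner_model", str(aligner_model)]
    return cmd


def align(audio, transcript, output, aligner_model=None):
    """Run the aligner; raises CalledProcessError if alignment fails."""
    subprocess.run(build_w2tg_cmd(audio, transcript, output, aligner_model), check=True)
```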
- Python 3.10+
- uv: a modern Python package manager
```bash
# macOS/Linux
curl -LsSf https://astral.sh/uv/install.sh | sh

# Windows
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"

# Or via pip
pip install uv
```

```bash
# Clone the repository
git clone https://github.com/pkadambi/Wav2TextGrid.git
cd Wav2TextGrid

# Create virtual environment and install dependencies
uv sync

# Install pre-commit hooks
uv run --only-group dev pre-commit install
```

The project includes a Makefile with common development tasks:
```bash
# Code formatting
make format        # Format code with Ruff
make format-check  # Check formatting without changes

# Linting
make lint          # Fix linting issues with Ruff
make lint-check    # Check linting without fixes

# Type checking
make mypy-check    # Run mypy type checking

# Cleanup
make fresh-slate   # Clean the Python environment by deleting .venv and uv.lock
```

The project uses uv dependency groups defined in pyproject.toml:
- `dev`: development tools (pre-commit, ruff, mypy)
- `security`: security scanning tools (safety)
```bash
# Install only development dependencies
uv sync --only-group dev

# Install specific groups
uv sync --group dev --group security

# Run commands with specific groups
uv run --only-group dev ruff check .
```

Set up pre-commit to ensure code quality:
```bash
# Install pre-commit hooks (run once)
uv run --only-group dev pre-commit install

# Run pre-commit on all files
uv run --only-group dev pre-commit run --all-files

# Pre-commit will now run automatically on git commit
```

Follow the Kaldi-style organization for best results:
```
/Dataset/folder/
├── Speaker1/
│   ├── file1.lab
│   ├── file1.wav
│   ├── file2.lab
│   └── file2.wav
└── Speaker2/
    ├── file3.lab
    ├── file3.wav
    ├── file4.lab
    └── file4.wav
```
- Audio files: `.wav` format (16 kHz recommended)
- Transcript files: `.lab` format containing plain-text transcriptions
- Output: `.TextGrid` format compatible with Praat
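Before running alignment on a full dataset, it can help to verify that every `.wav` has a matching `.lab` and vice versa. A minimal sketch; `find_unpaired` is an illustrative helper, not part of the package.

```python
from pathlib import Path


def find_unpaired(dataset_dir):
    """Scan speaker subfolders recursively (Kaldi-style layout) and return
    .wav files lacking a matching .lab, and .lab files lacking a .wav."""
    root = Path(dataset_dir)
    wavs = {p.with_suffix("") for p in root.rglob("*.wav")}
    labs = {p.with_suffix("") for p in root.rglob("*.lab")}
    missing_lab = sorted(str(p) + ".wav" for p in wavs - labs)
    missing_wav = sorted(str(p) + ".lab" for p in labs - wavs)
    return missing_lab, missing_wav
```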
```bash
# Train a custom aligner
w2tg_train /path/to/training_data/ /path/to/output_model/

# Align using your trained model
w2tg /path/to/audio.wav /path/to/transcript.lab ./output.TextGrid --aligner_model /path/to/your_model/
```

```
Wav2TextGrid/
├── src/Wav2TextGrid/              # Main package source
│   ├── aligner_core/              # Core alignment algorithms
│   ├── utils/                     # Utility functions
│   ├── wav2textgrid.py            # Inference interface
│   └── wav2textgrid_train.py      # Training interface
├── scripts/                       # Development scripts
│   ├── run_inference_workflow.py  # CI/CD testing
│   └── test_local.py              # Local testing
├── examples/                      # Example audio/transcript pairs
├── .github/workflows/             # CI/CD pipelines
├── Makefile                       # Development commands
├── pyproject.toml                 # Project configuration
└── uv.lock                        # Dependency lock file
```
- Create a feature branch:

  ```bash
  git checkout -b feature/your-feature-name
  ```

- Set up the Safety firewall: https://docs.safetycli.com/safety-docs/firewall/introduction-to-safety-firewall

- Make your changes and ensure code quality:

  ```bash
  make format      # Format code
  make lint        # Fix linting issues
  make mypy-check  # Type checking
  ```

- Commit with pre-commit checks:

  ```bash
  git add .
  git commit -m "Your commit message"  # Pre-commit hooks will run automatically
  ```
- Formatting: Ruff formatter
- Linting: Ruff linter with security checks
- Type checking: mypy
- Pre-commit hooks: Automatic checks on commit
- CI/CD: GitHub Actions for multi-platform testing
- Primary: `.wav` files (recommended: 16 kHz sampling rate)
- Alternative: `.mp3` files (specify `--filetype=mp3` for `w2tg`)
- Training: only `.wav` files are supported for `w2tg_train`
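Since 16 kHz `.wav` input is recommended, a quick sample-rate check with Python's standard-library `wave` module can catch mismatched files before alignment. `check_wav` is an illustrative helper, not part of the package.

```python
import wave


def check_wav(path, expected_rate=16000):
    """Return (sample_rate, ok) for a .wav file; ok is True when the
    file already uses the recommended sampling rate."""
    with wave.open(str(path), "rb") as wf:
        rate = wf.getframerate()
    return rate, rate == expected_rate
```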
- File extension: `.lab`
- Content: a single line of text transcript
- Example: `SHE HAD YOUR DARK SUIT IN GREASY WASH WATER ALL YEAR`
- Output: Praat-compatible `.TextGrid` files
- Structure: contains an `IntervalTier` named "phones"
- Training: training data must include phone-level alignments
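If you need to post-process alignments, the "phones" tier can be read back from a long-format TextGrid. A minimal regex sketch (libraries such as `tgt` or `praatio` handle the format more robustly); `read_phone_intervals` is an illustrative helper, not part of the package.

```python
import re


def read_phone_intervals(textgrid_text):
    """Extract (xmin, xmax, label) triples from a long-format TextGrid string.
    Minimal sketch: assumes a single tier in long (non-short) TextGrid format."""
    pattern = re.compile(
        r'xmin = ([\d.]+)\s*\n\s*xmax = ([\d.]+)\s*\n\s*text = "([^"]*)"'
    )
    return [(float(a), float(b), t) for a, b, t in pattern.findall(textgrid_text)]
```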
Wav2TextGrid was trained on:
- Demographics: Children ages 3–7 years
- Dataset: 3,700 utterances from Test of Childhood Stuttering (TOCS)
- Duration: Short utterances (~2–5 seconds)
- Quality: Clean, non-conversational audio
- ✅ Child speech (ages 3–7)
- ✅ Short, clean utterances
- ✅ Read speech (non-conversational)
- ✅ High-quality audio recordings
- ⚠️ Adult speech (not validated after fine-tuning)
- ⚠️ Long utterances (>5–10 seconds)
- ⚠️ Conversational or spontaneous speech
- ⚠️ Noisy audio recordings
- ⚠️ Mixed adult/child conversations
Recommendation: Always validate alignments when using outside the intended scope.
- Hugging Face Model: pkadambi/wav2textgrid
- PyPI Package: Wav2TextGrid
- Issues: GitHub Issues
- Contact: pkadambi@asu.edu
If you use Wav2TextGrid in your research, please cite:
```bibtex
@software{kadambi2024wav2textgrid,
  title={Wav2TextGrid: A Phonetic Forced Aligner},
  author={Kadambi, Prad},
  year={2024},
  url={https://github.com/pkadambi/Wav2TextGrid},
  version={0.1.2}
}
```

This project is licensed under the MIT License; see the LICENSE file for details.
- Based on techniques from Charsiu by lingjzhu and henrynomeland
- Initialized with models trained on CommonVoice and LibriSpeech datasets
- Fine-tuned on Test of Childhood Stuttering (TOCS) corpus
- GUI application for Windows (in progress; expected by 05/31)
- Add training functionality to the GUI (longer term; likely by end of June)
- Our paper can be found here: https://doi.org/10.1044/2024_JSLHR-24-00347
BibTeX
```bibtex
@article{kadambi2025tunable,
  title={A Tunable Forced Alignment System Based on Deep Learning: Applications to Child Speech},
  author={Kadambi, Prad and Mahr, Tristan J and Hustad, Katherine C and Berisha, Visar},
  journal={Journal of Speech, Language, and Hearing Research},
  pages={1--19},
  year={2025},
  publisher={American Speech-Language-Hearing Association}
}
```
MLA
Kadambi, Prad, et al. "A Tunable Forced Alignment System Based on Deep Learning: Applications to Child Speech." Journal of Speech, Language, and Hearing Research (2025): 1-19.
APA
Kadambi, P., Mahr, T. J., Hustad, K. C., & Berisha, V. (2025). A Tunable Forced Alignment System Based on Deep Learning: Applications to Child Speech. Journal of Speech, Language, and Hearing Research, 1-19.
Uses g2p_en (CMUdict) for text-to-phoneme conversion.
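The idea behind CMUdict-style G2P is a per-word lexicon lookup with a fallback for out-of-vocabulary words. This sketch uses a tiny hand-picked subset standing in for the full dictionary; the package itself relies on `g2p_en`, and `text_to_phones` is an illustrative helper.

```python
# Illustrative only: a tiny hand-picked subset of CMUdict entries.
MINI_CMUDICT = {
    "SHE": ["SH", "IY1"],
    "HAD": ["HH", "AE1", "D"],
    "YOUR": ["Y", "AO1", "R"],
}


def text_to_phones(text, lexicon=MINI_CMUDICT, oov="<unk>"):
    """CMUdict-style lookup: one phone list per word, OOV words flagged."""
    phones = []
    for word in text.upper().split():
        phones.extend(lexicon.get(word, [oov]))
    return phones
```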
- Voice Activity Detection (VAD): `speechbrain/vad-crdnn-libriparty` (Hugging Face)
- xVector speaker embeddings: `speechbrain/spkrec-ecapa-voxceleb` (Hugging Face)
- Frame-level phoneme prediction via Wav2Vec2 (phoneme posterior distributions)
- Alignment via Viterbi decoding (10 ms granularity)
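The Viterbi step can be sketched as a monotonic dynamic program over frame-level phoneme log-posteriors: at each frame (e.g. one per 10 ms), the alignment path either stays on the current phone or advances to the next. This is an illustrative simplification of the decoding described above, not the package's implementation.

```python
def viterbi_align(log_post, phones):
    """Monotonic forced alignment by dynamic programming (sketch).

    log_post: list of frames, each a dict mapping phone -> log-posterior.
    phones:   the phoneme sequence for the utterance (requires len(log_post)
              >= len(phones), since each phone occupies at least one frame).
    Returns the best-scoring phone label per frame.
    """
    T, N = len(log_post), len(phones)
    NEG = float("-inf")
    # score[t][i] = best log-prob of aligning frames 0..t with phones 0..i
    score = [[NEG] * N for _ in range(T)]
    back = [[0] * N for _ in range(T)]
    score[0][0] = log_post[0][phones[0]]
    for t in range(1, T):
        for i in range(N):
            stay = score[t - 1][i]
            advance = score[t - 1][i - 1] if i > 0 else NEG
            back[t][i] = i if stay >= advance else i - 1
            score[t][i] = max(stay, advance) + log_post[t][phones[i]]
    # Trace back from the final phone at the final frame
    path = [N - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return [phones[i] for i in path]
```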
This work was supported by the National Institute on Deafness and Other Communication Disorders of the National Institutes of Health under award number R01 DC019645.
Portions of this project adapt code from the excellent Charsiu aligner (https://github.com/lingjzhu/charsiu).