Chat data cleaning, filtering and deduplication pipeline.
-
Updated
Jul 25, 2023 - Python
Chat data cleaning, filtering and deduplication pipeline.
Dolphin 3.0 🐬: Versatile AI for coding, math, and more
About working Propmting in OpenAI models, it is also used with deffrent pettren Alpaca prompt, INST prompt
A Python-based interactive CLI interface for chatting with Hugging Face language models, optimized for casual, Discord-style conversation using ChatML. Supports both quantized and full-precision models, live token streaming with color formatting, and dynamic generation parameter adjustment.
Standardized spec and vendor-specific transforms for ChatML
A dataset toolbox for preparing and analyzing conversational datasets, including CSV splitting, CSV → Parquet conversion, dataset statistics, Parquet cleaning and sorting, HuggingFace–style metadata generation, and batched chain insertion into PostgreSQL — with Rich progress, multiprocessing, and 32 GB-RAM-friendly batching.
Deepseek-Dataset-Generator creates conversational datasets for LLM fine-tuning via DeepSeek API. Supports various formats (ChatML, ShareGPT, Alpaca, JSON, CSV), easy configuration via YAML and detailed logs. Ideal for generating realistic and customized data quickly.
Week 5 project: build a hybrid retriever that fuses FAISS dense vectors with SQLite FTS5/BM25 keyword search (RRF/weighted fusion), plus a Supervised Fine-Tuning (SFT) pipeline (Full FT vs LoRA/QLoRA) using TRL/PEFT/DeepSpeed.
138M param ChatML training stack optimized for Apple Silicon via MLX. Features a curated Quality2K continuation curriculum and v18 SFT alignment.
Add a description, image, and links to the chatml topic page so that developers can more easily learn about it.
To associate your repository with the chatml topic, visit your repo's landing page and select "manage topics."