MOSS-VL is the core multimodal model series within the OpenMOSS ecosystem, dedicated to advancing visual understanding. To tackle the inherent complexities of video comprehension, our roadmap pursues a systematic scaling strategy along three key dimensions:
- Data Scaling: Curating massive-scale, high-quality multimodal datasets to drive robust generalization.
- Parameter Scaling: Expanding model capacity to capture intricate vision-language correlations.
- Context Scaling: Extending temporal horizons to enable reasoning over long-form video content.
- News
- Model Architecture
- Absolute Timestamps
- Cross-attention RoPE (XRoPE)
- Demo
- Training Strategy
- Evaluation Results
- Quick Start
- Model Download
- Roadmap & TODO List
- Citation
- 2026/04/08: Released MOSS-VL-Base-0408 and MOSS-VL-Instruct-0408.
- 2026/04/03: Finished both pre-training and SFT for MOSS-VL.
- 2025/10/18: Kicked off the MOSS-VL project.
- 2025/09/30: Finished training MOSS-Video-Preview.
MOSS-VL adopts a cross-attention-based architecture that decouples visual encoding from cognitive reasoning. This design significantly reduces latency, enabling instantaneous responses to dynamic video streams. Natively supporting interleaved modalities, it processes complex sequences of images and videos within a unified pipeline, eliminating the need for heavy pre-processing.
Figure 1: Overall architecture of MOSS-VL.
To ensure the model accurately perceives the pacing and duration of events, MOSS-VL injects absolute timestamps alongside each sampled frame, grounding the reasoning process in a precise temporal reference.
Figure 2: Illustration of timestamped video sequence input.
Each video is interleaved with precise time markers, where each timestamp is wrapped by dedicated special tokens (<|time_start|> … <|time_end|>) that explicitly anchor the temporal location of every visual frame:
<|im_start|><|vision_start|>
<|time_start|>0.0 seconds<|time_end|><|image_pad|>
<|time_start|>1.2 seconds<|time_end|><|image_pad|>
<|time_start|>2.3 seconds<|time_end|><|image_pad|>
...
<|vision_end|>The video shows a dynamic scene with continuous actions...<|im_end|>
Why this matters:
- Adaptability to Variable FPS: The use of explicit timestamps allows the model to handle non-uniform sampling rates without loss of temporal context.
- Precise Temporal Analysis: Absolute time unlocks fine-grained action localization, grounding every response in exact temporal coordinates.
- Motion Dynamics: By exposing time intervals ($dt$), the model can reason about movement physics, enabling accurate estimation of velocity, acceleration, and trajectory.
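To make the input format concrete, the timestamped sequence shown above can be assembled from a list of sampled frame times with a few lines of Python. This is a minimal sketch: the helper name and the one-decimal seconds formatting are our assumptions, while the special tokens follow the template above.

```python
def build_timestamped_sequence(frame_times, caption):
    """Interleave <|time_start|>...<|time_end|> markers with frame placeholders.

    frame_times: absolute timestamps (in seconds) of the sampled frames.
    They need not be uniformly spaced, which is what lets the model cope
    with variable-FPS sampling and reason over explicit intervals (dt).
    """
    parts = ["<|im_start|><|vision_start|>\n"]
    for t in frame_times:
        parts.append(f"<|time_start|>{t:.1f} seconds<|time_end|><|image_pad|>\n")
    parts.append(f"<|vision_end|>{caption}<|im_end|>")
    return "".join(parts)

# Non-uniform sampling: the gaps between 0.0, 1.2, and 2.3 stay visible to the model.
seq = build_timestamped_sequence([0.0, 1.2, 2.3], "The video shows ...")
print(seq.count("<|image_pad|>"))  # one placeholder per frame -> 3
```

Each frame placeholder is anchored by its own timestamp pair, so downstream processing never has to infer frame times from position alone.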
MOSS-VL utilizes Cross-attention Rotary Position Embedding (XRoPE), tailored to its cross-attention-based vision-language architecture. This mechanism maps text tokens and video patches into a unified 3D coordinate space defined by Time (t), Height (h), and Width (w).
Figure 3: MOSS-VL with Cross-attention RoPE (XRoPE).
To optimize cross-modal alignment, XRoPE is injected into the vision Key (K) for position-awareness while leaving the Value (V) untouched to preserve feature fidelity. In parallel, it is applied to the text Query (Q), allowing the model to probe arbitrary spatio-temporal regions through direct coordinate alignment.
Why this matters
- Unified Modality Modeling: By expressing time as a shared dimension across both language and video, XRoPE enables seamless, cohesive video-text reasoning within a single coordinate system.
- Precise Grounding: Aligned ($t, h, w$) coordinates empower the model to localize small objects and transient actions anywhere in the 3D video volume, down to the patch and the moment.
- Dynamic Input Support: The 3D grid natively accommodates arbitrary aspect ratios and resolutions, eliminating the need for fixed-length padding or rigid input constraints.
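The shared (t, h, w) coordinate space can be illustrated with a toy 3D rotary embedding in NumPy: the head dimension is split into three equal groups, and each group is rotated by one coordinate. This is a simplified sketch of the general RoPE mechanism, not the exact MOSS-VL implementation; all function names and the equal three-way dimension split are our assumptions.

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Rotate consecutive feature pairs of x by angles proportional to pos
    (the standard 1D rotary position embedding)."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)   # (d/2,) frequency ladder
    ang = pos[:, None] * freqs[None, :]          # (n, d/2) rotation angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def xrope_3d(x, t, h, w):
    """Split the head dim into three groups and rotate each by one of the
    (t, h, w) coordinates, placing every token/patch on a unified 3D grid."""
    d = x.shape[-1] // 3
    return np.concatenate([
        rope_1d(x[..., :d], t),        # group 1 encodes time
        rope_1d(x[..., d:2 * d], h),   # group 2 encodes height
        rope_1d(x[..., 2 * d:], w),    # group 3 encodes width
    ], axis=-1)

# Two vision patches at the same (h, w) location but different times end up
# with different keys, so a text query can distinguish *when*, not just *where*.
patches = np.ones((2, 12))
keys = xrope_3d(patches, t=np.array([0.0, 5.0]), h=np.zeros(2), w=np.zeros(2))
print(np.allclose(keys[0], keys[1]))  # False: the time dimension separates them
```

In the architecture described above, such a rotation would be applied to the vision Key and the text Query only, leaving the Value untouched to preserve feature fidelity.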
- Videocounting_demo.mp4
- Videocaption_demo.mp4
For more examples, please visit our Interactive Demo Page.
MOSS-VL is trained using a multi-stage approach to progressively build multimodal capabilities.
Figure 4: Overall training data distribution of MOSS-VL.
MOSS-VL is pre-trained via a systematic four-stage curriculum that progressively builds up multimodal capabilities from the ground up:
- Stage 1 - Vision-Language Alignment: Establishes the initial bridge between visual features and the language space. Training on large-scale image-text pairs, the model learns to associate visual concepts with their textual counterparts while developing foundational OCR skills for text-in-image understanding.
- Stage 2 - Large-Scale Multimodal Pre-training: Scales up exposure to massive, diverse multimodal corpora, broadening the model's grasp of world knowledge and complex scenes and laying a robust foundation for general-purpose intelligence and high-resolution perception. In addition, short video clips are introduced at this stage to seed preliminary video understanding.
- Stage 3 - High-Quality Multimodal Pre-training: Elevates overall model quality by training on large volumes of high-quality perception, understanding, and reasoning data. This phase combines fine-grained image perception, complex multi-image comprehension, and high-fidelity video reasoning to sharpen the model's ability to capture intricate visual details and master temporal relationships across rich multimodal inputs.
- Stage 4 - Annealing & Long-Context Extrapolation: Stretches the model's horizon toward long-form video understanding, while a carefully designed annealing schedule trains on curated, top-tier multimodal data to push final performance to its peak.
| Stage | Strategy | Data Composition |
|---|---|---|
| 1 | Vision-Language Alignment | ![]() |
| 2 | Large-Scale Multimodal Pre-training | ![]() |
| 3 | High-Quality Multimodal Pre-training | ![]() |
| 4 | Annealing & Long-Context Extrapolation | ![]() |
Building on the pre-trained foundation, MOSS-VL is further refined through Supervised Fine-Tuning (SFT) to align with human intent and unlock its full interactive and instruction-following capabilities.
Figure 5: Data composition of MOSS-VL SFT.
Note
MOSS-VL is currently undergoing RLHF training. Stay tuned for updates.
We conducted a comprehensive evaluation of MOSS-VL across four key dimensions: Multimodal Perception, Multimodal Reasoning, Document/OCR, and Video Understanding. The results demonstrate that MOSS-VL achieves outstanding performance, particularly excelling in general multimodal perception and complex video analysis.
The table below reports benchmark scores on a 0-100 scale. Across the board, MOSS-VL consistently ranks first or second when compared against industry-leading baselines such as Qwen2.5-VL and Qwen3-VL.
Figure 6: Detailed benchmark comparison between MOSS-VL and Qwen series.
- Leading Video Intelligence: MOSS-VL achieves a score of 65.8 in Video Understanding, outperforming Qwen3-VL by 2 points. It shows exceptional temporal consistency and action recognition capabilities across benchmarks like `VideoMME`, `MLVU`, `EgoSchema`, and `VSI-bench` (where it outperforms Qwen3-VL-8B-Instruct by 8.3 points).
- Outstanding Multimodal Perception: MOSS-VL delivers excellent general image-text understanding, shining in fine-grained object recognition and spatial reasoning on benchmarks like `BLINK` and `MMBench`.
- Robust Multimodal Reasoning: MOSS-VL demonstrates solid logical inference, staying highly competitive with the latest Qwen series on challenging reasoning suites such as `CVBench` and `VisuLogic`.
- Reliable Document Understanding: While the model is primarily optimized for general perception and video, MOSS-VL still delivers 83.9 on OCR and document analysis, ensuring dependable extraction of text and structured information.
The chart below visualizes MOSS-VL's balanced and well-rounded capability profile across 30+ specialized benchmarks. Represented by the solid blue region, MOSS-VL achieves the broadest overall coverage, with particularly strong showings in the Video Understanding and Multimodal Perception quadrants.
Figure 7: Benchmark analysis of MOSS-VL.
```shell
conda create -n moss_vl python=3.12 pip -y
conda activate moss_vl
pip install -i https://pypi.org/simple --no-build-isolation -r requirements.txt
```

For complete runnable examples and demo assets, see inference/README.md.
```python
import queue
import threading

import torch
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "/path/to/dummy-checkpoint"
processor = AutoProcessor.from_pretrained(
    checkpoint,
    trust_remote_code=True,
    frame_extract_num_threads=1,
)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

query = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": "path/to/example.jpg"},
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ],
    "media_kwargs": {},
    "generate_kwargs": {
        "max_new_tokens": 256,
        "do_sample": False,
        "vision_chunked_length": 64,
    },
}

# offline_generate runs in a background thread, consuming queries from
# input_queue and streaming text chunks to output_queue.
input_queue = queue.Queue()
output_queue = queue.Queue()
worker = threading.Thread(
    target=model.offline_generate,
    args=(processor, input_queue, output_queue),
    kwargs={"vision_chunked_length": 64},
    daemon=True,
)
worker.start()

input_queue.put(query)

# Collect streamed chunks; <|round_start|> / <|round_end|> delimit one reply.
text_chunks = []
while True:
    item = output_queue.get()
    if item == "<|round_start|>":
        continue
    if item == "<|round_end|>":
        break
    text_chunks.append(item)
print("".join(text_chunks))

# Signal the worker to shut down cleanly.
input_queue.put({"stop_offline_generate": True})
worker.join()
```

For simple batched offline inference, you can also use `offline_batch_generate`:
```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "/path/to/dummy-checkpoint"
processor = AutoProcessor.from_pretrained(
    checkpoint,
    trust_remote_code=True,
    frame_extract_num_threads=1,
)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

queries = [
    {
        "messages": [
            {
                "role": "user",
                "content": [{"type": "text", "text": "Describe sample A."}],
            }
        ],
        "media_kwargs": {},
        "generate_kwargs": {"max_new_tokens": 256, "do_sample": False},
    },
    {
        "messages": [
            {
                "role": "user",
                "content": [{"type": "text", "text": "Describe sample B."}],
            }
        ],
        "media_kwargs": {},
        "generate_kwargs": {"max_new_tokens": 256, "do_sample": False},
    },
]

# Batched generation: all queries are processed in one call.
with torch.no_grad():
    result = model.offline_batch_generate(
        processor,
        queries,
        vision_chunked_length=64,
    )

texts = [item["text"] for item in result["results"]]
print(texts)
```

We provide a lightweight SFT framework built on the HuggingFace `transformers.Trainer`. It supports both full-parameter training and LoRA, with the vision encoder, language model, and LM head independently controllable.
```shell
# Full-parameter SFT (vision encoder frozen by default)
bash mossvl_finetune/scripts/run_sft.sh

# LoRA SFT
pip install -i https://pypi.org/simple peft
bash mossvl_finetune/scripts/run_sft_lora.sh
```

Training data uses a simple JSON format compatible with the inference query structure; just add a response field:
```json
[
  {
    "prompt": "Describe this image.",
    "response": "A beautiful landscape with mountains.",
    "images": ["path/to/image.jpg"],
    "videos": []
  }
]
```

Multi-turn conversations are also supported. See mossvl_finetune/README.md for full documentation.
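Before launching a run, it can be handy to sanity-check a data file against this schema. The snippet below is a small illustrative helper we wrote for this README, not part of the MOSS-VL codebase; the field names come from the example above, everything else is an assumption.

```python
import json

REQUIRED = {"prompt", "response"}
OPTIONAL = {"images", "videos"}

def validate_sft_file(path):
    """Check that every record carries the prompt/response fields and that
    media fields, when present, are lists of paths. Returns the record count."""
    with open(path) as f:
        records = json.load(f)
    assert isinstance(records, list), "top level must be a JSON array"
    for i, rec in enumerate(records):
        missing = REQUIRED - rec.keys()
        assert not missing, f"record {i} missing fields: {missing}"
        for key in OPTIONAL & rec.keys():
            assert isinstance(rec[key], list), f"record {i}: {key} must be a list"
    return len(records)
```

Running `validate_sft_file("train.json")` before training catches malformed records early instead of mid-epoch.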
| Model | Download Link | ModelScope Link |
|---|---|---|
| MOSS-VL-Base-0408 | HuggingFace | ModelScope |
| MOSS-VL-Instruct-0408 | HuggingFace | ModelScope |
- Core Architecture: Implementation of Cross-attention RoPE (XRoPE).
- High-performance Infra: Integrated Megatron-LM + CUDA Flash Attention 3.
- Model Release: Open-sourced `MOSS-VL-Base` and `MOSS-VL-Instruct`.
- Inference: Inference code for both image and video understanding.
- Training Engine: Full training code for MOSS-VL.
- Real-time Capabilities: Specialized real-time video understanding model.
- RL Post-training: Reinforcement learning for the MOSS-VL series.
- Documentation: Comprehensive technical report.
We would like to express our gratitude to NVIDIA for the Megatron-LM framework and the Qwen Team for their powerful Qwen series language models, which serve as the foundation of our training infrastructure and core LLM.
```bibtex
@misc{moss_vl_2026,
  title = {{MOSS-VL Technical Report}},
  author = {OpenMOSS Team},
  year = {2026},
  howpublished = {\url{https://github.com/OpenMOSS/MOSS-VL}},
  note = {GitHub repository}
}
```

Built with ❤️ by the OpenMOSS Team




