

English | δΈ­ζ–‡

MOSS-VL

MOSS-VL is the core multimodal model series within the OpenMOSS ecosystem, dedicated to advancing visual understanding. To tackle the inherent complexities of video comprehension, our roadmap pursues a systematic scaling strategy along three key dimensions:

  • πŸ“ˆ Data Scaling: Curating massive-scale, high-quality multimodal datasets to drive robust generalization.
  • 🧠 Parameter Scaling: Expanding model capacity to capture intricate vision-language correlations.
  • ⏳ Context Scaling: Extending temporal horizons to enable reasoning over long-form video content.


πŸ”₯ News

  • 2026/04/08: πŸš€ Released MOSS-VL-Base-0408 and MOSS-VL-Instruct-0408.
  • 2026/04/03: πŸ† Finished both pre-training and SFT for MOSS-VL.
  • 2025/10/18: πŸ” Kicked off the MOSS-VL project.
  • 2025/09/30: ✨ Finished training MOSS-Video-Preview .

πŸ—οΈ Model Architecture

MOSS-VL adopts a cross-attention-based architecture that decouples visual encoding from cognitive reasoning. This design significantly reduces latency, enabling instantaneous responses to dynamic video streams. Natively supporting interleaved modalities, it processes complex sequences of images and videos within a unified pipeline β€” eliminating the need for heavy pre-processing.

MOSS-VL Architecture
Figure 1: Overall architecture of MOSS-VL.
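To make the decoupling concrete, here is a minimal single-head sketch of cross-attention in pure Python: text-side queries attend over vision-side keys and values produced by a separate encoder. All sizes and names are illustrative assumptions; the actual MOSS-VL layer shapes and projections are not published in this README.

```python
import math
import random

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_attention(text_q, vision_k, vision_v):
    """Each text query attends over all vision keys, mixing vision values."""
    d = len(text_q[0])
    out = []
    for q in text_q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in vision_k]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, vision_v))
                    for j in range(d)])
    return out

rng = random.Random(0)
n_text, n_patches, d = 4, 16, 8  # made-up sizes for illustration
mk = lambda n: [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
out = cross_attention(mk(n_text), mk(n_patches), mk(n_patches))
print(len(out), len(out[0]))  # 4 8
```

Because the vision encoder runs independently and only its keys/values are consumed here, visual features can be pre-computed per frame while the language side stays lightweight, which is what enables low-latency responses on streams.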


🧩 Absolute Timestamps

To ensure the model accurately perceives the pacing and duration of events, MOSS-VL injects absolute timestamps alongside each sampled frame, grounding the reasoning process in a precise temporal reference.

πŸ“₯ Input Representation

Timestamped Sequence Input Illustration
Figure 2: Illustration of timestamped video sequence input.

Each video is interleaved with precise time markers, where each timestamp is wrapped by dedicated special tokens (<|time_start|> … <|time_end|>) that explicitly anchor the temporal location of every visual frame:

<|im_start|><|vision_start|>
<|time_start|>0.0 seconds<|time_end|><|image_pad|>
<|time_start|>1.2 seconds<|time_end|><|image_pad|>
<|time_start|>2.3 seconds<|time_end|><|image_pad|>
...
<|vision_end|>The video shows a dynamic scene with continuous actions...<|im_end|>
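The vision segment of this template can be rendered mechanically from a list of frame timestamps. The helper below is hypothetical, not the actual MOSS-VL processor API; only the special-token strings are taken from the template above.

```python
def format_timestamped_frames(timestamps_s):
    # Hypothetical helper: renders the vision segment of the template,
    # one <|time_start|>...<|time_end|><|image_pad|> line per sampled frame.
    parts = ["<|im_start|><|vision_start|>"]
    for t in timestamps_s:
        parts.append(f"<|time_start|>{t:.1f} seconds<|time_end|><|image_pad|>")
    parts.append("<|vision_end|>")
    return "\n".join(parts)

print(format_timestamped_frames([0.0, 1.2, 2.3]))
```

Note the frame times need not be uniform; the sequence simply records whatever timestamps the sampler produced.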

🌟 Why this matters:

  • Adaptability to Variable FPS: The use of explicit timestamps allows the model to handle non-uniform sampling rates without loss of temporal context.
  • Precise Temporal Analysis: Absolute time unlocks fine-grained action localization, grounding every response in exact temporal coordinates.
  • Motion Dynamics: By exposing time intervals ($dt$), the model can reason about movement physics, enabling accurate estimation of velocity, acceleration, and trajectory.

🧬 Cross-attention RoPE (XRoPE)

MOSS-VL utilizes Cross-attention Rotary Position Embedding (XRoPE), tailored to its cross-attention-based vision–language architecture. This mechanism maps text tokens and video patches into a unified 3D coordinate space defined by Time (t), Height (h), and Width (w).

MOSS-VL mRoPE Architecture Illustration
Figure 3: MOSS-VL with Cross-attention RoPE (XRoPE).

To optimize cross-modal alignment, XRoPE is injected into the vision Key (K) for position-awareness while leaving the Value (V) untouched to preserve feature fidelity. In parallel, it is applied to the text Query (Q), allowing the model to probe arbitrary spatio-temporal regions through direct coordinate alignment.

🌟 Why this matters

  • Unified Modality Modeling β€” By expressing time as a shared dimension across both language and video, XRoPE enables seamless, cohesive video-text reasoning within a single coordinate system.
  • Precise Grounding β€” Aligned ($t, h, w$) coordinates empower the model to localize small objects and transient actions anywhere in the 3D video volume β€” down to the patch and the moment.
  • Dynamic Input Support β€” The 3D grid natively accommodates arbitrary aspect ratios and resolutions, eliminating the need for fixed-length padding or rigid input constraints.

🎬 Demo

  • counting_demo.mp4
  • caption_demo.mp4

For more examples, please visit our Interactive Demo Page πŸš€

πŸ“Š Training Strategy

MOSS-VL is trained using a multi-stage approach to progressively build multimodal capabilities.

MOSS-VL Training Data Distribution
Figure 4: Overall training data distribution of MOSS-VL.

Pre-training (PT)

MOSS-VL is pre-trained via a systematic four-stage curriculum that progressively builds up multimodal capabilities from the ground up:

  • Stage 1 β€” Vision-Language Alignment β€” Establishes the initial bridge between visual features and the language space. Training on large-scale image-text pairs, the model learns to associate visual concepts with their textual counterparts while developing foundational OCR skills for text-in-image understanding.

  • Stage 2 β€” Large-Scale Multimodal Pre-training β€” Scales up exposure to massive, diverse multimodal corpora, broadening the model's grasp of world knowledge and complex scenes β€” laying a robust foundation for general-purpose intelligence and high-resolution perception. In addition, short video clips are introduced at this stage to seed preliminary video understanding.

  • Stage 3 β€” High-Quality Multimodal Pre-training β€” Elevates overall model quality by training on large volumes of high-quality perception, understanding, and reasoning data. This phase combines fine-grained image perception, complex multi-image comprehension, and high-fidelity video reasoning to sharpen the model's ability to capture intricate visual details and master temporal relationships across rich multimodal inputs.

  • Stage 4 β€” Annealing & Long-Context Extrapolation β€” Stretches the model's horizon toward long-form video understanding, while a carefully designed annealing schedule trains on curated, top-tier multimodal data to push final performance to its peak.

Stage  Strategy
1      Vision-Language Alignment
2      Large-Scale Multimodal Pre-training
3      High-Quality Multimodal Pre-training
4      Annealing & Long-Context Extrapolation

Supervised Fine-Tuning (SFT)

Building on the pre-trained foundation, MOSS-VL is further refined through Supervised Fine-Tuning (SFT) to align with human intent and unlock its full interactive and instruction-following capabilities.

MOSS-VL SFT Data Composition
Figure 5: Data composition of MOSS-VL SFT.

Reinforcement Learning from Human Feedback (RLHF)

Note

MOSS-VL is currently undergoing RLHF training. Stay tuned for updates.


πŸ“Š Evaluation Results

We conducted a comprehensive evaluation of MOSS-VL across four key dimensions: Multimodal Perception, Multimodal Reasoning, Document/OCR, and Video Understanding. The results demonstrate that MOSS-VL achieves outstanding performance, particularly excelling in general multimodal perception and complex video analysis.

Overall Performance

The table below reports benchmark scores on a 0–100 scale. Across the board, MOSS-VL consistently ranks first or second when compared against industry-leading baselines such as Qwen2.5-VL and Qwen3-VL.

MOSS-VL Benchmark Comparison
Figure 6: Detailed benchmark comparison between MOSS-VL and Qwen series.

Key Highlights

  • πŸš€ Leading Video Intelligence: MOSS-VL achieves a score of 65.8 in Video Understanding, significantly outperforming Qwen3-VL (+2pts). It shows exceptional temporal consistency and action recognition capabilities across benchmarks like VideoMME, MLVU, EgoSchema, and VSI-bench (where it outperforms Qwen3-VL-8B-Instruct by 8.3 points).
  • πŸ‘οΈ Outstanding Multimodal Perception: MOSS-VL delivers excellent general image-text understanding, shining in fine-grained object recognition and spatial reasoning on benchmarks like BLINK and MMBench.
  • 🧠 Robust Multimodal Reasoning: MOSS-VL demonstrates solid logical inference, staying highly competitive with the latest Qwen series on challenging reasoning suites such as CVBench and VisuLogic.
  • πŸ“„ Reliable Document Understanding: While the model is primarily optimized for general perception and video, MOSS-VL still delivers 83.9 on OCR and document analysis, ensuring dependable extraction of text and structured information.

Benchmark Analysis

The chart below visualizes MOSS-VL's balanced and well-rounded capability profile across 30+ specialized benchmarks. Represented by the solid blue region, MOSS-VL achieves the broadest overall coverage, with particularly strong showings in the Video Understanding and Multimodal Perception quadrants.

MOSS-VL Evaluation Radar
Figure 7: Benchmark analysis of MOSS-VL.


πŸš€ Quick Start

Environment Setup

conda create -n moss_vl python=3.12 pip -y
conda activate moss_vl
pip install -i https://pypi.org/simple --no-build-isolation -r requirements.txt

Run Inference

For complete runnable examples and demo assets, see inference/README.md.

import queue
import threading
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "/path/to/dummy-checkpoint"

processor = AutoProcessor.from_pretrained(
    checkpoint,
    trust_remote_code=True,
    frame_extract_num_threads=1,
)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

query = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": "path/to/example.jpg"},
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ],
    "media_kwargs": {},
    "generate_kwargs": {
        "max_new_tokens": 256,
        "do_sample": False,
        "vision_chunked_length": 64,
    },
}

input_queue = queue.Queue()
output_queue = queue.Queue()
worker = threading.Thread(
    target=model.offline_generate,
    args=(processor, input_queue, output_queue),
    kwargs={"vision_chunked_length": 64},
    daemon=True,
)
worker.start()

input_queue.put(query)
text_chunks = []
while True:
    item = output_queue.get()
    # The worker streams text chunks delimited by round markers.
    if item == "<|round_start|>":
        continue
    if item == "<|round_end|>":
        break
    text_chunks.append(item)

print("".join(text_chunks))

input_queue.put({"stop_offline_generate": True})
worker.join()

For simple batched offline inference, you can also use offline_batch_generate:

import torch
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "/path/to/dummy-checkpoint"

processor = AutoProcessor.from_pretrained(
    checkpoint,
    trust_remote_code=True,
    frame_extract_num_threads=1,
)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

queries = [
    {
        "messages": [
            {
                "role": "user",
                "content": [{"type": "text", "text": "Describe sample A."}],
            }
        ],
        "media_kwargs": {},
        "generate_kwargs": {"max_new_tokens": 256, "do_sample": False},
    },
    {
        "messages": [
            {
                "role": "user",
                "content": [{"type": "text", "text": "Describe sample B."}],
            }
        ],
        "media_kwargs": {},
        "generate_kwargs": {"max_new_tokens": 256, "do_sample": False},
    },
]

with torch.no_grad():
    result = model.offline_batch_generate(
        processor,
        queries,
        vision_chunked_length=64,
    )

texts = [item["text"] for item in result["results"]]
print(texts)

Fine-Tuning

We provide a lightweight SFT framework built on the HuggingFace transformers.Trainer. It supports both full-parameter training and LoRA, with the vision encoder, language model, and LM head each independently freezable or trainable.

# Full-parameter SFT (vision encoder frozen by default)
bash mossvl_finetune/scripts/run_sft.sh

# LoRA SFT
pip install -i https://pypi.org/simple peft
bash mossvl_finetune/scripts/run_sft_lora.sh

Training data uses a simple JSON format: each sample pairs a prompt with a response, plus optional lists of image and video paths:

[
  {
    "prompt": "Describe this image.",
    "response": "A beautiful landscape with mountains.",
    "images": ["path/to/image.jpg"],
    "videos": []
  }
]
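The format above can be loaded and sanity-checked with a few lines of standard-library Python. This loader is illustrative only; the field names follow the example, but the function is not the project's actual data-loading code.

```python
import json

def load_sft_samples(text):
    # Illustrative loader: parse the SFT JSON and normalise optional fields.
    samples = json.loads(text)
    for s in samples:
        assert isinstance(s["prompt"], str) and isinstance(s["response"], str)
        s.setdefault("images", [])
        s.setdefault("videos", [])
    return samples

data = load_sft_samples("""[
  {"prompt": "Describe this image.",
   "response": "A beautiful landscape with mountains.",
   "images": ["path/to/image.jpg"]}
]""")
print(len(data))  # 1
```

Treating `images` and `videos` as optional keeps text-only SFT samples valid without padding every record with empty lists.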

Multi-turn conversations are also supported. See mossvl_finetune/README.md for full documentation.


πŸ“₯ Model Download

Model                  Download
MOSS-VL-Base-0408      🤗 HuggingFace · 🤖 ModelScope
MOSS-VL-Instruct-0408  🤗 HuggingFace · 🤖 ModelScope

πŸ“‘ Roadmap & TODO List

βœ… Milestones

  • Core Architecture: Implementation of Cross-attention RoPE (XRoPE).
  • High-performance Infra: Integrated Megatron-LM + CUDA Flash Attention 3.
  • Model Release: Open-sourced MOSS-VL-Base and MOSS-VL-Instruct.
  • Inference: Inference code for both image and video understanding.

πŸš€ Upcoming

  • Training Engine: Full training code for MOSS-VL.
  • Real-time Capabilities: Specialized Real-time Video Understanding Model.
  • RL Post-training: Reinforcement Learning for MOSS-VL series.
  • Documentation: Comprehensive Technical Report.

🀝 Acknowledgement

We would like to express our gratitude to NVIDIA for the Megatron-LM framework and the Qwen Team for their powerful Qwen series language models, which serve as the foundation of our training infrastructure and core LLM.

πŸ“œ Citation

@misc{moss_vl_2026,
  title         = {{MOSS-VL Technical Report}},
  author        = {OpenMOSS Team},
  year          = {2026},
  howpublished  = {\url{https://github.com/OpenMOSS/MOSS-VL}},
  note          = {GitHub repository}
}

🌟 Star History

Star History Chart

Built with ❀️ by the OpenMOSS Team
