Tongyi brings Vibe Coding to all modalities; Qwen3.5-Omni claims 215 SOTA results.

According to 1M AI News monitoring, Tongyi Lab has released its multimodal model Qwen3.5-Omni, which supports text, image, audio, and audio-video inputs and can generate fine-grained audio-video captions with timestamps. Officially, Qwen3.5-Omni-Plus has achieved 215 SOTA results on tasks such as audio and audio-video understanding, reasoning, dialogue, and translation, with capabilities claimed to exceed Gemini-3.1-Pro.

This time, the most notable addition isn't the leaderboard numbers but the "naturally emerging Audio-Visual Vibe Coding capability." Tongyi says the model was never specifically trained for this, yet it can already generate runnable code directly from audio-video instructions. The company also claims a 256K context window, recognition of 113 languages, the ability to process up to 10 hours of audio or 1 hour of video, and native support for WebSearch and complex Function Calls.
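"Complex Function Calls" here refers to tool calling of the kind Qwen's existing APIs expose in the OpenAI-compatible schema. As a hedged sketch of what such a call looks like, the snippet below assembles a tool definition and a chat request offering it to the model; the `get_weather` tool, its parameters, and the `qwen3.5-omni-plus` model id are illustrative assumptions, not details from the release notes.

```python
import json

# Hypothetical tool definition in the OpenAI-compatible function-calling
# schema. The weather lookup and its parameters are illustrative only.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
            },
            "required": ["city"],
        },
    },
}

def build_request(user_text: str) -> dict:
    """Assemble a chat request body that offers the model this tool."""
    return {
        "model": "qwen3.5-omni-plus",  # assumed model id
        "messages": [{"role": "user", "content": user_text}],
        "tools": [get_weather_tool],
    }

payload = build_request("What's the weather in Hangzhou?")
print(json.dumps(payload, ensure_ascii=False, indent=2))
```

When the model decides to call the tool, the response would carry a `tool_calls` entry with the function name and JSON arguments for the client to execute.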

Qwen3.5-Omni continues the Thinker-Talker split architecture, with both components upgraded to Hybrid-Attention MoE. Tongyi offers three sizes, Plus, Flash, and Light, via Alibaba Cloud's Bailian platform, and has also launched a real-time version, Qwen3.5-Omni-Plus-Realtime.
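Bailian serves Qwen models through an OpenAI-compatible chat endpoint, so an audio-video request can be sketched as below. This is a minimal sketch modeled on how earlier Qwen-Omni models are called; the `qwen3.5-omni-flash` model id and the `video_url` content part are assumptions, not confirmed details of this release.

```python
import json

# Sketch of an OpenAI-compatible chat request body for video input,
# patterned after existing Qwen-Omni usage on Bailian (DashScope).
# Model id and content-part shape are assumptions for illustration.
def build_omni_request(video_url: str, prompt: str) -> dict:
    return {
        "model": "qwen3.5-omni-flash",  # assumed model id
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "video_url", "video_url": {"url": video_url}},
                    {"type": "text", "text": prompt},
                ],
            }
        ],
        "stream": True,  # Omni-style models are typically served streaming
    }

req = build_omni_request(
    "https://example.com/demo.mp4",
    "Generate fine-grained captions with timestamps for this video.",
)
print(json.dumps(req, ensure_ascii=False, indent=2))
```

Sending this body to the OpenAI-compatible endpoint with a valid API key would stream back the model's captioning output; the payload construction above is shown without the network call.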
