Tongyi brings Vibe Coding to all modalities; Qwen3.5-Omni claims 215 SOTA results.

According to 1M AI News monitoring, Tongyi Lab has released its multimodal model Qwen3.5-Omni, which supports text, image, audio, and audio-video inputs and can generate fine-grained audio-video captions with timestamps. Officially, Qwen3.5-Omni-Plus has achieved 215 SOTA results on tasks such as audio and audio-video understanding, reasoning, dialogue, and translation, with capabilities claimed to exceed Gemini-3.1-Pro.

This time, the most notable addition isn't the leaderboard numbers but the "naturally emerging Audio-Visual Vibe Coding capability." Tongyi says the model was never specifically trained for this, yet it can already generate runnable code directly from audio-video instructions. The company also claims a 256K context window, recognition of 113 languages, the ability to process up to 10 hours of audio or 1 hour of video, and native support for WebSearch and complex Function Calls.
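"Complex Function Calls" here refers to tool calling of the kind Qwen's existing APIs expose in the OpenAI-compatible schema. As a hedged sketch of what such a call looks like, the snippet below assembles a tool definition and a chat request offering it to the model; the `get_weather` tool, its parameters, and the `qwen3.5-omni-plus` model id are illustrative assumptions, not details from the release notes.

```python
import json

# Hypothetical tool definition in the OpenAI-compatible function-calling
# schema. The weather lookup and its parameters are illustrative only.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
            },
            "required": ["city"],
        },
    },
}

def build_request(user_text: str) -> dict:
    """Assemble a chat request body that offers the model this tool."""
    return {
        "model": "qwen3.5-omni-plus",  # assumed model id
        "messages": [{"role": "user", "content": user_text}],
        "tools": [get_weather_tool],
    }

payload = build_request("What's the weather in Hangzhou?")
print(json.dumps(payload, ensure_ascii=False, indent=2))
```

When the model decides to call the tool, the response would carry a `tool_calls` entry with the function name and JSON arguments for the client to execute.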

Qwen3.5-Omni continues the Thinker-Talker split architecture, with both components upgraded to Hybrid-Attention MoE. Tongyi offers three sizes, Plus, Flash, and Light, via Alibaba Cloud's Bailian platform, and has also launched a real-time version, Qwen3.5-Omni-Plus-Realtime.
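Bailian serves Qwen models through an OpenAI-compatible chat endpoint, so an audio-video request can be sketched as below. This is a minimal sketch modeled on how earlier Qwen-Omni models are called; the `qwen3.5-omni-flash` model id and the `video_url` content part are assumptions, not confirmed details of this release.

```python
import json

# Sketch of an OpenAI-compatible chat request body for video input,
# patterned after existing Qwen-Omni usage on Bailian (DashScope).
# Model id and content-part shape are assumptions for illustration.
def build_omni_request(video_url: str, prompt: str) -> dict:
    return {
        "model": "qwen3.5-omni-flash",  # assumed model id
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "video_url", "video_url": {"url": video_url}},
                    {"type": "text", "text": prompt},
                ],
            }
        ],
        "stream": True,  # Omni-style models are typically served streaming
    }

req = build_omni_request(
    "https://example.com/demo.mp4",
    "Generate fine-grained captions with timestamps for this video.",
)
print(json.dumps(req, ensure_ascii=False, indent=2))
```

Sending this body to the OpenAI-compatible endpoint with a valid API key would stream back the model's captioning output; the payload construction above is shown without the network call.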
