Intel llm-scaler-vllm Beta 1.2 Brings Support For New AI Models On Arc Graphics
- News link: https://www.phoronix.com/news/Intel-llm-scaler-vllm-1.2-beta
Following yesterday's release of [1]a new llm-scaler-omni beta, there is now a new beta feature release of llm-scaler-vllm, which provides the Intel-optimized version of vLLM within a Docker container that is ready to go for AI on modern Arc Graphics hardware. Today's llm-scaler-vllm 1.2 beta release adds support for a variety of additional large language models (LLMs) along with other improvements.
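For a sense of how the container is typically used: the llm-scaler-vllm image serves vLLM's usual OpenAI-compatible HTTP API, so once it is running a client can query it with a few lines of Python. The sketch below is a minimal illustration only, assuming the server is listening on localhost:8000; the host, port, and model identifier are placeholders rather than values documented by this release.

    import requests

    # Assumptions: the llm-scaler-vllm container is already running and its
    # OpenAI-compatible server is reachable on localhost:8000; the model name
    # below is a placeholder for whichever model the container was launched with.
    BASE_URL = "http://localhost:8000/v1"
    MODEL = "Qwen/Qwen3-30B-A3B"  # placeholder model identifier

    resp = requests.post(
        f"{BASE_URL}/chat/completions",
        json={
            "model": MODEL,
            "messages": [
                {"role": "user", "content": "Summarize what vLLM does in one sentence."}
            ],
            "max_tokens": 128,
        },
        timeout=120,
    )
    resp.raise_for_status()
    print(resp.json()["choices"][0]["message"]["content"])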
llm-scaler-vllm continues to be Intel's preferred route for customers wanting to leverage vLLM for AI workloads on Intel discrete graphics hardware. This new llm-scaler-vllm 1.2 beta release brings support for new models plus other enhancements to benefit the Intel vLLM experience:
- Fix 72-hour hang issue
- MoE-Int4 support for Qwen3-30B-A3B
- Bpe-Qwen tokenizer support
- Enable Qwen3-VL Dense/MoE models
- Enable Qwen3-Omni models
- MinerU 2.5 Support
- Enable whisper transcription models
- Fix minicpmv4.5 OOM issue and output error
- Enable ERNIE-4.5-vl models
- Enable Glyph based GLM-4.1V-9B-Base
- Attention kernel optimizations for decoding phases across all workloads (>10% end-to-end throughput gain on 10+ models across all input/output sequence lengths)
- GPT-OSS 20B and 120B support in MXFP4 with optimized performance
- MoE model optimizations for output throughput: Qwen3-30B-A3B 2.6x end-to-end improvement; DeepSeek-V2-Lite 1.5x improvement
- New models: added 8 multi-modality models with image/video support
- vLLM 0.10.2 with new features: P/D disaggregation (experimental), tooling, reasoning output, and structured output (see the sketch after this list)
- FP16/BF16 GEMM optimizations for batch sizes 1-128, with notable improvements for small batch sizes
- Bug fixes
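Of the vLLM 0.10.2 features listed above, structured output is the one that most directly changes how clients call the server. The sketch below shows what that can look like through the OpenAI-style response_format field that vLLM's server mirrors; treat it as a hedged illustration rather than a confirmed example from this particular containerized build, with the host, port, and model name again being placeholders.

    import json
    import requests

    # Assumptions as before: server on localhost:8000, placeholder model name.
    BASE_URL = "http://localhost:8000/v1"
    MODEL = "Qwen/Qwen3-30B-A3B"  # placeholder

    resp = requests.post(
        f"{BASE_URL}/chat/completions",
        json={
            "model": MODEL,
            "messages": [
                {"role": "user",
                 "content": "Return a JSON object with keys 'model_family' and 'parameter_count' for Qwen3-30B-A3B."}
            ],
            # Structured output: ask the server to constrain generation to valid JSON.
            # vLLM's OpenAI-compatible server follows the OpenAI response_format convention.
            "response_format": {"type": "json_object"},
            "max_tokens": 256,
        },
        timeout=120,
    )
    resp.raise_for_status()
    print(json.loads(resp.json()["choices"][0]["message"]["content"]))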
This work will be especially important for next year's [2]Crescent Island hardware release.
More details on the new beta release are available via [3]GitHub, while the llm-scaler-vllm Docker container is available from the Docker Hub container image library.
[1] https://www.phoronix.com/news/Intel-LLM-Scaler-Omni-ComfyUI
[2] https://www.phoronix.com/search/Crescent+Island
[3] https://github.com/intel/llm-scaler/releases/tag/vllm-1.2