KTransformers Adds AVX2 MoE Support For Viable Performance On CPUs Without AMX/AVX-512
- News link: https://www.phoronix.com/news/KTransformers-0.5.3
KTransformers 0.5.3 was released today for this framework focused on efficient inference and fine-tuning of large language models (LLMs) with CPU-GPU heterogeneous computing. With this release, KTransformers is now more usable on CPUs lacking Advanced Matrix Extensions (AMX) and AVX-512 by providing some AVX2-only kernels too.
KTransformers 0.5.3 introduces AVX2-only inference support for Mixture of Experts (MoE) models, covering BF16, FP8, and GPTQ-INT4 workloads. This is very beneficial for current and recent generation Intel Core (Ultra) processors that lack AVX-512, in contrast to the latest Xeon servers with AMX and AVX-512 or AMD Zen 4/5 CPUs that also have AVX-512. That said, a CPU with AVX-512 or AMX will still yield much greater CPU-based AI inference performance.
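The fallback described above amounts to runtime ISA dispatch: probe the CPU's feature flags and pick the best available kernel tier. The sketch below illustrates that idea in Python; the tier names and selection logic are illustrative assumptions, not KTransformers' actual dispatch code.

```python
# Illustrative sketch of ISA-tier selection as a framework with an
# AVX2 fallback might do it. Tier names and ordering are hypothetical.

def read_cpu_flags(cpuinfo_text):
    """Extract the ISA feature flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            return set(line.split(":", 1)[1].split())
    return set()

def select_kernel_tier(flags):
    """Pick the best available kernel tier: AMX > AVX-512 > AVX2."""
    if "amx_tile" in flags:
        return "amx"
    if "avx512f" in flags:
        return "avx512"
    if "avx2" in flags:
        return "avx2"
    raise RuntimeError("no supported SIMD tier (AVX2 is the minimum)")

if __name__ == "__main__":
    # A consumer Core-class CPU typically reports AVX2 but not AVX-512.
    sample = "flags\t\t: fpu sse sse2 avx avx2 fma"
    print(select_kernel_tier(read_cpu_flags(sample)))
```

On Linux the real flags can be fed in from the contents of /proc/cpuinfo; on an AVX2-only desktop chip this sketch would land on the new AVX2 path, while a Xeon with AMX would take the fastest tier.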
[1]This pull request is what recently introduced the AVX2 inference support for kt-kernel. [2]This new documentation outlines running KTransformers on AVX2 processors for those interested.
KTransformers 0.5.3 also brings NUMA-aware deployment improvements for finer-grained NUMA mapping in multi-socket environments, lower idle CPU overhead, speculative decode enhancements, and various other improvements.
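On the NUMA side, "finer-grained NUMA mapping" in a multi-socket MoE setup boils down to discovering the machine's NUMA nodes and assigning experts across them. The following is a minimal sketch of that pattern, assuming the Linux sysfs layout; the round-robin placement is a hypothetical illustration, not KTransformers' actual policy.

```python
import os
import re

def numa_nodes(sysfs_root="/sys/devices/system/node"):
    """List the NUMA node IDs the kernel exposes; assume one node
    when sysfs is unavailable (e.g. non-Linux hosts)."""
    try:
        entries = os.listdir(sysfs_root)
    except OSError:
        return [0]  # assumption: treat the machine as single-node
    return sorted(int(m.group(1)) for e in entries
                  if (m := re.fullmatch(r"node(\d+)", e)))

def map_experts_to_nodes(num_experts, nodes):
    """Hypothetical round-robin placement of MoE experts across
    NUMA nodes, so each socket serves a slice of the expert set."""
    return {expert: nodes[expert % len(nodes)] for expert in range(num_experts)}

if __name__ == "__main__":
    print(map_experts_to_nodes(8, numa_nodes()))
```

On a dual-socket box this would alternate experts between node 0 and node 1, keeping each expert's weights local to the socket that computes it, which is the kind of placement a NUMA-aware deployment aims for.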
Those interested can find KTransformers 0.5.3 downloads and all the release details over on [3]GitHub.
[1] https://github.com/kvcache-ai/ktransformers/pull/1892
[2] https://github.com/kvcache-ai/ktransformers/blob/9b876754395fd047b6083417949410a39cf60290/doc/en/kt-kernel/AVX2-Tutorial.md
[3] https://github.com/kvcache-ai/ktransformers/releases/tag/v0.5.3