Llamafile 0.8.7 Brings Fixes, Better ARM Performance & Preps For New Server
(Phoronix — https://www.phoronix.com/news/Llamafile-0.8.7-Released)
[1]Llamafile has been one of the better new initiatives out of Mozilla in recent years. Llamafile makes it [2]convenient to distribute and run large language models as a single file, [3]supports both CPU and GPU execution, and all-around makes LLMs much more approachable for end users. Out today is Llamafile 0.8.7 with more performance optimizations and new features.
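As a sketch of the single-file workflow described above — note the model name and URL below are placeholders for illustration, not a real release artifact:

```shell
# Download a published .llamafile (placeholder URL -- substitute an
# actual model file from the project's release pages).
wget https://example.org/some-model.llamafile   # hypothetical artifact
chmod +x some-model.llamafile

# Running the file starts a local chat UI / HTTP server (CPU by default):
./some-model.llamafile

# GPU offload uses llama.cpp-style flags, e.g. offload all layers:
./some-model.llamafile -ngl 999
```

The same binary runs across Linux, macOS, Windows, and the BSDs thanks to Cosmopolitan Libc, which is what makes the one-file distribution model work.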
While recent Llamafile releases focused on tuning Intel/AMD AVX performance, today's Llamafile 0.8.7 release brings ARM performance improvements: better performance on Arm for legacy and K-quants, plus optimized matrix multiplication for I-quants on AArch64.
Llamafile 0.8.7 also fixes some AMD GPU issues on Windows by always using tinyBLAS there, improves CPU brand detection, and includes other fixes.
Moving forward, a new Llamafile server is being prepared for roll-out. Justine Tunney mentioned in the [4]v0.8.7 release announcement on GitHub:
"It should be noted that, in future releases, we plan to introduce a new server for llamafile. This new server is being designed for performance and production-worthiness. It's not included in this release, since the new server currently only supports a tokenization endpoint. However the endpoint is capable of doing 2 million requests per second whereas with the current server, the most we've ever seen is a few thousand."
[5]This patch adding the new Llamafile server notes that it is not only much faster than before but also designed to be crash-proof, reliable, and preemptive.
Llamafile continues looking great for easily distributing and running large language models. Learn more about this open-source project at [6]Llamafile.ai.
[1] https://www.phoronix.com/search/Llamafile
[2] https://www.phoronix.com/news/Llamafile-0.7
[3] https://www.phoronix.com/news/Llamafile-0.8.5-Released
[4] https://github.com/Mozilla-Ocho/llamafile/releases/tag/0.8.7
[5] https://github.com/Mozilla-Ocho/llamafile/commit/e0656ea190fa1687712c46641a721b02164e06d0
[6] https://llamafile.ai/