llama.cpp verification source 2026-05-22
Some checks are pending
Copilot Setup Steps / copilot-setup-steps (push) Waiting to run
Check Pre-Tokenizer Hashes / pre-tokenizer-hashes (push) Waiting to run
Python check requirements.txt / check-requirements (push) Waiting to run
Python Type-Check / python type-check (push) Waiting to run
Update Operations Documentation / update-ops-docs (push) Waiting to run
Some checks are pending
Copilot Setup Steps / copilot-setup-steps (push) Waiting to run
Check Pre-Tokenizer Hashes / pre-tokenizer-hashes (push) Waiting to run
Python check requirements.txt / check-requirements (push) Waiting to run
Python Type-Check / python type-check (push) Waiting to run
Update Operations Documentation / update-ops-docs (push) Waiting to run
This commit is contained in:
60
docs/backend/BLIS.md
Normal file
60
docs/backend/BLIS.md
Normal file
@@ -0,0 +1,60 @@
|
||||
BLIS Installation Manual
|
||||
------------------------
|
||||
|
||||
BLIS is a portable software framework for high-performance BLAS-like dense linear algebra libraries. It has received awards and recognition, including the 2023 James H. Wilkinson Prize for Numerical Software and the 2020 SIAM Activity Group on Supercomputing Best Paper Prize. BLIS provides a new BLAS-like API and a compatibility layer for traditional BLAS routine calls. It offers features such as object-based API, typed API, BLAS and CBLAS compatibility layers.
|
||||
|
||||
Project URL: https://github.com/flame/blis
|
||||
|
||||
### Prepare:
|
||||
|
||||
Compile BLIS:
|
||||
|
||||
```bash
|
||||
git clone https://github.com/flame/blis
|
||||
cd blis
|
||||
./configure --enable-cblas -t openmp,pthreads auto
|
||||
# will install to /usr/local/ by default.
|
||||
make -j
|
||||
```
|
||||
|
||||
Install BLIS:
|
||||
|
||||
```bash
|
||||
sudo make install
|
||||
```
|
||||
|
||||
We recommend using openmp since it's easier to modify the cores being used.
|
||||
|
||||
### llama.cpp compilation
|
||||
|
||||
CMake:
|
||||
|
||||
```bash
|
||||
mkdir build
|
||||
cd build
|
||||
cmake -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=FLAME ..
|
||||
make -j
|
||||
```
|
||||
|
||||
### llama.cpp execution
|
||||
|
||||
According to the BLIS documentation, we could set the following
|
||||
environment variables to modify the behavior of openmp:
|
||||
|
||||
```bash
|
||||
export GOMP_CPU_AFFINITY="0-19"
|
||||
export BLIS_NUM_THREADS=14
|
||||
```
|
||||
|
||||
And then run the binaries as normal.
|
||||
|
||||
|
||||
### Intel specific issue
|
||||
|
||||
Some might get the error message saying that `libimf.so` cannot be found.
|
||||
Please follow this [stackoverflow page](https://stackoverflow.com/questions/70687930/intel-oneapi-2022-libimf-so-no-such-file-or-directory-during-openmpi-compila).
|
||||
|
||||
### Reference:
|
||||
|
||||
1. https://github.com/flame/blis#getting-started
|
||||
2. https://github.com/flame/blis/blob/master/docs/Multithreading.md
|
||||
357
docs/backend/CANN.md
Executable file
357
docs/backend/CANN.md
Executable file
@@ -0,0 +1,357 @@
|
||||
# llama.cpp for CANN
|
||||
|
||||
- [Background](#background)
|
||||
- [News](#news)
|
||||
- [OS](#os)
|
||||
- [Hardware](#hardware)
|
||||
- [Model Supports](#model-supports)
|
||||
- [DataType Supports](#datatype-supports)
|
||||
- [Docker](#docker)
|
||||
- [Linux](#linux)
|
||||
- [Environment variable setup](#environment-variable-setup)
|
||||
- [TODO](#todo)
|
||||
|
||||
|
||||
## Background
|
||||
|
||||
**Ascend NPU** is a range of AI processors using Neural Processing Unit. It will efficiently handle matrix-matrix multiplication, dot-product and scalars.
|
||||
|
||||
**CANN** (Compute Architecture for Neural Networks) is a heterogeneous computing architecture for AI scenarios, providing support for multiple AI frameworks on the top and serving AI processors and programming at the bottom. It plays a crucial role in bridging the gap between upper and lower layers, and is a key platform for improving the computing efficiency of Ascend AI processors. Meanwhile, it offers a highly efficient and easy-to-use programming interface for diverse application scenarios, allowing users to rapidly build AI applications and services based on the Ascend platform.
|
||||
|
||||
**Llama.cpp + CANN**
|
||||
|
||||
The llama.cpp CANN backend is designed to support Ascend NPU. It utilize the ability of AscendC and ACLNN which are integrated to CANN Toolkit and kernels to using Ascend NPU directly.
|
||||
|
||||
## News
|
||||
|
||||
- 2024.11
|
||||
- Support F16 and F32 data type model for Ascend 310P NPU.
|
||||
- 2024.8
|
||||
- Support `Q4_0` and `Q8_0` data type for Ascend NPU.
|
||||
- 2024.7
|
||||
- Create CANN backend for Ascend NPU.
|
||||
|
||||
## OS
|
||||
|
||||
| OS | Status | Verified |
|
||||
|:-------:|:-------:|:----------------------------------------------:|
|
||||
| Linux | Support | Ubuntu 22.04, OpenEuler22.03 |
|
||||
|
||||
|
||||
## Hardware
|
||||
|
||||
### Ascend NPU
|
||||
|
||||
You can retrieve your Ascend device IDs using the following command:
|
||||
|
||||
```sh
|
||||
lspci -n | grep -Eo '19e5:d[0-9a-f]{3}' | cut -d: -f2
|
||||
```
|
||||
|
||||
**Devices**
|
||||
|
||||
| Device Id | Product Series | Product Models | Chip Model | Verified Status |
|
||||
|:---------:|----------------|----------------|:----------:|:---------------:|
|
||||
| d803 | Atlas A3 Train | | 910C | |
|
||||
| d803 | Atlas A3 Infer | | 910C | |
|
||||
| d802 | Atlas A2 Train | | 910B | |
|
||||
| d802 | Atlas A2 Infer | Atlas 300I A2 | 910B | Support |
|
||||
| d801 | Atlas Train | | 910 | |
|
||||
| d500 | Atlas Infer | Atlas 300I Duo | 310P | Support |
|
||||
|
||||
*Notes:*
|
||||
|
||||
- If you have trouble with Ascend NPU device, please create a issue with **[CANN]** prefix/tag.
|
||||
- If you run successfully with your Ascend NPU device, please help update the upper table.
|
||||
|
||||
|
||||
## Model Supports
|
||||
|
||||
<details>
|
||||
<summary>Text-only</summary>
|
||||
|
||||
| Model Name | FP16 | Q4_0 | Q8_0 |
|
||||
|:----------------------------|:-----:|:----:|:----:|
|
||||
| Llama-2 | √ | √ | √ |
|
||||
| Llama-3 | √ | √ | √ |
|
||||
| Mistral-7B | √ | √ | √ |
|
||||
| Mistral MOE | √ | √ | √ |
|
||||
| DBRX | - | - | - |
|
||||
| Falcon | √ | √ | √ |
|
||||
| Chinese LLaMA/Alpaca | √ | √ | √ |
|
||||
| Vigogne(French) | √ | √ | √ |
|
||||
| BERT | x | x | x |
|
||||
| Koala | √ | √ | √ |
|
||||
| Baichuan | √ | √ | √ |
|
||||
| Aquila 1 & 2 | √ | √ | √ |
|
||||
| Starcoder models | √ | √ | √ |
|
||||
| Refact | √ | √ | √ |
|
||||
| MPT | √ | √ | √ |
|
||||
| Bloom | √ | √ | √ |
|
||||
| Yi models | √ | √ | √ |
|
||||
| stablelm models | √ | √ | √ |
|
||||
| DeepSeek models | x | x | x |
|
||||
| Qwen models | √ | √ | √ |
|
||||
| PLaMo-13B | √ | √ | √ |
|
||||
| Phi models | √ | √ | √ |
|
||||
| PhiMoE | √ | √ | √ |
|
||||
| GPT-2 | √ | √ | √ |
|
||||
| Orion | √ | √ | √ |
|
||||
| InternlLM2 | √ | √ | √ |
|
||||
| CodeShell | √ | √ | √ |
|
||||
| Gemma | √ | √ | √ |
|
||||
| Mamba | √ | √ | √ |
|
||||
| Xverse | √ | √ | √ |
|
||||
| command-r models | √ | √ | √ |
|
||||
| Grok-1 | - | - | - |
|
||||
| SEA-LION | √ | √ | √ |
|
||||
| GritLM-7B | √ | √ | √ |
|
||||
| OLMo | √ | √ | √ |
|
||||
| OLMo 2 | √ | √ | √ |
|
||||
| OLMoE | √ | √ | √ |
|
||||
| Granite models | √ | √ | √ |
|
||||
| GPT-NeoX | √ | √ | √ |
|
||||
| Pythia | √ | √ | √ |
|
||||
| Snowflake-Arctic MoE | - | - | - |
|
||||
| Smaug | √ | √ | √ |
|
||||
| Poro 34B | √ | √ | √ |
|
||||
| Bitnet b1.58 models | √ | x | x |
|
||||
| Flan-T5 | √ | √ | √ |
|
||||
| Open Elm models | x | √ | √ |
|
||||
| chatGLM3-6B + ChatGLM4-9b + GLMEdge-1.5b + GLMEdge-4b | √ | √ | √ |
|
||||
| GLM-4-0414 | √ | √ | √ |
|
||||
| SmolLM | √ | √ | √ |
|
||||
| EXAONE-3.0-7.8B-Instruct | √ | √ | √ |
|
||||
| FalconMamba Models | √ | √ | √ |
|
||||
| Jais Models | - | x | x |
|
||||
| Bielik-11B-v2.3 | √ | √ | √ |
|
||||
| RWKV-6 | - | √ | √ |
|
||||
| QRWKV-6 | √ | √ | √ |
|
||||
| GigaChat-20B-A3B | x | x | x |
|
||||
| Trillion-7B-preview | √ | √ | √ |
|
||||
| Ling models | √ | √ | √ |
|
||||
|
||||
</details>
|
||||
|
||||
<details>
|
||||
<summary>Multimodal</summary>
|
||||
|
||||
| Model Name | FP16 | Q4_0 | Q8_0 |
|
||||
|:----------------------------|:-----:|:----:|:----:|
|
||||
| LLaVA 1.5 models, LLaVA 1.6 models | x | x | x |
|
||||
| BakLLaVA | √ | √ | √ |
|
||||
| Obsidian | √ | - | - |
|
||||
| ShareGPT4V | x | - | - |
|
||||
| MobileVLM 1.7B/3B models | - | - | - |
|
||||
| Yi-VL | - | - | - |
|
||||
| Mini CPM | √ | √ | √ |
|
||||
| Moondream | √ | √ | √ |
|
||||
| Bunny | √ | - | - |
|
||||
| GLM-EDGE | √ | √ | √ |
|
||||
| Qwen2-VL | √ | √ | √ |
|
||||
|
||||
</details>
|
||||
|
||||
|
||||
|
||||
## DataType Supports
|
||||
|
||||
| DataType | 910B | 310P |
|
||||
|:----------------------:|:-------:|:-------:|
|
||||
| FP16 | Support | Support |
|
||||
| Q8_0 | Support | Partial |
|
||||
| Q4_0 | Support | Partial |
|
||||
| BF16 | Support | |
|
||||
|
||||
> **310P note**
|
||||
> - `Q8_0`: data transform / buffer path is implemented, and `GET_ROWS` is supported, but quantized `MUL_MAT` / `MUL_MAT_ID` are not supported.
|
||||
> - `Q4_0`: data transform / buffer path is implemented, but quantized `MUL_MAT` / `MUL_MAT_ID` are not supported.
|
||||
|
||||
## Docker
|
||||
|
||||
### Build Images
|
||||
You can get a image with llama.cpp in one command.
|
||||
```sh
|
||||
docker build -t llama-cpp-cann -f .devops/llama-cli-cann.Dockerfile .
|
||||
```
|
||||
|
||||
### Run container
|
||||
|
||||
```sh
|
||||
# Find all cards.
|
||||
npu-smi info
|
||||
|
||||
# Select the cards that you want to use, make sure these cards are not used by someone.
|
||||
# Following using cards of device0.
|
||||
docker run --name llamacpp \
|
||||
--device /dev/davinci0 \
|
||||
--device /dev/davinci_manager \
|
||||
--device /dev/devmm_svm \
|
||||
--device /dev/hisi_hdc \
|
||||
-v /usr/local/dcmi:/usr/local/dcmi \
|
||||
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
|
||||
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
|
||||
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
|
||||
-v /PATH_TO_YOUR_MODELS/:/app/models \
|
||||
-it llama-cpp-cann \
|
||||
-m /app/models/MODEL_PATH \
|
||||
-ngl 32 \
|
||||
-p "Building a website can be done in 10 simple steps:"
|
||||
```
|
||||
|
||||
*Notes:*
|
||||
|
||||
- You may need to install Ascend Driver and firmware on the **host** machine *(Please refer to the [Linux configuration](#linux) for details)*.
|
||||
|
||||
## Linux
|
||||
|
||||
### I. Setup Environment
|
||||
|
||||
1. **Configure Ascend user and group**
|
||||
|
||||
```sh
|
||||
sudo groupadd HwHiAiUser
|
||||
sudo useradd -g HwHiAiUser -d /home/HwHiAiUser -m HwHiAiUser -s /bin/bash
|
||||
sudo usermod -aG HwHiAiUser $USER
|
||||
```
|
||||
|
||||
2. **Install dependencies**
|
||||
|
||||
**Ubuntu/Debian:**
|
||||
```sh
|
||||
sudo apt-get update
|
||||
sudo apt-get install -y gcc python3 python3-pip linux-headers-$(uname -r)
|
||||
```
|
||||
|
||||
**RHEL/CentOS:**
|
||||
```sh
|
||||
sudo yum makecache
|
||||
sudo yum install -y gcc python3 python3-pip kernel-headers-$(uname -r) kernel-devel-$(uname -r)
|
||||
```
|
||||
|
||||
3. **Install CANN (driver + toolkit)**
|
||||
|
||||
> The `Ascend-cann` package includes both the driver and toolkit.
|
||||
> `$ARCH` can be `x86_64` or `aarch64`, `$CHIP` can be `910b` or `310p`.
|
||||
|
||||
```sh
|
||||
wget https://ascend-repo.obs.cn-east-2.myhuaweicloud.com/CANN/CANN%208.5.T63/Ascend-cann_8.5.0_linux-$ARCH.run
|
||||
sudo bash ./Ascend-cann_8.5.0_linux-$ARCH.run --install
|
||||
|
||||
wget https://ascend-repo.obs.cn-east-2.myhuaweicloud.com/CANN/CANN%208.5.T63/Ascend-cann-$CHIP-ops_8.5.0_linux-$ARCH.run
|
||||
sudo bash ./Ascend-cann-$CHIP-ops_8.5.0_linux-$ARCH.run --install
|
||||
```
|
||||
|
||||
4. **Verify installation**
|
||||
|
||||
```sh
|
||||
npu-smi info
|
||||
```
|
||||
|
||||
If device information is displayed correctly, the driver is functioning properly.
|
||||
|
||||
```sh
|
||||
# Set environment variables (adjust path if needed)
|
||||
source /usr/local/Ascend/cann/set_env.sh
|
||||
|
||||
python3 -c "import acl; print(acl.get_soc_name())"
|
||||
```
|
||||
|
||||
If the command outputs the chip model, the installation was successful.
|
||||
|
||||
### II. Build llama.cpp
|
||||
|
||||
```sh
|
||||
cmake -B build -DGGML_CANN=on -DCMAKE_BUILD_TYPE=release
|
||||
cmake --build build --config release
|
||||
```
|
||||
|
||||
### III. Run the inference
|
||||
|
||||
1. **Retrieve and prepare model**
|
||||
|
||||
You can refer to the general [*Obtaining and quantizing models*](../../README.md#obtaining-and-quantizing-models) guide for model prepration.
|
||||
|
||||
**Notes**:
|
||||
|
||||
- CANN backend only supports FP16/Q4_0/Q8_0 models currently.
|
||||
|
||||
2. **Launch inference**
|
||||
|
||||
There are two device selection modes:
|
||||
|
||||
- Single device: Use one device target specified by the user.
|
||||
- Multiple devices: Automatically choose the devices with the same backend.
|
||||
|
||||
| Device selection | Parameter |
|
||||
|:----------------:|:--------------------------------------:|
|
||||
| Single device | --split-mode none --main-gpu DEVICE_ID |
|
||||
| Multiple devices | --split-mode layer (default) |
|
||||
|
||||
Examples:
|
||||
|
||||
- Use device 0:
|
||||
|
||||
```sh
|
||||
./build/bin/llama-cli -m path_to_model -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33 -sm none -mg 0
|
||||
```
|
||||
|
||||
- Use multiple devices:
|
||||
|
||||
```sh
|
||||
./build/bin/llama-cli -m path_to_model -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33 -sm layer
|
||||
```
|
||||
|
||||
### **GitHub contribution**:
|
||||
Please add the **[CANN]** prefix/tag in issues/PRs titles to help the CANN-team check/address them without delay.
|
||||
|
||||
## Updates
|
||||
### Basic Flash Attention Support
|
||||
The basic FA kernel with aclnnops has been added in aclnn_ops.cpp.
|
||||
Currently, the FA only supports the cases with FP16 KV tensors and NO logit softcap.
|
||||
Since the aclnn interface for flash attention cannot support the logit softcap, we will only update the quantized version in the future.
|
||||
|
||||
Authors from Peking University: Bizhao Shi (bshi@pku.edu.cn), Yuxin Yang (yxyang@pku.edu.cn), Ruiyang Ma (ruiyang@stu.pku.edu.cn), and Guojie Luo (gluo@pku.edu.cn).
|
||||
|
||||
We would like to thank Tuo Dai, Shanni Li, and all of the project maintainers from Huawei Technologies Co., Ltd for their help during the code development and pull request.
|
||||
|
||||
## Environment variable setup
|
||||
|
||||
### GGML_CANN_MEM_POOL
|
||||
|
||||
Specifies the memory pool management strategy, Default is vmm.
|
||||
|
||||
- vmm: Utilizes a virtual memory manager pool. If hardware support for VMM is unavailable, falls back to the legacy (leg) memory pool.
|
||||
|
||||
- prio: Employs a priority queue-based memory pool management.
|
||||
|
||||
- leg: Uses a fixed-size buffer pool.
|
||||
|
||||
### GGML_CANN_DISABLE_BUF_POOL_CLEAN
|
||||
|
||||
Controls automatic cleanup of the memory pool. This option is only effective when using the prio or leg memory pool strategies.
|
||||
|
||||
### GGML_CANN_WEIGHT_NZ
|
||||
|
||||
Converting the matmul weight format from ND to NZ to improve performance. Enabled by default.
|
||||
|
||||
### GGML_CANN_ACL_GRAPH
|
||||
|
||||
Operators are executed using ACL graph execution, rather than in op-by-op (eager) mode. Enabled by default. This option is only effective if `USE_ACL_GRAPH` was enabled at compilation time. To enable it, recompile using:
|
||||
|
||||
```sh
|
||||
cmake -B build -DGGML_CANN=on -DCMAKE_BUILD_TYPE=release -DUSE_ACL_GRAPH=ON
|
||||
cmake --build build --config release
|
||||
```
|
||||
|
||||
### GGML_CANN_GRAPH_CACHE_CAPACITY
|
||||
|
||||
Maximum number of compiled CANN graphs kept in the LRU cache, default is 12. When the number of cached graphs exceeds this capacity, the least recently used graph will be evicted.
|
||||
|
||||
### GGML_CANN_PREFILL_USE_GRAPH
|
||||
|
||||
Enable ACL graph execution during the prefill stage, default is false. This option is only effective when FA is enabled.
|
||||
|
||||
### GGML_CANN_OPERATOR_FUSION
|
||||
|
||||
Enable operator fusion during computation, default is false. This option fuses compatible operators (e.g., ADD + RMS_NORM) to reduce overhead and improve performance.
|
||||
283
docs/backend/CUDA-FEDORA.md
Normal file
283
docs/backend/CUDA-FEDORA.md
Normal file
@@ -0,0 +1,283 @@
|
||||
# Setting Up CUDA on Fedora
|
||||
|
||||
In this guide we setup [Nvidia CUDA](https://docs.nvidia.com/cuda/) in a toolbox container. This guide is applicable for:
|
||||
|
||||
- [Fedora Workstation](https://fedoraproject.org/workstation/)
|
||||
- [Atomic Desktops for Fedora](https://fedoraproject.org/atomic-desktops/)
|
||||
- [Fedora Spins](https://fedoraproject.org/spins)
|
||||
- [Other Distributions](https://containertoolbx.org/distros/), including `Red Hat Enterprise Linux >= 8.5`, `Arch Linux`, and `Ubuntu`.
|
||||
|
||||
## Table of Contents
|
||||
|
||||
- [Prerequisites](#prerequisites)
|
||||
- [Using the Fedora 41 CUDA Repository](#using-the-fedora-41-cuda-repository)
|
||||
- [Creating a Fedora Toolbox Environment](#creating-a-fedora-toolbox-environment)
|
||||
- [Installing Essential Development Tools](#installing-essential-development-tools)
|
||||
- [Adding the CUDA Repository](#adding-the-cuda-repository)
|
||||
- [Installing Nvidia Driver Libraries](#installing-nvidia-driver-libraries)
|
||||
- [Installing the CUDA Meta-Package](#installing-the-cuda-meta-package)
|
||||
- [Configuring the Environment](#configuring-the-environment)
|
||||
- [Verifying the Installation](#verifying-the-installation)
|
||||
- [Conclusion](#conclusion)
|
||||
- [Troubleshooting](#troubleshooting)
|
||||
- [Additional Notes](#additional-notes)
|
||||
- [References](#references)
|
||||
|
||||
## Prerequisites
|
||||
|
||||
- **Toolbox Installed on the Host System** `Fedora Silverblue` and `Fedora Workstation` both have toolbox by default, other distributions may need to install the [toolbox package](https://containertoolbx.org/install/).
|
||||
- **NVIDIA Drivers and Graphics Card installed on Host System (recommended)** To run CUDA program, such as `llama.cpp`, the host should be setup to access your NVIDIA hardware. Fedora Hosts can use the [RPM Fusion Repository](https://rpmfusion.org/Howto/NVIDIA).
|
||||
- **Internet connectivity** to download packages.
|
||||
|
||||
### Using the Fedora 41 CUDA Repository
|
||||
|
||||
The latest release is 41.
|
||||
|
||||
- [Fedora 41 CUDA Repository](https://developer.download.nvidia.com/compute/cuda/repos/fedora41/x86_64/)
|
||||
|
||||
**Note:** We recommend using a toolbox environment to prevent system conflicts.
|
||||
|
||||
## Creating a Fedora Toolbox Environment
|
||||
|
||||
This guide focuses on Fedora hosts, but with small adjustments, it can work for other hosts. Using the Fedora Toolbox allows us to install the necessary packages without affecting the host system.
|
||||
|
||||
**Note:** Toolbox is available for other systems, and even without Toolbox, it is possible to use Podman or Docker.
|
||||
|
||||
1. **Create a Fedora 41 Toolbox:**
|
||||
|
||||
```bash
|
||||
toolbox create --image registry.fedoraproject.org/fedora-toolbox:41 --container fedora-toolbox-41-cuda
|
||||
```
|
||||
|
||||
2. **Enter the Toolbox:**
|
||||
|
||||
```bash
|
||||
toolbox enter --container fedora-toolbox-41-cuda
|
||||
```
|
||||
|
||||
Inside the toolbox, you have root privileges and can install packages without affecting the host system.
|
||||
|
||||
## Installing Essential Development Tools
|
||||
|
||||
1. **Synchronize the DNF Package Manager:**
|
||||
|
||||
```bash
|
||||
sudo dnf distro-sync
|
||||
```
|
||||
|
||||
2. **Install **Vim** the default text editor (Optional):**
|
||||
|
||||
```bash
|
||||
sudo dnf install vim-default-editor --allowerasing
|
||||
```
|
||||
|
||||
The `--allowerasing` flag will allow the removal of the conflicting `nano-default-editor` package.
|
||||
|
||||
3. **Install Development Tools and Libraries:**
|
||||
|
||||
```bash
|
||||
sudo dnf install @c-development @development-tools cmake
|
||||
```
|
||||
|
||||
This installs essential packages for compiling software, including `gcc`, `make`, and other development headers.
|
||||
|
||||
## Adding the CUDA Repository
|
||||
|
||||
Add the NVIDIA CUDA repository to your DNF configuration:
|
||||
|
||||
```bash
|
||||
sudo dnf config-manager addrepo --from-repofile=https://developer.download.nvidia.com/compute/cuda/repos/fedora41/x86_64/cuda-fedora41.repo
|
||||
```
|
||||
|
||||
After adding the repository, synchronize the package manager again:
|
||||
|
||||
```bash
|
||||
sudo dnf distro-sync
|
||||
```
|
||||
|
||||
## Installing Nvidia Driver Libraries
|
||||
|
||||
First, we need to detect if the host is supplying the [NVIDIA driver libraries into the toolbox](https://github.com/containers/toolbox/blob/main/src/pkg/nvidia/nvidia.go):
|
||||
|
||||
```bash
|
||||
ls -la /usr/lib64/libcuda.so.1
|
||||
```
|
||||
|
||||
### If *`libcuda.so.1`* is missing:
|
||||
|
||||
```
|
||||
ls: cannot access '/usr/lib64/libcuda.so.1': No such file or directory
|
||||
```
|
||||
|
||||
**Explanation:**
|
||||
The host dose not supply the CUDA drivers, **install them now:**
|
||||
|
||||
#### Install the Nvidia Driver Libraries on Guest:
|
||||
|
||||
```bash
|
||||
sudo dnf install nvidia-driver-cuda nvidia-driver-libs nvidia-driver-cuda-libs nvidia-persistenced
|
||||
```
|
||||
|
||||
### If *`libcuda.so.1`* exists:
|
||||
```
|
||||
lrwxrwxrwx. 1 root root 21 Mar 24 11:26 /usr/lib64/libcuda.so.1 -> libcuda.so.570.133.07
|
||||
```
|
||||
|
||||
**Explanation:**
|
||||
The host is supply the CUDA drivers, **we need to update the guest RPM Database accordingly:**
|
||||
|
||||
#### Update the Toolbox RPM Database to include the Host-Supplied Libraries:
|
||||
|
||||
Note: we do not actually install the libraries, we just update the DB so that the guest system knows they are supplied by the host.
|
||||
|
||||
##### 1. Download `nvidia-` parts that are supplied by the host RPM's (with dependencies)
|
||||
|
||||
```bash
|
||||
sudo dnf download --destdir=/tmp/nvidia-driver-libs --resolve --arch x86_64 nvidia-driver-cuda nvidia-driver-libs nvidia-driver-cuda-libs nvidia-persistenced
|
||||
```
|
||||
|
||||
##### 2. Update the RPM database to assume the installation of these packages.
|
||||
|
||||
```bash
|
||||
sudo rpm --install --verbose --hash --justdb /tmp/nvidia-driver-libs/*
|
||||
```
|
||||
|
||||
**Note:**
|
||||
|
||||
- The `--justdb` option only updates the RPM database, without touching the filesystem elsewhere.
|
||||
|
||||
##### Check that the RPM Database has been correctly updated:
|
||||
|
||||
**Note:** This is the same command as in the *"Install the Nvidia Driver Libraries on Guest"* for if *`libcuda.so.1`* was missing.
|
||||
|
||||
|
||||
```bash
|
||||
sudo dnf install nvidia-driver-cuda nvidia-driver-libs nvidia-driver-cuda-libs nvidia-persistenced
|
||||
```
|
||||
|
||||
*(this time it will not install anything, as the database things that these packages are already installed)*
|
||||
|
||||
```
|
||||
Updating and loading repositories:
|
||||
Repositories loaded.
|
||||
Package "nvidia-driver-cuda-3:570.124.06-1.fc41.x86_64" is already installed.
|
||||
Package "nvidia-driver-libs-3:570.124.06-1.fc41.x86_64" is already installed.
|
||||
Package "nvidia-driver-cuda-libs-3:570.124.06-1.fc41.x86_64" is already installed.
|
||||
Package "nvidia-persistenced-3:570.124.06-1.fc41.x86_64" is already installed.
|
||||
|
||||
Nothing to do.
|
||||
```
|
||||
|
||||
## Installing the CUDA Meta-Package
|
||||
|
||||
Now that the driver libraries are installed, proceed to install CUDA:
|
||||
|
||||
```bash
|
||||
sudo dnf install cuda
|
||||
```
|
||||
|
||||
This installs the CUDA toolkit and associated packages.
|
||||
|
||||
## Configuring the Environment
|
||||
|
||||
To use CUDA, add its binary directory to your system's `PATH`.
|
||||
|
||||
1. **Create a Profile Script:**
|
||||
|
||||
```bash
|
||||
sudo sh -c 'echo "export PATH=\$PATH:/usr/local/cuda/bin" >> /etc/profile.d/cuda.sh'
|
||||
```
|
||||
|
||||
**Explanation:**
|
||||
|
||||
- We add to `/etc/profile.d/` as the `/etc/` folder is unique to this particular container, and is not shared with other containers or the host system.
|
||||
- The backslash `\` before `$PATH` ensures the variable is correctly written into the script.
|
||||
|
||||
2. **Make the Script Executable:**
|
||||
|
||||
```bash
|
||||
sudo chmod +x /etc/profile.d/cuda.sh
|
||||
```
|
||||
|
||||
3. **Source the Script to Update Your Environment:**
|
||||
|
||||
```bash
|
||||
source /etc/profile.d/cuda.sh
|
||||
```
|
||||
|
||||
**Note:** This command updates your current shell session with the new `PATH`. The `/etc/profile.d/cuda.sh` script ensures that the CUDA binaries are available in your `PATH` for all future sessions.
|
||||
|
||||
## Verifying the Installation
|
||||
|
||||
To confirm that CUDA is correctly installed and configured, check the version of the NVIDIA CUDA Compiler (`nvcc`):
|
||||
|
||||
```bash
|
||||
nvcc --version
|
||||
```
|
||||
|
||||
You should see output similar to:
|
||||
|
||||
```
|
||||
nvcc: NVIDIA (R) Cuda compiler driver
|
||||
Copyright (c) 2005-2025 NVIDIA Corporation
|
||||
Built on Fri_Feb_21_20:23:50_PST_2025
|
||||
Cuda compilation tools, release 12.8, V12.8.93
|
||||
Build cuda_12.8.r12.8/compiler.35583870_0
|
||||
```
|
||||
|
||||
This output confirms that the CUDA compiler is accessible and indicates the installed version.
|
||||
|
||||
## Conclusion
|
||||
|
||||
You have successfully set up CUDA on Fedora within a toolbox environment using the Fedora 41 CUDA repository. By manually updating the RPM db and configuring the environment, you can develop CUDA applications without affecting your host system.
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
- **Installation Failures:**
|
||||
|
||||
- If you encounter errors during installation, carefully read the error messages. They often indicate conflicting files or missing dependencies.
|
||||
- You may use the `--excludepath` option with `rpm` to exclude conflicting files during manual RPM installations.
|
||||
|
||||
- **Rebooting the Container:**
|
||||
|
||||
- Sometimes there may be a bug in the NVIDIA driver host passthrough (such as missing a shared library). Rebooting the container may solve this issue:
|
||||
|
||||
```bash
|
||||
# on the host system
|
||||
podman container restart --all
|
||||
```
|
||||
|
||||
- **Environment Variables Not Set:**
|
||||
- If `nvcc` is not found after installation, ensure that `/usr/local/cuda/bin` is in your `PATH`.
|
||||
- Run `echo $PATH` to check if the path is included.
|
||||
- Re-source the profile script or open a new terminal session.
|
||||
|
||||
## Additional Notes
|
||||
|
||||
- **Updating CUDA in the Future:**
|
||||
|
||||
- Keep an eye on the official NVIDIA repositories for updates to your Fedora version.
|
||||
- When an updated repository becomes available, adjust your `dnf` configuration accordingly.
|
||||
|
||||
- **Building `llama.cpp`:**
|
||||
|
||||
- With CUDA installed, you can follow these [build instructions for `llama.cpp`](https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md) to compile it with CUDA support.
|
||||
- Ensure that any CUDA-specific build flags or paths are correctly set in your build configuration.
|
||||
|
||||
- **Using the Toolbox Environment:**
|
||||
- The toolbox environment is isolated from your host system, which helps prevent conflicts.
|
||||
- Remember that system files and configurations inside the toolbox are separate from the host. By default the home directory of the user is shared between the host and the toolbox.
|
||||
|
||||
---
|
||||
|
||||
**Disclaimer:** Manually installing and modifying system packages can lead to instability of the container. The above steps are provided as a guideline and may need adjustments based on your specific system configuration. Always back up important data before making significant system changes, especially as your home folder is writable and shared with he toolbox.
|
||||
|
||||
**Acknowledgments:** Special thanks to the Fedora community and NVIDIA documentation for providing resources that assisted in creating this guide.
|
||||
|
||||
## References
|
||||
|
||||
- [Fedora Toolbox Documentation](https://docs.fedoraproject.org/en-US/fedora-silverblue/toolbox/)
|
||||
- [NVIDIA CUDA Installation Guide](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html)
|
||||
- [Podman Documentation](https://podman.io/get-started)
|
||||
|
||||
---
|
||||
282
docs/backend/OPENCL.md
Normal file
282
docs/backend/OPENCL.md
Normal file
@@ -0,0 +1,282 @@
|
||||
# llama.cpp for OpenCL
|
||||
|
||||
- [Background](#background)
|
||||
- [OS](#os)
|
||||
- [Hardware](#hardware)
|
||||
- [DataType Supports](#datatype-supports)
|
||||
- [Model Preparation](#model-preparation)
|
||||
- [CMake Options](#cmake-options)
|
||||
- [Android](#android)
|
||||
- [Windows 11 Arm64](#windows-11-arm64)
|
||||
- [Linux](#Linux)
|
||||
- [Known Issue](#known-issues)
|
||||
- [TODO](#todo)
|
||||
|
||||
## Background
|
||||
|
||||
OpenCL (Open Computing Language) is an open, royalty-free standard for cross-platform, parallel programming of diverse accelerators found in supercomputers, cloud servers, personal computers, mobile devices and embedded platforms. OpenCL specifies a programming language (based on C99) for programming these devices and application programming interfaces (APIs) to control the platform and execute programs on the compute devices. Similar to CUDA, OpenCL has been widely used to program GPUs and is supported by most GPU vendors.
|
||||
|
||||
### Llama.cpp + OpenCL
|
||||
|
||||
The llama.cpp OpenCL backend is designed to enable llama.cpp on **Qualcomm Adreno GPU** firstly via OpenCL. Thanks to the portabilty of OpenCL, the OpenCL backend can also run on certain Intel GPUs such as those that do not have [SYCL](/docs/backend/SYCL.md) support although the performance is not optimal.
|
||||
|
||||
## OS
|
||||
|
||||
| OS | Status | Verified |
|
||||
|---------|---------|------------------------------------------------|
|
||||
| Android | Support | Snapdragon 8 Gen 3, Snapdragon 8 Elite |
|
||||
| Windows | Support | Windows 11 Arm64 with Snapdragon X Elite |
|
||||
| Linux | Support | Ubuntu 22.04 WSL2 with Intel 12700H |
|
||||
|
||||
## Hardware
|
||||
|
||||
### Adreno GPU
|
||||
|
||||
**Verified devices**
|
||||
|
||||
| Adreno GPU | Status |
|
||||
|:------------------------------------:|:-------:|
|
||||
| Adreno 750 (Snapdragon 8 Gen 3) | Support |
|
||||
| Adreno 830 (Snapdragon 8 Elite) | Support |
|
||||
| Adreno X85 (Snapdragon X Elite) | Support |
|
||||
|
||||
> A6x GPUs with a recent driver and compiler are supported; they are usually found in IoT platforms.
|
||||
However, A6x GPUs in phones are likely not supported due to the outdated driver and compiler.
|
||||
|
||||
## DataType Supports
|
||||
|
||||
| DataType | Status |
|
||||
|:----------------------:|:--------------------------:|
|
||||
| Q4_0 | Support |
|
||||
| Q6_K | Support, but not optimized |
|
||||
| Q8_0 | Support |
|
||||
| MXFP4 | Support |
|
||||
|
||||
## Model Preparation
|
||||
|
||||
You can refer to the general [llama-quantize tool](/tools/quantize/README.md) for steps to convert a model in Hugging Face safetensor format to GGUF with quantization.
|
||||
|
||||
Currently we support `Q4_0` quantization and have optimized for it. To achieve best performance on Adreno GPU, add `--pure` to `llama-quantize` (i.e., make all weights in `Q4_0`). For example,
|
||||
|
||||
```sh
|
||||
./llama-quantize --pure ggml-model-qwen2.5-3b-f16.gguf ggml-model-qwen-3b-Q4_0.gguf Q4_0
|
||||
```
|
||||
|
||||
Since `Q6_K` is also supported, `Q4_0` quantization without `--pure` will also work. However, the performance will be worse compared to pure `Q4_0` quantization.
|
||||
|
||||
### `MXFP4` MoE Models
|
||||
|
||||
OpenAI gpt-oss models are MoE models in `MXFP4`. The quantized model will be in `MXFP4_MOE`, a mixture of `MXFP4` and `Q8_0`.
|
||||
For this quantization, there is no need to specify `--pure`.
|
||||
For gpt-oss-20b model, you can directly [download](https://huggingface.co/ggml-org/gpt-oss-20b-GGUF) the quantized GGUF file in `MXFP4_MOE` from Hugging Face.
|
||||
|
||||
Although it is possible to quantize gpt-oss-20b model in pure `Q4_0` (all weights in `Q4_0`), it is not recommended since `MXFP4` has been optimized for MoE while `Q4_0` is not. In addition, accuracy should degrade with such pure `Q4_0` quantization.
|
||||
Hence, using the default `MXFP4_MOE` quantization (see the link above) is recommended for this model.
|
||||
|
||||
> Note that the `Q4_0` model found [here](https://huggingface.co/unsloth/gpt-oss-20b-GGUF/blob/main/gpt-oss-20b-Q4_0.gguf) is a mixture of `Q4_0`, `Q8_0` and `MXFP4` and gives better performance than `MXFP4_MOE` quantization.
|
||||
|
||||
## CMake Options
|
||||
|
||||
The OpenCL backend has the following CMake options that control the behavior of the backend.
|
||||
|
||||
| CMake options | Default value | Description |
|
||||
|:---------------------------------:|:--------------:|:------------------------------------------|
|
||||
| `GGML_OPENCL_EMBED_KERNELS` | `ON` | Embed OpenCL kernels into the executable. |
|
||||
| `GGML_OPENCL_USE_ADRENO_KERNELS` | `ON` | Use kernels optimized for Adreno. |
|
||||
|
||||
## Android
|
||||
|
||||
Ubuntu 22.04 is used for targeting Android. Make sure the following tools are accessible from command line,
|
||||
|
||||
* Git
|
||||
* CMake 3.29
|
||||
* Ninja
|
||||
* Python3
|
||||
|
||||
### I. Setup Environment
|
||||
|
||||
1. **Install NDK**
|
||||
|
||||
```sh
|
||||
cd ~
|
||||
wget https://dl.google.com/android/repository/commandlinetools-linux-8512546_latest.zip && \
|
||||
unzip commandlinetools-linux-8512546_latest.zip && \
|
||||
mkdir -p ~/android-sdk/cmdline-tools && \
|
||||
mv cmdline-tools latest && \
|
||||
mv latest ~/android-sdk/cmdline-tools/ && \
|
||||
rm -rf commandlinetools-linux-8512546_latest.zip
|
||||
|
||||
yes | ~/android-sdk/cmdline-tools/latest/bin/sdkmanager "ndk;26.3.11579264"
|
||||
```
|
||||
|
||||
2. **Install OpenCL Headers and Library**
|
||||
|
||||
```sh
|
||||
mkdir -p ~/dev/llm
|
||||
cd ~/dev/llm
|
||||
|
||||
git clone https://github.com/KhronosGroup/OpenCL-Headers && \
|
||||
cd OpenCL-Headers && \
|
||||
cp -r CL ~/android-sdk/ndk/26.3.11579264/toolchains/llvm/prebuilt/linux-x86_64/sysroot/usr/include
|
||||
|
||||
cd ~/dev/llm
|
||||
|
||||
git clone https://github.com/KhronosGroup/OpenCL-ICD-Loader && \
|
||||
cd OpenCL-ICD-Loader && \
|
||||
mkdir build_ndk26 && cd build_ndk26 && \
|
||||
cmake .. -G Ninja -DCMAKE_BUILD_TYPE=Release \
|
||||
-DCMAKE_TOOLCHAIN_FILE=$HOME/android-sdk/ndk/26.3.11579264/build/cmake/android.toolchain.cmake \
|
||||
-DOPENCL_ICD_LOADER_HEADERS_DIR=$HOME/android-sdk/ndk/26.3.11579264/toolchains/llvm/prebuilt/linux-x86_64/sysroot/usr/include \
|
||||
-DANDROID_ABI=arm64-v8a \
|
||||
-DANDROID_PLATFORM=24 \
|
||||
-DANDROID_STL=c++_shared && \
|
||||
ninja && \
|
||||
cp libOpenCL.so ~/android-sdk/ndk/26.3.11579264/toolchains/llvm/prebuilt/linux-x86_64/sysroot/usr/lib/aarch64-linux-android
|
||||
```
|
||||
|
||||
### II. Build llama.cpp
|
||||
|
||||
```sh
|
||||
cd ~/dev/llm
|
||||
|
||||
git clone https://github.com/ggml-org/llama.cpp && \
|
||||
cd llama.cpp && \
|
||||
mkdir build-android && cd build-android
|
||||
|
||||
cmake .. -G Ninja \
|
||||
-DCMAKE_TOOLCHAIN_FILE=$HOME/android-sdk/ndk/26.3.11579264/build/cmake/android.toolchain.cmake \
|
||||
-DANDROID_ABI=arm64-v8a \
|
||||
-DANDROID_PLATFORM=android-28 \
|
||||
-DBUILD_SHARED_LIBS=OFF \
|
||||
-DGGML_OPENCL=ON
|
||||
|
||||
ninja
|
||||
```
|
||||
|
||||
## Windows 11 Arm64
|
||||
|
||||
A Snapdragon X Elite device with Windows 11 Arm64 is used. Make sure the following tools are accessible from command line,
|
||||
|
||||
* Git
|
||||
* CMake 3.29
|
||||
* Clang 19
|
||||
* Ninja
|
||||
* Visual Studio 2022
|
||||
* Powershell 7
|
||||
* Python
|
||||
|
||||
Visual Studio provides necessary headers and libraries although it is not directly used for building.
|
||||
Alternatively, Visual Studio Build Tools can be installed instead of the full Visual Studio.
|
||||
|
||||
> Note that building using Visual Studio's cl compiler is not supported. Clang must be used. Clang depends on libraries provided by Visual Studio to work. Therefore, Visual Studio must be installed. Alternatively, Visual Studio Build Tools can be installed instead of the full Visual Studio.
|
||||
|
||||
Powershell 7 is used for the following commands.
|
||||
If an older version of Powershell is used, these commands may not work as they are.
|
||||
|
||||
### I. Setup Environment
|
||||
|
||||
1. **Install OpenCL Headers and Library**
|
||||
|
||||
```powershell
|
||||
mkdir -p ~/dev/llm
|
||||
|
||||
cd ~/dev/llm
|
||||
git clone https://github.com/KhronosGroup/OpenCL-Headers && cd OpenCL-Headers
|
||||
mkdir build && cd build
|
||||
cmake .. -G Ninja `
|
||||
-DBUILD_TESTING=OFF `
|
||||
-DOPENCL_HEADERS_BUILD_TESTING=OFF `
|
||||
-DOPENCL_HEADERS_BUILD_CXX_TESTS=OFF `
|
||||
-DCMAKE_INSTALL_PREFIX="$HOME/dev/llm/opencl"
|
||||
cmake --build . --target install
|
||||
|
||||
cd ~/dev/llm
|
||||
git clone https://github.com/KhronosGroup/OpenCL-ICD-Loader && cd OpenCL-ICD-Loader
|
||||
mkdir build && cd build
|
||||
cmake .. -G Ninja `
|
||||
-DCMAKE_BUILD_TYPE=Release `
|
||||
-DCMAKE_PREFIX_PATH="$HOME/dev/llm/opencl" `
|
||||
-DCMAKE_INSTALL_PREFIX="$HOME/dev/llm/opencl"
|
||||
cmake --build . --target install
|
||||
```
|
||||
|
||||
### II. Build llama.cpp
|
||||
|
||||
```powershell
|
||||
|
||||
mkdir -p ~/dev/llm
|
||||
cd ~/dev/llm
|
||||
|
||||
git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
|
||||
mkdir build && cd build
|
||||
|
||||
cmake .. -G Ninja `
|
||||
-DCMAKE_TOOLCHAIN_FILE="$HOME/dev/llm/llama.cpp/cmake/arm64-windows-llvm.cmake" `
|
||||
-DCMAKE_BUILD_TYPE=Release `
|
||||
-DCMAKE_PREFIX_PATH="$HOME/dev/llm/opencl" `
|
||||
-DBUILD_SHARED_LIBS=OFF `
|
||||
-DGGML_OPENCL=ON
|
||||
ninja
|
||||
```
|
||||
|
||||
## Linux
|
||||
|
||||
The two steps just above also apply to Linux. When building for linux, the commands are mostly the same as those for PowerShell on Windows, but in the second step they do not have the `-DCMAKE_TOOLCHAIN_FILE` parameter, and then in both steps the backticks are replaced with back slashes.
|
||||
|
||||
If not installed already, install Git, CMake, Clang, Ninja and Python, then run in the terminal the following:
|
||||
|
||||
### I. Setup Environment
|
||||
|
||||
1. **Install OpenCL Headers and Library**
|
||||
|
||||
```bash
|
||||
mkdir -p ~/dev/llm
|
||||
|
||||
cd ~/dev/llm
|
||||
git clone https://github.com/KhronosGroup/OpenCL-Headers && cd OpenCL-Headers
|
||||
mkdir build && cd build
|
||||
cmake .. -G Ninja \
|
||||
-DBUILD_TESTING=OFF \
|
||||
-DOPENCL_HEADERS_BUILD_TESTING=OFF \
|
||||
-DOPENCL_HEADERS_BUILD_CXX_TESTS=OFF \
|
||||
-DCMAKE_INSTALL_PREFIX="$HOME/dev/llm/opencl"
|
||||
cmake --build . --target install
|
||||
|
||||
cd ~/dev/llm
|
||||
git clone https://github.com/KhronosGroup/OpenCL-ICD-Loader && cd OpenCL-ICD-Loader
|
||||
mkdir build && cd build
|
||||
cmake .. -G Ninja \
|
||||
-DCMAKE_BUILD_TYPE=Release \
|
||||
-DCMAKE_PREFIX_PATH="$HOME/dev/llm/opencl" \
|
||||
-DCMAKE_INSTALL_PREFIX="$HOME/dev/llm/opencl"
|
||||
cmake --build . --target install
|
||||
```
|
||||
|
||||
### II. Build llama.cpp
|
||||
|
||||
```bash
|
||||
mkdir -p ~/dev/llm
|
||||
cd ~/dev/llm
|
||||
|
||||
git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
|
||||
mkdir build && cd build
|
||||
|
||||
cmake .. -G Ninja \
|
||||
-DCMAKE_BUILD_TYPE=Release \
|
||||
-DCMAKE_PREFIX_PATH="$HOME/dev/llm/opencl" \
|
||||
-DBUILD_SHARED_LIBS=OFF \
|
||||
-DGGML_OPENCL=ON
|
||||
ninja
|
||||
```
|
||||
|
||||
## Known Issues
|
||||
|
||||
- Flash attention does not always improve performance.
|
||||
- Currently OpenCL backend works on A6xx GPUs with recent drivers and compilers (usually found in IoT platforms).
|
||||
However, it does not work on A6xx GPUs found in phones with old drivers and compilers.
|
||||
|
||||
## TODO
|
||||
|
||||
- Optimization for Q6_K
|
||||
- Support and optimization for Q4_K
|
||||
- Improve flash attention
|
||||
386
docs/backend/OPENVINO.md
Normal file
386
docs/backend/OPENVINO.md
Normal file
@@ -0,0 +1,386 @@
|
||||
# OpenVINO Backend for llama.cpp
|
||||
|
||||
> [!NOTE]
|
||||
> Performance and memory optimizations, accuracy validation, broader quantization coverage, broader operator and model support are work in progress.
|
||||
|
||||
[OpenVINO](https://docs.openvino.ai/) is an open-source toolkit for optimizing and deploying high-performance AI inference, specifically designed for Intel hardware, including CPUs, GPUs, and NPUs, in the cloud, on-premises, and on the edge. [OpenVINO backend for llama.cpp](../../ggml/src/ggml-openvino) enables hardware-accelerated inference on **Intel® CPUs, GPUs, and NPUs** while remaining compatible with the existing **GGUF model ecosystem**. The backend translates GGML compute graphs into OpenVINO graphs and leverages graph compilation, kernel fusion, and device-specific optimizations to improve inference performance on supported Intel hardware.
|
||||
|
||||
The OpenVINO backend is implemented in `ggml/src/ggml-openvino` and provides a translation layer for core GGML operations. The OpenVINO backend replaces the standard GGML graph execution path with Intel's OpenVINO inference engine. This approach allows the same GGUF model file to run on Intel CPUs, Intel GPUs (integrated and discrete), and Intel NPUs without changes to the model or the rest of the llama.cpp stack. When a `ggml_cgraph` is dispatched to OpenVINO backend, it:
|
||||
|
||||
- Walks the GGML graph and identifies inputs, outputs, weights, and KV cache tensors.
|
||||
- Translates the GGML operations into an `ov::Model` using OpenVINO's frontend API.
|
||||
- Compiles and caches the model for the target device.
|
||||
- Binds GGML tensor memory to OpenVINO inference tensors and runs inference.
|
||||
|
||||
## Supported Devices
|
||||
|
||||
OpenVINO backend supports the following hardware:
|
||||
|
||||
- Intel CPUs
|
||||
- Intel GPUs (integrated and discrete)
|
||||
- Intel NPUs
|
||||
|
||||
Although OpenVINO supports a wide range of [Intel hardware](https://docs.openvino.ai/2026/about-openvino/release-notes-openvino/system-requirements.html), the llama.cpp OpenVINO backend has been validated specifically on AI PCs such as the Intel® Core™ Ultra Series 1 and Series 2.
|
||||
|
||||
## Supported Model Precisions
|
||||
|
||||
- `FP16`
|
||||
- `BF16` (on Intel Xeon)
|
||||
- `Q8_0`
|
||||
- `Q4_0`
|
||||
- `Q4_1`
|
||||
- `Q4_K`
|
||||
- `Q4_K_M`
|
||||
- `Q5_K` (converted to Q8_0_C at runtime)
|
||||
- `Q6_K` (converted to Q8_0_C at runtime)
|
||||
|
||||
> [!NOTE]
|
||||
> Accuracy validation and performance optimizations for quantized models are a work in progress.
|
||||
|
||||
## Quantization Support Details
|
||||
|
||||
### CPU and GPU
|
||||
|
||||
- **`Q4_0`, `Q4_1`, `Q4_K_M`, `Q6_K` models are supported**
|
||||
- `Q5_K` and `Q6_K` tensors are converted to `Q8_0_C`
|
||||
|
||||
### NPU
|
||||
|
||||
- **Primary supported quantization scheme is `Q4_0`**
|
||||
- `Q6_K` tensors are requantized to `Q4_0_128` in general. For embedding weights, `Q6_K` tensors are requantized to `Q8_0_C` except for the token embedding matrix which is dequantized to fp16
|
||||
|
||||
### Additional Notes
|
||||
|
||||
- Both `Q4_0` and `Q4_1` models use `Q6_K` for the token embedding tensor and the final matmul weight tensor (often the same tensor)
|
||||
- `Q4_0` models may produce some `Q4_1` tensors if an imatrix is provided during quantization using `llama-quantize`
|
||||
- `Q4_K_M` models may include both `Q6_K` and `Q5_K` tensors (observed in Phi-3)
|
||||
|
||||
## Validated Models
|
||||
|
||||
The following models have been validated for functionality on Intel® Core™ Ultra Series 1 and Series 2:
|
||||
|
||||
- [Llama-3.2-1B-Instruct-GGUF](https://huggingface.co/unsloth/Llama-3.2-1B-Instruct-GGUF/)
|
||||
- [Llama-3.1-8B-Instruct](https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF)
|
||||
- [microsoft/Phi-3-mini-4k-instruct-gguf](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf)
|
||||
- [Qwen/Qwen2.5-1.5B-Instruct-GGUF](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct-GGUF)
|
||||
- [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B-GGUF)
|
||||
- [openbmb/MiniCPM-1B-sft-bf16](https://huggingface.co/openbmb/MiniCPM-S-1B-sft-gguf)
|
||||
- [tencent/Hunyuan-7B-Instruct](https://huggingface.co/bartowski/tencent_Hunyuan-7B-Instruct-GGUF)
|
||||
- [mistralai/Mistral-7B-Instruct-v0.3](https://huggingface.co/bartowski/Mistral-7B-Instruct-v0.3-GGUF)
|
||||
- [bartowski/DeepSeek-R1-Distill-Llama-8B-GGUF](https://huggingface.co/bartowski/DeepSeek-R1-Distill-Llama-8B-GGUF)
|
||||
|
||||
## Build Instructions
|
||||
|
||||
### Prerequisites
|
||||
|
||||
- Linux or Windows system with Intel hardware (CPU, GPU, or NPU)
|
||||
- **For Intel GPU or NPU Usage**: Install the appropriate hardware drivers for your Intel GPU or NPU. For detailed instructions, see: [Additional Configurations for Hardware Acceleration](https://docs.openvino.ai/2025/get-started/install-openvino/configurations.html).
|
||||
|
||||
- **Linux:**
|
||||
- Git, CMake, and Ninja software tools are needed for building.
|
||||
```bash
|
||||
sudo apt-get update
|
||||
sudo apt-get install -y build-essential libcurl4-openssl-dev libtbb12 cmake ninja-build python3-pip curl wget tar
|
||||
```
|
||||
- OpenCL
|
||||
```bash
|
||||
sudo apt install ocl-icd-opencl-dev opencl-headers opencl-clhpp-headers intel-opencl-icd
|
||||
```
|
||||
|
||||
- **Windows:**
|
||||
- Download and install [Microsoft Visual Studio 2022 Build Tools](https://aka.ms/vs/17/release/vs_BuildTools.exe). During installation, select the **"Desktop development with C++"** workload.
|
||||
|
||||
- Install required tools:
|
||||
```powershell
|
||||
# Windows PowerShell
|
||||
winget install Git.Git
|
||||
winget install GNU.Wget
|
||||
winget install Ninja-build.Ninja
|
||||
```
|
||||
|
||||
- Install **OpenCL** using **vcpkg**:
|
||||
```powershell
|
||||
# Windows PowerShell
|
||||
cd C:\
|
||||
git clone https://github.com/microsoft/vcpkg
|
||||
cd vcpkg
|
||||
.\bootstrap-vcpkg.bat
|
||||
.\vcpkg install opencl
|
||||
# Optional but recommended: Integrate vcpkg with Visual Studio / CMake:
|
||||
.\vcpkg integrate install
|
||||
```
|
||||
|
||||
### 1. Install OpenVINO Runtime
|
||||
|
||||
- Follow the guide to install OpenVINO Runtime from an archive file: [Linux](https://docs.openvino.ai/2026/get-started/install-openvino/install-openvino-archive-linux.html) | [Windows](https://docs.openvino.ai/2026/get-started/install-openvino/install-openvino-archive-windows.html)
|
||||
|
||||
- **Linux:**
|
||||
|
||||
<details>
|
||||
<summary>📦 Click to expand OpenVINO installation from an archive file on Ubuntu</summary>
|
||||
<br>
|
||||
|
||||
```bash
|
||||
wget https://raw.githubusercontent.com/ravi9/misc-scripts/main/openvino/ov-archive-install/install-openvino-from-archive.sh
|
||||
chmod +x install-openvino-from-archive.sh
|
||||
./install-openvino-from-archive.sh
|
||||
```
|
||||
|
||||
Verify OpenVINO is initialized properly:
|
||||
```bash
|
||||
echo $OpenVINO_DIR
|
||||
```
|
||||
</details>
|
||||
|
||||
|
||||
### 2. Build llama.cpp with OpenVINO Backend
|
||||
|
||||
Clone the OpenVINO-enabled llama.cpp fork and build it:
|
||||
|
||||
```bash
|
||||
git clone https://github.com/ggml-org/llama.cpp
|
||||
cd llama.cpp
|
||||
```
|
||||
|
||||
- **Linux:**
|
||||
```bash
|
||||
source /opt/intel/openvino/setupvars.sh
|
||||
cmake -B build/ReleaseOV -G Ninja -DCMAKE_BUILD_TYPE=Release -DGGML_OPENVINO=ON
|
||||
cmake --build build/ReleaseOV --parallel
|
||||
```
|
||||
|
||||
- **Windows:**
|
||||
```cmd
|
||||
# x64 Native Tools Command Prompt for VS 2022
|
||||
"C:\Program Files (x86)\Intel\openvino_2026.0\setupvars.bat"
|
||||
cmake -B build\ReleaseOV -G Ninja -DCMAKE_BUILD_TYPE=Release -DGGML_OPENVINO=ON -DLLAMA_CURL=OFF -DCMAKE_TOOLCHAIN_FILE=C:\vcpkg\scripts\buildsystems\vcpkg.cmake
|
||||
cmake --build build\ReleaseOV --parallel
|
||||
```
|
||||
> [!NOTE]
|
||||
> Use `x64 Native Tools Command Prompt` for Windows build. After building, you could use either `cmd` or `PowerShell` to run the OpenVINO backend.
|
||||
|
||||
### 3. Download Sample Model
|
||||
|
||||
Download models for testing:
|
||||
|
||||
```bash
|
||||
# Linux
|
||||
mkdir -p ~/models/
|
||||
wget https://huggingface.co/unsloth/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q4_0.gguf \
|
||||
-O ~/models/Llama-3.2-1B-Instruct-Q4_0.gguf
|
||||
|
||||
# Windows PowerShell
|
||||
mkdir C:\models
|
||||
Invoke-WebRequest -Uri https://huggingface.co/unsloth/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q4_0.gguf -OutFile C:\models\Llama-3.2-1B-Instruct-Q4_0.gguf
|
||||
|
||||
# Windows Command Line
|
||||
mkdir C:\models
|
||||
curl -L https://huggingface.co/unsloth/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q4_0.gguf -o C:\models\Llama-3.2-1B-Instruct-Q4_0.gguf
|
||||
```
|
||||
|
||||
### 4. Run Inference with OpenVINO Backend
|
||||
|
||||
When using the OpenVINO backend, the first inference token may have slightly higher latency due to on-the-fly conversion to the OpenVINO graph. Subsequent tokens and runs will be faster.
|
||||
|
||||
> [!NOTE]
|
||||
> Default context size is set to the model training context, which may be very large. For example, 131072 for Llama 3.2 1B, which may result in lower performance, especially on edge/laptop devices. Use `-c` to limit context size in supported llama.cpp tools for better performance. For example, `-c 512`.
|
||||
|
||||
```bash
|
||||
# If device is unset or unavailable, defaults to CPU.
|
||||
# If the system has multiple GPUs, use GPU.0 or GPU.1 to explicitly target a specific GPU.
|
||||
|
||||
# Linux
|
||||
export GGML_OPENVINO_DEVICE=GPU
|
||||
# Enable stateful execution with GPU device to avoid known stateless execution failures.
|
||||
export GGML_OPENVINO_STATEFUL_EXECUTION=1
|
||||
# To run llama-simple:
|
||||
./build/ReleaseOV/bin/llama-simple -m ~/models/Llama-3.2-1B-Instruct-Q4_0.gguf -n 50 "The story of AI is "
|
||||
# To run in chat mode:
|
||||
./build/ReleaseOV/bin/llama-cli -m ~/models/Llama-3.2-1B-Instruct-Q4_0.gguf -c 1024
|
||||
# To run llama-bench, -fa 1 is needed
|
||||
GGML_OPENVINO_STATEFUL_EXECUTION=1 GGML_OPENVINO_DEVICE=GPU ./build/ReleaseOV/bin/llama-bench -m ~/models/Llama-3.2-1B-Instruct-Q4_0.gguf -fa 1
|
||||
|
||||
# NPU: keep context small to avoid failures from very large model context windows.
|
||||
export GGML_OPENVINO_DEVICE=NPU
|
||||
./build/ReleaseOV/bin/llama-cli -m ~/models/Llama-3.2-1B-Instruct-Q4_0.gguf -c 512
|
||||
|
||||
# Windows Command Line
|
||||
set GGML_OPENVINO_DEVICE=GPU
|
||||
# Enable stateful execution with GPU device to avoid known stateless execution failures.
|
||||
set GGML_OPENVINO_STATEFUL_EXECUTION=1
|
||||
# Windows PowerShell
|
||||
$env:GGML_OPENVINO_DEVICE = "GPU"
|
||||
$env:GGML_OPENVINO_STATEFUL_EXECUTION = "1"
|
||||
|
||||
# To run llama-simple
|
||||
build\ReleaseOV\bin\llama-simple.exe -m "C:\models\Llama-3.2-1B-Instruct-Q4_0.gguf" -n 50 "The story of AI is "
|
||||
# To run in chat mode:
|
||||
build\ReleaseOV\bin\llama-cli.exe -m "C:\models\Llama-3.2-1B-Instruct-Q4_0.gguf" -c 1024
|
||||
# To run llama-bench, -fa 1 is needed
|
||||
build\ReleaseOV\bin\llama-bench.exe -m "C:\models\Llama-3.2-1B-Instruct-Q4_0.gguf" -fa 1
|
||||
|
||||
# NPU: keep context small to avoid failures from very large model context windows.
|
||||
# Windows Command Line
|
||||
set GGML_OPENVINO_DEVICE=NPU
|
||||
# Windows PowerShell
|
||||
$env:GGML_OPENVINO_DEVICE = "NPU"
|
||||
build\ReleaseOV\bin\llama-cli.exe -m "C:\models\Llama-3.2-1B-Instruct-Q4_0.gguf" -c 512
|
||||
```
|
||||
> [!NOTE]
|
||||
> On systems with multiple GPUs, use `GPU.0` or `GPU.1` to explicitly target specific GPU. See [OpenVINO GPU Device](https://docs.openvino.ai/2026/openvino-workflow/running-inference/inference-devices-and-modes/gpu-device.html) for more details.
|
||||
|
||||
### Known Issues and Current Workarounds
|
||||
|
||||
- GPU stateless execution is currently affected by a known issue.
|
||||
- Workaround: set `GGML_OPENVINO_STATEFUL_EXECUTION=1` when using GPU device.
|
||||
- NPU failures can happen when context size is too large. Recent llama.cpp behavior may resolve context size to the model training context (for example, 131072 for Llama 3.2 1B), which is too large for current NPU usage and can also stress laptop CPU/GPU on larger models. To inspect the selected context size, run `llama-cli` or `llama-server` with `-lv 3`.
|
||||
- Workaround: explicitly set context size, for ex. `-c 1024` for NPU runs. Performance will be better with lower context size.
|
||||
- Additional NPU limitations:
|
||||
- Model caching is not yet supported.
|
||||
- `llama-server -np > 1` (multiple parallel sequences) is not supported.
|
||||
- `llama-perplexity` is only supported with `-b 512` or smaller.
|
||||
- `--context-shift` with `llama-cli` is currently not supported with OpenVINO backend across CPU, GPU, and NPU devices.
|
||||
- Encoder models (embedding, reranking) are not supported with the current OpenVINO backend implementation.
|
||||
- `-fa 1` is required when running llama-bench with the OpenVINO backend.
|
||||
- `GGML_OPENVINO_STATEFUL_EXECUTION=1 GGML_OPENVINO_DEVICE=GPU ./llama-bench -fa 1`
|
||||
- `llama-server` with OpenVINO backend supports only one chat session/thread, when `GGML_OPENVINO_STATEFUL_EXECUTION=1` is enabled.
|
||||
|
||||
> [!NOTE]
|
||||
> The OpenVINO backend is actively under development. Fixes are underway, and this document will continue to be updated as issues are resolved.
|
||||
|
||||
|
||||
### Docker Build
|
||||
|
||||
You can build and run llama.cpp with OpenVINO backend using Docker.
|
||||
|
||||
```bash
|
||||
# Build the base runtime image with compiled shared libraries and minimal dependencies.
|
||||
docker build -t llama-openvino:base -f .devops/openvino.Dockerfile .
|
||||
|
||||
# Build the complete image with all binaries, Python tools, gguf-py library, and model conversion utilities.
|
||||
docker build --target=full -t llama-openvino:full -f .devops/openvino.Dockerfile .
|
||||
|
||||
# Build a minimal CLI-only image containing just the llama-cli executable.
|
||||
docker build --target=light -t llama-openvino:light -f .devops/openvino.Dockerfile .
|
||||
|
||||
# Builds a server-only image with llama-server executable, health check endpoint, and REST API support.
|
||||
docker build --target=server -t llama-openvino:server -f .devops/openvino.Dockerfile .
|
||||
|
||||
# If you are behind a proxy:
|
||||
docker build --build-arg http_proxy=$http_proxy --build-arg https_proxy=$https_proxy --target=light -t llama-openvino:light -f .devops/openvino.Dockerfile .
|
||||
```
|
||||
|
||||
Run llama.cpp with OpenVINO backend Docker container.
|
||||
Save sample models in `~/models` as [shown above](#3-download-sample-model). It will be mounted to the container in the examples below.
|
||||
|
||||
|
||||
```bash
|
||||
# Run Docker container
|
||||
docker run --rm -it -v ~/models:/models llama-openvino:light --no-warmup -c 1024 -m /models/Llama-3.2-1B-Instruct-Q4_0.gguf
|
||||
|
||||
# With Intel GPU access (iGPU or dGPU)
|
||||
docker run --rm -it -v ~/models:/models \
|
||||
--device=/dev/dri --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) -u $(id -u):$(id -g) \
|
||||
--env=GGML_OPENVINO_DEVICE=GPU --env=GGML_OPENVINO_STATEFUL_EXECUTION=1 \
|
||||
llama-openvino:light --no-warmup -c 1024 -m /models/Llama-3.2-1B-Instruct-Q4_0.gguf
|
||||
|
||||
# With Intel NPU access
|
||||
docker run --rm -it -v ~/models:/models \
|
||||
--device=/dev/accel --group-add=$(stat -c "%g" /dev/dri/render* | head -n 1) -u $(id -u):$(id -g) \
|
||||
--env=GGML_OPENVINO_DEVICE=NPU \
|
||||
llama-openvino:light --no-warmup -c 1024 -m /models/Llama-3.2-1B-Instruct-Q4_0.gguf
|
||||
```
|
||||
|
||||
Run Llama.cpp Server with OpenVINO Backend.
|
||||
> [!NOTE]
|
||||
> `llama-server` with OpenVINO backend supports only one chat session/thread, when `GGML_OPENVINO_STATEFUL_EXECUTION=1` is enabled.
|
||||
|
||||
```bash
|
||||
# Run the Server Docker container
|
||||
docker run --rm -it -p 8080:8080 -v ~/models:/models llama-openvino:server --no-warmup -m /models/Llama-3.2-1B-Instruct-Q4_0.gguf -c 1024
|
||||
# Or Using llama-server executable
|
||||
./build/ReleaseOV/bin/llama-server -m ~/models/Llama-3.2-1B-Instruct-Q4_0.gguf --port 8080 -c 1024
|
||||
|
||||
# If you are behind a proxy, make sure to set NO_PROXY to avoid proxy for localhost
|
||||
export NO_PROXY=localhost,127.0.0.1
|
||||
|
||||
# Option 1: Open your browser to http://localhost:8080 to access the web UI for the llama.cpp server.
|
||||
# Option 2: In a NEW terminal, test the server with curl
|
||||
|
||||
# Test health endpoint
|
||||
curl -f http://localhost:8080/health
|
||||
|
||||
# Test with a simple prompt
|
||||
curl -X POST "http://localhost:8080/v1/chat/completions" -H "Content-Type: application/json" \
|
||||
-d '{"messages":[{"role":"user","content":"Write a poem about OpenVINO"}],"max_tokens":100}' | jq .
|
||||
```
|
||||
|
||||
## Runtime Configuration
|
||||
|
||||
The OpenVINO backend can be configured using the following environment variables at runtime to control device selection, caching, debugging, and profiling behavior.
|
||||
|
||||
### Configuration Options
|
||||
|
||||
| Variable | Default | Description |
|
||||
|-----------------------------------|------------|-------------------------------------------------------------------------------------------------------------|
|
||||
| `GGML_OPENVINO_DEVICE` | `CPU` | Specify the target device (CPU, GPU, NPU). On systems with multiple GPUs, use `GPU.0` or `GPU.1` to explicitly target specific GPU. See [OpenVINO GPU Device](https://docs.openvino.ai/2026/openvino-workflow/running-inference/inference-devices-and-modes/gpu-device.html). When set to **NPU**, static compilation mode is enabled for optimal performance. |
|
||||
| `GGML_OPENVINO_CACHE_DIR` | `not set` | Directory for OpenVINO model caching (recommended: `/tmp/ov_cache`). Enables model caching when set. **Not supported on NPU devices.** |
|
||||
| `GGML_OPENVINO_PREFILL_CHUNK_SIZE`| `256` | Token chunk size for **NPU** prefill. |
|
||||
| `GGML_OPENVINO_STATEFUL_EXECUTION`| `0` | Enable stateful KV cache on for better performance. Recommended on CPU, GPU. |
|
||||
| `GGML_OPENVINO_PROFILING` | `0` | Enable execution-time profiling. |
|
||||
| `GGML_OPENVINO_DUMP_CGRAPH` | `0` | Dump the GGML compute graph to `cgraph_ov.txt`. |
|
||||
| `GGML_OPENVINO_DUMP_IR` | `0` | Serialize OpenVINO IR files with timestamps. |
|
||||
| `GGML_OPENVINO_DEBUG_INPUT` | `0` | Enable input debugging and print input tensor info. |
|
||||
| `GGML_OPENVINO_DEBUG_OUTPUT` | `0` | Enable output debugging and print output tensor info. |
|
||||
| `GGML_OPENVINO_PRINT_CGRAPH_TENSOR_ADDRESS` | `0` | Print tensor address map once. |
|
||||
|
||||
> [!NOTE]
|
||||
>`GGML_OPENVINO_STATEFUL_EXECUTION` is an **Experimental** feature to allow stateful execution for managing the KV cache internally inside the OpenVINO model, improving performance on CPUs and GPUs. Stateful execution is not effective on NPUs, and not all models currently support this feature. This feature is experimental and has been validated only with the llama-simple, llama-cli, llama-bench, and llama-run applications and is recommended to enable for the best performance. Other applications, such as llama-server and llama-perplexity, are not yet supported.
|
||||
|
||||
### Example Usage
|
||||
|
||||
#### GPU Inference with Profiling
|
||||
|
||||
```bash
|
||||
# If the system has multiple GPUs, use GPU.0 or GPU.1 to explicitly target a specific GPU.
|
||||
|
||||
# Linux
|
||||
export GGML_OPENVINO_CACHE_DIR=/tmp/ov_cache
|
||||
export GGML_OPENVINO_PROFILING=1
|
||||
export GGML_OPENVINO_DEVICE=GPU
|
||||
export GGML_OPENVINO_STATEFUL_EXECUTION=1
|
||||
|
||||
./build/ReleaseOV/bin/llama-simple -m ~/models/Llama-3.2-1B-Instruct-Q4_0.gguf -n 50 "The story of AI is "
|
||||
|
||||
# Windows Command Line
|
||||
set GGML_OPENVINO_CACHE_DIR=C:\tmp\ov_cache
|
||||
set GGML_OPENVINO_PROFILING=1
|
||||
set GGML_OPENVINO_DEVICE=GPU
|
||||
set GGML_OPENVINO_STATEFUL_EXECUTION=1
|
||||
|
||||
# Windows PowerShell
|
||||
$env:GGML_OPENVINO_CACHE_DIR = "C:\tmp\ov_cache"
|
||||
$env:GGML_OPENVINO_PROFILING = "1"
|
||||
$env:GGML_OPENVINO_DEVICE = "GPU"
|
||||
$env:GGML_OPENVINO_STATEFUL_EXECUTION = "1"
|
||||
|
||||
build\ReleaseOV\bin\llama-simple.exe -m "C:\models\Llama-3.2-1B-Instruct-Q4_0.gguf" -n 50 "The story of AI is "
|
||||
|
||||
```
|
||||
|
||||
## Llama.cpp Tools
|
||||
|
||||
The following tools work with the OpenVINO backend on CPU, GPU, NPU:
|
||||
- llama-bench
|
||||
- llama-cli
|
||||
- llama-completion
|
||||
- llama-perplexity
|
||||
- llama-server
|
||||
- llama-simple
|
||||
|
||||
## Work in Progress
|
||||
|
||||
- Performance and memory optimizations
|
||||
- Accuracy validation
|
||||
- Broader quantization coverage
|
||||
- Support for additional model architectures
|
||||
833
docs/backend/SYCL.md
Normal file
833
docs/backend/SYCL.md
Normal file
@@ -0,0 +1,833 @@
|
||||
# llama.cpp for SYCL
|
||||
|
||||
- [Background](#background)
|
||||
- [Recommended Release](#recommended-release)
|
||||
- [News](#news)
|
||||
- [OS](#os)
|
||||
- [Hardware](#hardware)
|
||||
- [Docker](#docker)
|
||||
- [Linux](#linux)
|
||||
- [Windows](#windows)
|
||||
- [Environment Variable](#environment-variable)
|
||||
- [Design Rule](#design-rule)
|
||||
- [Known Issue](#known-issues)
|
||||
- [Q&A](#qa)
|
||||
- [TODO](#todo)
|
||||
|
||||
## Background
|
||||
|
||||
**SYCL** is a high-level parallel programming model designed to improve developers productivity writing code across various hardware accelerators such as CPUs, GPUs, and FPGAs. It is a single-source language designed for heterogeneous computing and based on standard C++17.
|
||||
|
||||
**oneAPI** is an open ecosystem and a standard-based specification, supporting multiple architectures including but not limited to Intel CPUs, GPUs and FPGAs. The key components of the oneAPI ecosystem include:
|
||||
|
||||
- **DPCPP** *(Data Parallel C++)*: The primary oneAPI SYCL implementation, which includes the icpx/icx Compilers.
|
||||
- **oneAPI Libraries**: A set of highly optimized libraries targeting multiple domains *(e.g. Intel oneMKL, oneMath and oneDNN)*.
|
||||
- **oneAPI LevelZero**: A high performance low level interface for fine-grained control over Intel iGPUs and dGPUs.
|
||||
|
||||
### Llama.cpp + SYCL
|
||||
|
||||
The llama.cpp SYCL backend is primarily designed for **Intel GPUs**.
|
||||
SYCL cross-platform capabilities enable support for other vendor GPUs as well.
|
||||
|
||||
## Recommended Release
|
||||
|
||||
### Windows
|
||||
|
||||
The following releases are verified and recommended:
|
||||
|
||||
|Commit ID|Tag|Release|Verified Platform| Update date|
|
||||
|-|-|-|-|-|
|
||||
|24e86cae7219b0f3ede1d5abdf5bf3ad515cccb8|b5377 |[llama-b5377-bin-win-sycl-x64.zip](https://github.com/ggml-org/llama.cpp/releases/download/b5377/llama-b5377-bin-win-sycl-x64.zip) |Arc B580/Linux/oneAPI 2025.1<br>LNL Arc GPU/Windows 11/oneAPI 2025.1.1|2025-05-15|
|
||||
|3bcd40b3c593d14261fb2abfabad3c0fb5b9e318|b4040 |[llama-b4040-bin-win-sycl-x64.zip](https://github.com/ggml-org/llama.cpp/releases/download/b4040/llama-b4040-bin-win-sycl-x64.zip) |Arc A770/Linux/oneAPI 2024.1<br>MTL Arc GPU/Windows 11/oneAPI 2024.1| 2024-11-19|
|
||||
|fb76ec31a9914b7761c1727303ab30380fd4f05c|b3038 |[llama-b3038-bin-win-sycl-x64.zip](https://github.com/ggml-org/llama.cpp/releases/download/b3038/llama-b3038-bin-win-sycl-x64.zip) |Arc A770/Linux/oneAPI 2024.1<br>MTL Arc GPU/Windows 11/oneAPI 2024.1||
|
||||
|
||||
### Ubuntu 24.04
|
||||
|
||||
The release packages for Ubuntu 24.04 x64 (FP32/FP16) only include the binary files of the llama.cpp SYCL backend. They require the target machine to have pre-installed Intel GPU drivers and oneAPI packages that are the same version as the build package. To get the version and installation info, refer to release.yml: ubuntu-24-sycl -> Download & Install oneAPI.
|
||||
|
||||
It is recommended to use them with Intel Docker.
|
||||
|
||||
The packages for FP32 and FP16 would have different accuracy and performance on LLMs. Please choose it acording to the test result.
|
||||
|
||||
## News
|
||||
|
||||
- 2026.04
|
||||
|
||||
- Optimize mul_mat by reorder feature for data type: Q4_K, Q5_K, Q_K, Q8_0.
|
||||
- Fused MoE.
|
||||
- Upgrate CI and built package for oneAPI 2025.3.3, support Ubuntu 24.04 built package.
|
||||
|
||||
- 2026.03
|
||||
- Support Flash-Attention: less memory usage, performance impact depends on LLM.
|
||||
|
||||
- 2026.02
|
||||
- Remove support for Nvidia & AMD GPU, because the oneAPI plugin for Nvidia & AMD GPU is unavailable: download/installation channels are out of work. User can't build up the software for Nvidia & AMD GPU.
|
||||
|
||||
- 2025.11
|
||||
- Support malloc memory on device more than 4GB.
|
||||
|
||||
- 2025.2
|
||||
- Optimize MUL_MAT Q4_0 on Intel GPU for all dGPUs and built-in GPUs since MTL. Increase the performance of LLM (llama-2-7b.Q4_0.gguf) 21%-87% on Intel GPUs (MTL, ARL-H, Arc, Flex, PVC).
|
||||
|GPU|Base tokens/s|Increased tokens/s|Percent|
|
||||
|-|-|-|-|
|
||||
|PVC 1550|39|73|+87%|
|
||||
|Flex 170|39|50|+28%|
|
||||
|Arc A770|42|55|+30%|
|
||||
|MTL|13|16|+23%|
|
||||
|ARL-H|14|17|+21%|
|
||||
|
||||
- 2024.11
|
||||
- Use syclcompat to improve the performance on some platforms. This requires to use oneAPI 2025.0 or newer.
|
||||
|
||||
- 2024.8
|
||||
- Use oneDNN as the default GEMM library, improve the compatibility for new Intel GPUs.
|
||||
|
||||
- 2024.5
|
||||
- Performance is increased: 34 -> 37 tokens/s of llama-2-7b.Q4_0 on Arc A770.
|
||||
- Arch Linux is verified successfully.
|
||||
|
||||
- 2024.4
|
||||
- Support data types: GGML_TYPE_IQ4_NL, GGML_TYPE_IQ4_XS, GGML_TYPE_IQ3_XXS, GGML_TYPE_IQ3_S, GGML_TYPE_IQ2_XXS, GGML_TYPE_IQ2_XS, GGML_TYPE_IQ2_S, GGML_TYPE_IQ1_S, GGML_TYPE_IQ1_M.
|
||||
|
||||
- 2024.3
|
||||
- Release binary files of Windows.
|
||||
- A blog is published: **Run LLM on all Intel GPUs Using llama.cpp**: [intel.com](https://www.intel.com/content/www/us/en/developer/articles/technical/run-llm-on-all-gpus-using-llama-cpp-artical.html) or [medium.com](https://medium.com/@jianyu_neo/run-llm-on-all-intel-gpus-using-llama-cpp-fd2e2dcbd9bd).
|
||||
- New base line is ready: [tag b2437](https://github.com/ggml-org/llama.cpp/tree/b2437).
|
||||
- Support multiple cards: **--split-mode**: [none|layer]; not support [row], it's on developing.
|
||||
- Support to assign main GPU by **--main-gpu**, replace $GGML_SYCL_DEVICE.
|
||||
- Support detecting all GPUs with level-zero and same top **Max compute units**.
|
||||
- Support OPs
|
||||
- hardsigmoid
|
||||
- hardswish
|
||||
- pool2d
|
||||
|
||||
- 2024.1
|
||||
- Create SYCL backend for Intel GPU.
|
||||
- Support Windows build
|
||||
|
||||
## OS
|
||||
|
||||
| OS | Status | Verified |
|
||||
|---------|---------|------------------------------------------------|
|
||||
| Linux | Support | Ubuntu 22.04, Fedora Silverblue 39, Arch Linux |
|
||||
| Windows | Support | Windows 11 |
|
||||
|
||||
|
||||
## Hardware
|
||||
|
||||
### Intel GPU
|
||||
|
||||
SYCL backend supports Intel GPU Family:
|
||||
|
||||
- Intel Data Center Max Series
|
||||
- Intel Flex Series, Arc Series
|
||||
- Intel Built-in Arc GPU
|
||||
- Intel iGPU in Core CPU (11th Generation Core CPU and newer, refer to [oneAPI supported GPU](https://www.intel.com/content/www/us/en/developer/articles/system-requirements/intel-oneapi-base-toolkit-system-requirements.html#inpage-nav-1-1)).
|
||||
|
||||
On older Intel GPUs, you may try [OpenCL](/docs/backend/OPENCL.md) although the performance is not optimal, and some GPUs may not support OpenCL nor have any GPGPU capabilities.
|
||||
|
||||
#### Verified devices
|
||||
|
||||
| Intel GPU | Status | Verified Model |
|
||||
|-------------------------------|---------|---------------------------------------|
|
||||
| Intel Data Center Max Series | Support | Max 1550, 1100 |
|
||||
| Intel Data Center Flex Series | Support | Flex 170 |
|
||||
| Intel Arc A-Series | Support | Arc A770, Arc A730M, Arc A750 |
|
||||
| Intel Arc B-Series | Support | Arc B580 |
|
||||
| Intel built-in Arc GPU | Support | built-in Arc GPU in Meteor Lake, Arrow Lake, Lunar Lake |
|
||||
| Intel iGPU | Support | iGPU in 13700k, 13400, i5-1250P, i7-1260P, i7-1165G7 |
|
||||
|
||||
*Notes:*
|
||||
|
||||
- **Memory**
|
||||
- The device memory is a limitation when running a large model. The loaded model size, *`llm_load_tensors: buffer_size`*, is displayed in the log when running `./bin/llama-completion`.
|
||||
- Please make sure the GPU shared memory from the host is large enough to account for the model's size. For e.g. the *llama-2-7b.Q4_0* requires at least 8.0GB for integrated GPU and 4.0GB for discrete GPU.
|
||||
|
||||
- **Execution Unit (EU)**
|
||||
- If the iGPU has less than 80 EUs, the inference speed will likely be too slow for practical use.
|
||||
|
||||
### Other Vendor GPU
|
||||
|
||||
NA
|
||||
|
||||
## Docker
|
||||
|
||||
The docker build option is currently limited to *Intel GPU* targets.
|
||||
|
||||
### Build image
|
||||
|
||||
```sh
|
||||
# Using FP32
|
||||
docker build -t llama-cpp-sycl --build-arg="GGML_SYCL_F16=OFF" --target light -f .devops/intel.Dockerfile .
|
||||
|
||||
# Using FP16
|
||||
docker build -t llama-cpp-sycl --build-arg="GGML_SYCL_F16=ON" --target light -f .devops/intel.Dockerfile .
|
||||
```
|
||||
|
||||
*Notes*:
|
||||
|
||||
You can also use the `.devops/llama-server-intel.Dockerfile`, which builds the *"server"* alternative.
|
||||
Check the [documentation for Docker](../docker.md) to see the available images.
|
||||
|
||||
### Run container
|
||||
|
||||
```sh
|
||||
# First, find all the DRI cards
|
||||
ls -la /dev/dri
|
||||
# Then, pick the card that you want to use (here for e.g. /dev/dri/card1).
|
||||
docker run -it --rm -v "/path/to/models:/models" --device /dev/dri/renderD128:/dev/dri/renderD128 --device /dev/dri/card0:/dev/dri/card0 llama-cpp-sycl -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33 -c 4096 -s 0
|
||||
```
|
||||
|
||||
*Notes:*
|
||||
- Docker has been tested successfully on native Linux. WSL support has not been verified yet.
|
||||
- You may need to install Intel GPU driver on the **host** machine *(Please refer to the [Linux configuration](#linux) for details)*.
|
||||
|
||||
## Linux
|
||||
|
||||
### I. Setup Environment
|
||||
|
||||
1. **Install GPU drivers**
|
||||
|
||||
- **Intel GPU**
|
||||
|
||||
Intel data center GPUs drivers installation guide and download page can be found here: [Get intel dGPU Drivers](https://dgpu-docs.intel.com/driver/installation.html#ubuntu-install-steps).
|
||||
|
||||
*Note*: for client GPUs *(iGPU & Arc A-Series)*, please refer to the [client iGPU driver installation](https://dgpu-docs.intel.com/driver/client/overview.html).
|
||||
|
||||
Once installed, add the user(s) to the `video` and `render` groups.
|
||||
|
||||
```sh
|
||||
sudo usermod -aG render $USER
|
||||
sudo usermod -aG video $USER
|
||||
```
|
||||
|
||||
*Note*: logout/re-login for the changes to take effect.
|
||||
|
||||
Verify installation through `clinfo`:
|
||||
|
||||
```sh
|
||||
sudo apt install clinfo
|
||||
sudo clinfo -l
|
||||
```
|
||||
|
||||
Sample output:
|
||||
|
||||
```sh
|
||||
Platform #0: Intel(R) OpenCL Graphics
|
||||
`-- Device #0: Intel(R) Arc(TM) A770 Graphics
|
||||
|
||||
Platform #0: Intel(R) OpenCL HD Graphics
|
||||
`-- Device #0: Intel(R) Iris(R) Xe Graphics [0x9a49]
|
||||
```
|
||||
|
||||
2. **Install Intel® oneAPI Base toolkit**
|
||||
|
||||
SYCL backend depends on:
|
||||
- Intel® oneAPI DPC++/C++ compiler/running-time.
|
||||
- Intel® oneAPI DPC++/C++ library (oneDPL).
|
||||
- Intel® oneAPI Deep Neural Network Library (oneDNN).
|
||||
- Intel® oneAPI Math Kernel Library (oneMKL).
|
||||
|
||||
- **For Intel GPU**
|
||||
|
||||
All above are included in both **Intel® oneAPI Base toolkit** and **Intel® Deep Learning Essentials** packages.
|
||||
|
||||
It's recommended to install **Intel® Deep Learning Essentials** which only provides the necessary libraries with less size.
|
||||
|
||||
The **Intel® oneAPI Base toolkit** and **Intel® Deep Learning Essentials** can be obtained from the official [Intel® oneAPI Base Toolkit](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit.html) page.
|
||||
|
||||
Please follow the instructions for downloading and installing the Toolkit for Linux, and preferably keep the default installation values unchanged, notably the installation path *(`/opt/intel/oneapi` by default)*.
|
||||
|
||||
Following guidelines/code snippets assume the default installation values. Otherwise, please make sure the necessary changes are reflected where applicable.
|
||||
|
||||
Upon a successful installation, SYCL is enabled for the available intel devices, along with relevant libraries such as oneAPI oneDNN for Intel GPUs.
|
||||
|
||||
|Verified release|
|
||||
|-|
|
||||
|2025.3.3 |
|
||||
|2025.2.1|
|
||||
|2025.1|
|
||||
|2024.1|
|
||||
|
||||
3. **Verify installation and environment**
|
||||
|
||||
In order to check the available SYCL devices on the machine, please use the `sycl-ls` command.
|
||||
```sh
|
||||
source /opt/intel/oneapi/setvars.sh
|
||||
sycl-ls
|
||||
```
|
||||
|
||||
- **Intel GPU**
|
||||
|
||||
When targeting an intel GPU, the user should expect one or more devices among the available SYCL devices. Please make sure that at least one GPU is present via `sycl-ls`, for instance `[level_zero:gpu]` in the sample output below:
|
||||
|
||||
```
|
||||
[level_zero:gpu][level_zero:0] Intel(R) oneAPI Unified Runtime over Level-Zero, Intel(R) Arc(TM) A770 Graphics 12.55.8 [1.3.29735+27]
|
||||
[level_zero:gpu][level_zero:1] Intel(R) oneAPI Unified Runtime over Level-Zero, Intel(R) UHD Graphics 730 12.2.0 [1.3.29735+27]
|
||||
[opencl:cpu][opencl:0] Intel(R) OpenCL, 13th Gen Intel(R) Core(TM) i5-13400 OpenCL 3.0 (Build 0) [2025.20.8.0.06_160000]
|
||||
[opencl:gpu][opencl:1] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics OpenCL 3.0 NEO [24.39.31294]
|
||||
[opencl:gpu][opencl:2] Intel(R) OpenCL Graphics, Intel(R) UHD Graphics 730 OpenCL 3.0 NEO [24.39.31294]
|
||||
```
|
||||
|
||||
### II. Build llama.cpp
|
||||
|
||||
#### Intel GPU
|
||||
|
||||
```sh
|
||||
./examples/sycl/build.sh
|
||||
```
|
||||
|
||||
or
|
||||
|
||||
```sh
|
||||
# Export relevant ENV variables
|
||||
source /opt/intel/oneapi/setvars.sh
|
||||
|
||||
# Option 1: Use FP32 (recommended for better performance in most cases)
|
||||
cmake -B build -DGGML_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
|
||||
|
||||
# Option 2: Use FP16
|
||||
cmake -B build -DGGML_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DGGML_SYCL_F16=ON
|
||||
|
||||
# build all binary
|
||||
cmake --build build --config Release -j -v
|
||||
```
|
||||
|
||||
It is possible to come across some precision issues when running tests that stem from using faster
|
||||
instructions, which can be circumvented by setting the environment variable `SYCL_PROGRAM_COMPILE_OPTIONS`
|
||||
as `-cl-fp32-correctly-rounded-divide-sqrt`
|
||||
|
||||
### III. Run the inference
|
||||
|
||||
#### Retrieve and prepare model
|
||||
|
||||
You can refer to the general [*Obtaining and quantizing models*](../../README.md#obtaining-and-quantizing-models) guide for model preparation, or download an already quantized model like [llama-2-7b.Q4_0.gguf](https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q4_0.gguf?download=true) or [Meta-Llama-3-8B-Instruct-Q4_0.gguf](https://huggingface.co/aptha/Meta-Llama-3-8B-Instruct-Q4_0-GGUF/resolve/main/Meta-Llama-3-8B-Instruct-Q4_0.gguf).
|
||||
|
||||
##### Check device
|
||||
|
||||
1. Enable oneAPI running environment
|
||||
|
||||
```sh
|
||||
source /opt/intel/oneapi/setvars.sh
|
||||
```
|
||||
|
||||
2. List devices information
|
||||
|
||||
Similar to the native `sycl-ls`, available SYCL devices can be queried as follow:
|
||||
|
||||
```sh
|
||||
./build/bin/llama-ls-sycl-device
|
||||
```
|
||||
|
||||
This command will only display the selected backend that is supported by SYCL. The default backend is level_zero. For example, in a system with 2 *intel GPU* it would look like the following:
|
||||
```
|
||||
found 2 SYCL devices:
|
||||
|
||||
| | | |Compute |Max compute|Max work|Max sub| |
|
||||
|ID| Device Type| Name|capability|units |group |group |Global mem size|
|
||||
|--|------------------|---------------------------------------------|----------|-----------|--------|-------|---------------|
|
||||
| 0|[level_zero:gpu:0]| Intel(R) Arc(TM) A770 Graphics| 1.3| 512| 1024| 32| 16225243136|
|
||||
| 1|[level_zero:gpu:1]| Intel(R) UHD Graphics 770| 1.3| 32| 512| 32| 53651849216|
|
||||
```
|
||||
|
||||
#### Choose level-zero devices
|
||||
|
||||
|Chosen Device ID|Setting|
|
||||
|-|-|
|
||||
|0|`export ONEAPI_DEVICE_SELECTOR="level_zero:0"` or no action|
|
||||
|1|`export ONEAPI_DEVICE_SELECTOR="level_zero:1"`|
|
||||
|0 & 1|`export ONEAPI_DEVICE_SELECTOR="level_zero:0;level_zero:1"`|
|
||||
|
||||
#### Execute
|
||||
|
||||
Choose one of following methods to run.
|
||||
|
||||
1. Script
|
||||
|
||||
- Use device 0:
|
||||
|
||||
```sh
|
||||
./examples/sycl/test.sh -mg 0
|
||||
```
|
||||
- Use multiple devices:
|
||||
|
||||
```sh
|
||||
./examples/sycl/test.sh
|
||||
```
|
||||
|
||||
- Run llama-server:
|
||||
|
||||
```sh
|
||||
./examples/sycl/start-svr.sh -m PATH/MODEL_FILE
|
||||
```
|
||||
|
||||
2. Command line
|
||||
Launch inference
|
||||
|
||||
There are two device selection modes:
|
||||
|
||||
- Single device: Use one device assigned by user. Default device id is 0.
|
||||
- Multiple devices: Automatically choose the devices with the same backend.
|
||||
|
||||
In two device selection modes, the default SYCL backend is level_zero, you can choose other backend supported by SYCL by setting environment variable ONEAPI_DEVICE_SELECTOR.
|
||||
|
||||
| Device selection | Parameter |
|
||||
|------------------|----------------------------------------|
|
||||
| Single device | --split-mode none --main-gpu DEVICE_ID |
|
||||
| Multiple devices | --split-mode layer (default) |
|
||||
|
||||
Examples:
|
||||
|
||||
- Use device 0:
|
||||
|
||||
```sh
|
||||
ZES_ENABLE_SYSMAN=1 ./build/bin/llama-completion -no-cnv -m models/llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 99 -sm none -mg 0 --mmap
|
||||
```
|
||||
|
||||
- Use multiple devices:
|
||||
|
||||
```sh
|
||||
ZES_ENABLE_SYSMAN=1 ./build/bin/llama-completion -no-cnv -m models/llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 99 -sm layer --mmap
|
||||
```
|
||||
|
||||
*Notes:*
|
||||
|
||||
- Upon execution, verify the selected device(s) ID(s) in the output log, which can for instance be displayed as follow:
|
||||
|
||||
```sh
|
||||
detect 1 SYCL GPUs: [0] with top Max compute units:512
|
||||
```
|
||||
Or
|
||||
```sh
|
||||
use 1 SYCL GPUs: [0] with Max compute units:512
|
||||
```
|
||||
|
||||
## Windows
|
||||
|
||||
### Install GPU driver
|
||||
|
||||
Intel GPU drivers instructions guide and download page can be found here: [Get Intel GPU Drivers](https://www.intel.com/content/www/us/en/products/docs/discrete-gpus/arc/software/drivers.html).
|
||||
|
||||
### Option 1: download the binary package directly
|
||||
|
||||
Download the binary package for Windows from: https://github.com/ggml-org/llama.cpp/releases.
|
||||
|
||||
Extract the package to local folder, run the llama tools directly. Refer to [Run the inference](#iii-run-the-inference-1).
|
||||
|
||||
Note, the package includes the SYCL running time and all depended dll files, no need to install oneAPI package and activte them.
|
||||
|
||||
### Option 2: build locally from the source code.
|
||||
|
||||
#### I. Setup environment
|
||||
|
||||
1. Install Visual Studio
|
||||
|
||||
If you already have a recent version of Microsoft Visual Studio, you can skip this step. Otherwise, please refer to the official download page for [Microsoft Visual Studio](https://visualstudio.microsoft.com/).
|
||||
|
||||
2. Install Intel® oneAPI Base toolkit
|
||||
|
||||
SYCL backend depends on:
|
||||
- Intel® oneAPI DPC++/C++ compiler/running-time.
|
||||
- Intel® oneAPI DPC++/C++ library (oneDPL).
|
||||
- Intel® oneAPI Deep Neural Network Library (oneDNN).
|
||||
- Intel® oneAPI Math Kernel Library (oneMKL).
|
||||
|
||||
All above are included in both **Intel® oneAPI Base toolkit** and **Intel® Deep Learning Essentials** packages.
|
||||
|
||||
It's recommended to install **Intel® Deep Learning Essentials** which only provides the necessary libraries with less size.
|
||||
|
||||
The **Intel® oneAPI Base toolkit** and **Intel® Deep Learning Essentials** can be obtained from the official [Intel® oneAPI Base Toolkit](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit.html) page.
|
||||
|
||||
Please follow the instructions for downloading and installing the Toolkit for Windows, and preferably keep the default installation values unchanged, notably the installation path *(`C:\Program Files (x86)\Intel\oneAPI` by default)*.
|
||||
|
||||
Following guidelines/code snippets assume the default installation values. Otherwise, please make sure the necessary changes are reflected where applicable.
|
||||
|
||||
b. Enable oneAPI running environment:
|
||||
|
||||
- Type "oneAPI" in the search bar, then open the `Intel oneAPI command prompt for Intel 64 for Visual Studio 2022` App.
|
||||
|
||||
- On the command prompt, enable the runtime environment with the following:
|
||||
```
|
||||
"C:\Program Files (x86)\Intel\oneAPI\setvars.bat" intel64
|
||||
```
|
||||
|
||||
- if you are using Powershell, enable the runtime environment with the following:
|
||||
|
||||
```
|
||||
cmd.exe "/K" '"C:\Program Files (x86)\Intel\oneAPI\setvars.bat" && powershell'
|
||||
```
|
||||
|
||||
c. Verify installation
|
||||
|
||||
In the oneAPI command line, run the following to print the available SYCL devices:
|
||||
|
||||
```
|
||||
sycl-ls.exe
|
||||
```
|
||||
|
||||
There should be one or more *level-zero* GPU devices displayed as **[ext_oneapi_level_zero:gpu]**. Below is example of such output detecting an *intel Iris Xe* GPU as a Level-zero SYCL device:
|
||||
|
||||
Output (example):
|
||||
```
|
||||
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2 [2023.16.10.0.17_160000]
|
||||
[opencl:cpu:1] Intel(R) OpenCL, 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz OpenCL 3.0 (Build 0) [2023.16.10.0.17_160000]
|
||||
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Iris(R) Xe Graphics OpenCL 3.0 NEO [31.0.101.5186]
|
||||
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Iris(R) Xe Graphics 1.3 [1.3.28044]
|
||||
```
|
||||
|
||||
3. Install build tools
|
||||
|
||||
a. Download & install cmake for Windows: https://cmake.org/download/ (CMake can also be installed from Visual Studio Installer)
|
||||
b. The new Visual Studio will install Ninja as default. (If not, please install it manually: https://ninja-build.org/)
|
||||
|
||||
|
||||
#### II. Build llama.cpp
|
||||
|
||||
You could download the release package for Windows directly, which including binary files and depended oneAPI dll files.
|
||||
|
||||
Choose one of following methods to build from source code.
|
||||
|
||||
##### Option 1: Script
|
||||
|
||||
```sh
|
||||
.\examples\sycl\win-build-sycl.bat
|
||||
```
|
||||
|
||||
##### Option 2: CMake
|
||||
|
||||
On the oneAPI command line window, step into the llama.cpp main directory and run the following:
|
||||
|
||||
```
|
||||
@call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" intel64 --force
|
||||
|
||||
# Option 1: Use FP32 (recommended for better performance in most cases)
|
||||
cmake -B build -G "Ninja" -DGGML_SYCL=ON -DCMAKE_C_COMPILER=cl -DCMAKE_CXX_COMPILER=icx -DCMAKE_BUILD_TYPE=Release
|
||||
|
||||
# Option 2: Or FP16
|
||||
cmake -B build -G "Ninja" -DGGML_SYCL=ON -DCMAKE_C_COMPILER=cl -DCMAKE_CXX_COMPILER=icx -DCMAKE_BUILD_TYPE=Release -DGGML_SYCL_F16=ON
|
||||
|
||||
cmake --build build --config Release -j
|
||||
```
|
||||
|
||||
Or, use CMake presets to build:
|
||||
|
||||
```sh
|
||||
cmake --preset x64-windows-sycl-release
|
||||
cmake --build build-x64-windows-sycl-release -j --target llama-completion
|
||||
|
||||
cmake -DGGML_SYCL_F16=ON --preset x64-windows-sycl-release
|
||||
cmake --build build-x64-windows-sycl-release -j --target llama-completion
|
||||
|
||||
cmake --preset x64-windows-sycl-debug
|
||||
cmake --build build-x64-windows-sycl-debug -j --target llama-completion
|
||||
```
|
||||
|
||||
##### Option 3: Visual Studio
|
||||
|
||||
You have two options to use Visual Studio to build llama.cpp:
|
||||
- As CMake Project using CMake presets.
|
||||
- Creating a Visual Studio solution to handle the project.
|
||||
|
||||
**Note**:
|
||||
|
||||
All following commands are executed in PowerShell.
|
||||
|
||||
###### - Open as a CMake Project
|
||||
|
||||
You can use Visual Studio to open the `llama.cpp` folder directly as a CMake project. Before compiling, select one of the SYCL CMake presets:
|
||||
|
||||
- `x64-windows-sycl-release`
|
||||
|
||||
- `x64-windows-sycl-debug`
|
||||
|
||||
*Notes:*
|
||||
- For a minimal experimental setup, you can build only the inference executable using:
|
||||
|
||||
```Powershell
|
||||
cmake --build build --config Release -j --target llama-completion
|
||||
```
|
||||
|
||||
###### - Generating a Visual Studio Solution
|
||||
|
||||
You can use Visual Studio solution to build and work on llama.cpp on Windows. You need to convert the CMake Project into a `.sln` file.
|
||||
|
||||
If you want to use the Intel C++ Compiler for the entire `llama.cpp` project, run the following command:
|
||||
|
||||
```Powershell
|
||||
cmake -B build -G "Visual Studio 17 2022" -T "Intel C++ Compiler 2025" -A x64 -DGGML_SYCL=ON -DCMAKE_BUILD_TYPE=Release
|
||||
```
|
||||
|
||||
If you prefer to use the Intel C++ Compiler only for `ggml-sycl`, ensure that `ggml` and its backend libraries are built as shared libraries ( i.e. `-DBUILD_SHARED_LIBRARIES=ON`, this is default behaviour):
|
||||
|
||||
```Powershell
|
||||
cmake -B build -G "Visual Studio 17 2022" -A x64 -DGGML_SYCL=ON -DCMAKE_BUILD_TYPE=Release \
|
||||
-DSYCL_INCLUDE_DIR="C:\Program Files (x86)\Intel\oneAPI\compiler\latest\include" \
|
||||
-DSYCL_LIBRARY_DIR="C:\Program Files (x86)\Intel\oneAPI\compiler\latest\lib"
|
||||
```
|
||||
|
||||
If successful the build files have been written to: *path/to/llama.cpp/build*
|
||||
Open the project file **build/llama.cpp.sln** with Visual Studio.
|
||||
|
||||
Once the Visual Studio solution is created, follow these steps:
|
||||
|
||||
1. Open the solution in Visual Studio.
|
||||
|
||||
2. Right-click on `ggml-sycl` and select **Properties**.
|
||||
|
||||
3. In the left column, expand **C/C++** and select **DPC++**.
|
||||
|
||||
4. In the right panel, find **Enable SYCL Offload** and set it to `Yes`.
|
||||
|
||||
5. Apply the changes and save.
|
||||
|
||||
|
||||
*Navigation Path:*
|
||||
|
||||
```
|
||||
Properties -> C/C++ -> DPC++ -> Enable SYCL Offload (Yes)
|
||||
```
|
||||
|
||||
Now, you can build `llama.cpp` with the SYCL backend as a Visual Studio project.
|
||||
To do it from menu: `Build -> Build Solution`.
|
||||
Once it is completed, final results will be in **build/Release/bin**
|
||||
|
||||
*Additional Note*
|
||||
|
||||
- You can avoid specifying `SYCL_INCLUDE_DIR` and `SYCL_LIBRARY_DIR` in the CMake command by setting the environment variables:
|
||||
|
||||
- `SYCL_INCLUDE_DIR_HINT`
|
||||
|
||||
- `SYCL_LIBRARY_DIR_HINT`
|
||||
|
||||
- Above instruction has been tested with Visual Studio 17 Community edition and oneAPI 2025.0. We expect them to work also with future version if the instructions are adapted accordingly.
|
||||
|
||||
### III. Run the inference
|
||||
|
||||
#### Retrieve and prepare model
|
||||
|
||||
You can refer to the general [*Obtaining and quantizing models*](../../README.md#obtaining-and-quantizing-models) guide for model preparation, or download an already quantized model like [llama-2-7b.Q4_0.gguf](https://huggingface.co/TheBloke/Llama-2-7B-GGUF/blob/main/llama-2-7b.Q4_0.gguf) or [Meta-Llama-3-8B-Instruct-Q4_0.gguf](https://huggingface.co/aptha/Meta-Llama-3-8B-Instruct-Q4_0-GGUF/resolve/main/Meta-Llama-3-8B-Instruct-Q4_0.gguf).
|
||||
|
||||
##### Check device
|
||||
|
||||
1. Enable oneAPI running environment
|
||||
|
||||
On the oneAPI command line window, run the following and step into the llama.cpp directory:
|
||||
```
|
||||
"C:\Program Files (x86)\Intel\oneAPI\setvars.bat" intel64
|
||||
```
|
||||
|
||||
2. List devices information
|
||||
|
||||
Similar to the native `sycl-ls`, available SYCL devices can be queried as follow:
|
||||
|
||||
```
|
||||
build\bin\llama-ls-sycl-device.exe
|
||||
```
|
||||
|
||||
This command will only display the selected backend that is supported by SYCL. The default backend is level_zero. For example, in a system with 2 *Intel GPU* it would look like the following:
|
||||
```
|
||||
found 2 SYCL devices:
|
||||
| | | |Compute |Max compute|Max work|Max sub| |
|
||||
|ID| Device Type| Name|capability|units |group |group |Global mem size|
|
||||
|--|------------------|---------------------------------------------|----------|-----------|--------|-------|---------------|
|
||||
| 0|[level_zero:gpu:0]| Intel(R) Arc(TM) A770 Graphics| 1.3| 512| 1024| 32| 16225243136|
|
||||
| 1|[level_zero:gpu:1]| Intel(R) UHD Graphics 770| 1.3| 32| 512| 32| 53651849216|
|
||||
|
||||
```
|
||||
|
||||
##### Choose level-zero devices
|
||||
|
||||
|Chosen Device ID|Setting|
|
||||
|-|-|
|
||||
|0|Default option. You may also want to `set ONEAPI_DEVICE_SELECTOR="level_zero:0"`|
|
||||
|1|`set ONEAPI_DEVICE_SELECTOR="level_zero:1"`|
|
||||
|0 & 1|`set ONEAPI_DEVICE_SELECTOR="level_zero:0;level_zero:1"` or `set ONEAPI_DEVICE_SELECTOR="level_zero:*"`|
|
||||
|
||||
##### Execute
|
||||
|
||||
Choose one of following methods to run.
|
||||
|
||||
1. Script
|
||||
|
||||
- Run test:
|
||||
|
||||
```
|
||||
examples\sycl\win-test.bat
|
||||
```
|
||||
|
||||
- Run llama-server:
|
||||
|
||||
```
|
||||
examples\sycl\win-start-svr.bat -m PATH\MODEL_FILE
|
||||
```
|
||||
|
||||
2. Command line
|
||||
|
||||
Launch inference
|
||||
|
||||
There are two device selection modes:
|
||||
|
||||
- Single device: Use one device assigned by user. Default device id is 0.
|
||||
- Multiple devices: Automatically choose the devices with the same backend.
|
||||
|
||||
In two device selection modes, the default SYCL backend is level_zero, you can choose other backend supported by SYCL by setting environment variable ONEAPI_DEVICE_SELECTOR.
|
||||
|
||||
| Device selection | Parameter |
|
||||
|------------------|----------------------------------------|
|
||||
| Single device | --split-mode none --main-gpu DEVICE_ID |
|
||||
| Multiple devices | --split-mode layer (default) |
|
||||
|
||||
Examples:
|
||||
|
||||
- Use device 0:
|
||||
|
||||
```
|
||||
build\bin\llama-completion.exe -no-cnv -m models\llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e -ngl 99 -sm none -mg 0 --mmap
|
||||
```
|
||||
|
||||
- Use multiple devices:
|
||||
|
||||
```
|
||||
build\bin\llama-completion.exe -no-cnv -m models\llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e -ngl 99 -sm layer --mmap
|
||||
```
|
||||
|
||||
|
||||
Note:
|
||||
|
||||
- Upon execution, verify the selected device(s) ID(s) in the output log, which can for instance be displayed as follow:
|
||||
|
||||
```sh
|
||||
detect 1 SYCL GPUs: [0] with top Max compute units:512
|
||||
```
|
||||
|
||||
Or
|
||||
|
||||
```sh
|
||||
use 1 SYCL GPUs: [0] with Max compute units:512
|
||||
```
|
||||
|
||||
|
||||
## Environment Variable
|
||||
|
||||
### Build
|
||||
|
||||
| Name | Value | Function |
|
||||
|--------------------|---------------------------------------|---------------------------------------------|
|
||||
| GGML_SYCL | ON (mandatory) | Enable build with SYCL code path. |
|
||||
| GGML_SYCL_TARGET | INTEL *(default)* | Set the SYCL target device type. |
|
||||
| GGML_SYCL_DEVICE_ARCH | Optional | Set the SYCL device architecture. Setting the device architecture can improve the performance. See the table [--offload-arch](https://github.com/intel/llvm/blob/sycl/sycl/doc/design/OffloadDesign.md#--offload-arch) for a list of valid architectures. |
|
||||
| GGML_SYCL_F16 | OFF *(default)* \|ON *(optional)* | Enable FP16 build with SYCL code path. (1.) |
|
||||
| GGML_SYCL_GRAPH | OFF *(default)* \|ON *(Optional)* | Enable build with [SYCL Graph extension](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_graph.asciidoc). |
|
||||
| GGML_SYCL_DNN | ON *(default)* \|OFF *(Optional)* | Enable build with oneDNN. |
|
||||
| GGML_SYCL_HOST_MEM_FALLBACK | ON *(default)* \|OFF *(Optional)* | Allow host memory fallback when device memory is full during quantized weight reorder. Enables inference to continue at reduced speed (reading over PCIe) instead of failing. Requires Linux kernel 6.8+. |
|
||||
| CMAKE_C_COMPILER | `icx` *(Linux)*, `icx/cl` *(Windows)* | Set `icx` compiler for SYCL code path. |
|
||||
| CMAKE_CXX_COMPILER | `icpx` *(Linux)*, `icx` *(Windows)* | Set `icpx/icx` compiler for SYCL code path. |
|
||||
|
||||
1. FP32 or FP16 have different performance impact to LLM. Recommended to test them for better prompt processing performance on your models. You need to rebuild the code after change `GGML_SYCL_F16=OFF/ON`.
|
||||
|
||||
### Runtime
|
||||
|
||||
| Name | Value | Function |
|
||||
|-------------------|------------------|---------------------------------------------------------------------------------------------------------------------------|
|
||||
| GGML_SYCL_DEBUG | 0 (default) or 1 | Enable log function by macro: GGML_SYCL_DEBUG |
|
||||
| GGML_SYCL_ENABLE_FLASH_ATTN | 1 (default) or 0| Enable Flash-Attention. It can reduce memory usage. The performance impact depends on the LLM.|
|
||||
| GGML_SYCL_DISABLE_OPT | 0 (default) or 1 | Disable optimize features for Intel GPUs. (Recommended to 1 for intel devices older than Gen 10) |
|
||||
| GGML_SYCL_DISABLE_GRAPH | 0 or 1 (default) | Disable running computations through SYCL Graphs feature. Disabled by default because SYCL Graph is still on development, no better performance. |
|
||||
| GGML_SYCL_DISABLE_DNN | 0 (default) or 1 | Disable running computations through oneDNN and always use oneMKL. |
|
||||
| ZES_ENABLE_SYSMAN | 0 (default) or 1 | Support to get free memory of GPU by sycl::aspect::ext_intel_free_memory.<br>Recommended to use when --split-mode = layer |
|
||||
| UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS | 0 (default) or 1 | Support malloc device memory more than 4GB.|
|
||||
|
||||
## Compile-time Flags
|
||||
|
||||
Pass these via `CXXFLAGS` or add a one-off `#define` to enable a flag on the spot.
|
||||
|
||||
| Name | Function |
|
||||
|-----------------|----------------------------------------------------------------------------------|
|
||||
| DEBUG_SYCL_POOL | Enable device memory pool logging on teardown. Useful for profiling allocations. |
|
||||
|
||||
## Design Rule
|
||||
|
||||
- Open to all contributors.
|
||||
|
||||
- All code change should be useful to user:
|
||||
- Fix bug.
|
||||
- Add new function.
|
||||
- Improve the performance/usage.
|
||||
- Make code be easy to maintain.
|
||||
- ...
|
||||
|
||||
- Don't accept the codes of following cases:
|
||||
- Break legacy function.
|
||||
- Reduce the performance of legacy case in default.
|
||||
- Not completed work/the functionality cannot be demonstrated.
|
||||
|
||||
- Encourage to use environment variable to control features to be opened/closed.
|
||||
- User can evaluate the feature without rebuild the code.
|
||||
- Recommend the best features to user by setting them be opened as default.
|
||||
|
||||
- Design the code based on the published official releases of oneAPI packages: compiler, library, driver, OS kernel.
|
||||
|
||||
- Developers need to maintain the code they submit.
|
||||
|
||||
## Known Issues
|
||||
|
||||
- `Split-mode:[row]` is not supported.
|
||||
|
||||
- Missed the AOT (Ahead-of-Time) in buiding.
|
||||
- Good: build quickly, smaller size of binary file.
|
||||
- Bad: The startup is slow (JIT) in first time, but subsequent performance is unaffected.
|
||||
|
||||
## Q&A
|
||||
|
||||
- Error: `error while loading shared libraries: libsycl.so: cannot open shared object file: No such file or directory`.
|
||||
|
||||
- Potential cause: Unavailable oneAPI installation or not set ENV variables.
|
||||
- Solution: Install *oneAPI base toolkit* and enable its ENV through: `source /opt/intel/oneapi/setvars.sh`.
|
||||
|
||||
- General compiler error:
|
||||
|
||||
- Remove **build** folder or try a clean-build.
|
||||
|
||||
- I can **not** see `[ext_oneapi_level_zero:gpu]` after installing the GPU driver on Linux.
|
||||
|
||||
Please double-check with `sudo sycl-ls`.
|
||||
|
||||
If it's present in the list, please add video/render group to your user then **logout/login** or restart your system:
|
||||
|
||||
```
|
||||
sudo usermod -aG render $USER
|
||||
sudo usermod -aG video $USER
|
||||
```
|
||||
Otherwise, please double-check the GPU driver installation steps.
|
||||
|
||||
- Can I report Ollama issue on Intel GPU to llama.cpp SYCL backend?
|
||||
|
||||
No. We can't support Ollama issue directly, because we aren't familiar with Ollama.
|
||||
|
||||
Suggest reproducing on llama.cpp and report similar issue to llama.cpp. We will support it.
|
||||
|
||||
It's same for other projects including llama.cpp SYCL backend.
|
||||
|
||||
- `Native API failed. Native API returns: 39 (UR_RESULT_ERROR_OUT_OF_DEVICE_MEMORY)`, `ggml_backend_sycl_buffer_type_alloc_buffer: can't allocate 3503030272 Bytes of memory on device`, or `failed to allocate SYCL0 buffer`
|
||||
|
||||
You are running out of Device Memory.
|
||||
|
||||
|Reason|Solution|
|
||||
|-|-|
|
||||
| The default context is too big. It leads to excessive memory usage.|Set `-c 8192` or a smaller value.|
|
||||
| The model is too big and requires more memory than what is available.|Choose a smaller model or change to a smaller quantization, like Q5 -> Q4;<br>Alternatively, use more than one device to load model.|
|
||||
|
||||
- `ggml_backend_sycl_buffer_type_alloc_buffer: can't allocate 5000000000 Bytes of memory on device`
|
||||
|
||||
You need to enable to support 4GB memory malloc by:
|
||||
```
|
||||
export UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS=1
|
||||
set UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS=1
|
||||
```
|
||||
|
||||
### **GitHub contribution**:
|
||||
Please add the `[SYCL]` prefix/tag in issues/PRs titles to help the SYCL contributors to check/address them without delay.
|
||||
|
||||
## TODO
|
||||
|
||||
- Review ZES_ENABLE_SYSMAN: https://github.com/intel/compute-runtime/blob/master/programmers-guide/SYSMAN.md#support-and-limitations
|
||||
182
docs/backend/VirtGPU.md
Normal file
182
docs/backend/VirtGPU.md
Normal file
@@ -0,0 +1,182 @@
|
||||
# GGML-VirtGPU Backend
|
||||
|
||||
The GGML-VirtGPU backend enables GGML applications to run machine
|
||||
learning computations on host hardware while the application itself
|
||||
runs inside a virtual machine. It uses host-guest shared memory to
|
||||
efficiently share data buffers between the two sides.
|
||||
|
||||
This backend relies on the virtio-gpu, and VirglRenderer API Remoting
|
||||
(APIR) component. The backend is split into two libraries:
|
||||
- a GGML implementation (the "remoting frontend"), running in the
|
||||
guest and interacting with the virtgpu device
|
||||
- a VirglRenderer APIR compatible library (the "remoting backend"),
|
||||
running in the host and interacting with Virglrenderer and an actual
|
||||
GGML device backend.
|
||||
|
||||
## OS support
|
||||
|
||||
| OS | Status | Backend | CI testing | Notes
|
||||
| -------- | ----------------- | ----------- | ----------- | -----
|
||||
| MacOS 14 | Supported | ggml-metal | X | Working when compiled on MacOS 14
|
||||
| MacOS 15 | Supported | ggml-metal | X | Working when compiled on MacOS 14 or MacOS 15
|
||||
| MacOS 26 | Not tested | | |
|
||||
| Linux | Under development | ggml-vulkan | not working | Working locally, CI running into deadlocks
|
||||
|
||||
|
||||
## Architecture Overview
|
||||
|
||||
The GGML-VirtGPU backend consists of three main components:
|
||||
|
||||
```mermaid
|
||||
graph TD
|
||||
%% Nodes
|
||||
|
||||
subgraph GuestVM ["Guest VM - Frontend"]
|
||||
App([GGML Application<br/>llama.cpp, etc.])
|
||||
|
||||
direction TB
|
||||
Interface[GGML Backend Interface]
|
||||
Comm["GGML-VirtGPU<br/>(hypercalls + shared mem)"]
|
||||
|
||||
App --> Interface
|
||||
Interface --> Comm
|
||||
end
|
||||
|
||||
API[virtio-gpu / virglrenderer API]
|
||||
|
||||
subgraph HostSystem [Host System - Backend]
|
||||
direction TB
|
||||
Dispatcher[GGML-VirtGPU-Backend]
|
||||
BackendLib[GGML Backend library<br/>Metal / Vulkan / CPU / ...]
|
||||
|
||||
Dispatcher --> BackendLib
|
||||
end
|
||||
|
||||
%% Connections
|
||||
Comm --> API
|
||||
API --> HostSystem
|
||||
```
|
||||
|
||||
### Key Components
|
||||
|
||||
1. **Guest-side Frontend** (`ggml-virtgpu/`): Implements the GGML backend interface and forwards operations to the host
|
||||
2. **Host-side Backend** (`ggml-virtgpu/backend/`): Receives forwarded operations and executes them on actual hardware backends
|
||||
3. **Communication Layer**: Uses virtio-gpu hypercalls and shared memory for efficient data transfer
|
||||
|
||||
## Features
|
||||
|
||||
- **Dynamic backend loading** on the host side (CPU, CUDA, Metal, etc.)
|
||||
- **Zero-copy data transfer** via host-guest shared memory pages
|
||||
|
||||
## Communication Protocol
|
||||
|
||||
### Hypercalls and Shared Memory
|
||||
|
||||
The backend uses two primary communication mechanisms:
|
||||
|
||||
1. **Hypercalls (`DRM_IOCTL_VIRTGPU_EXECBUFFER`)**: Trigger remote execution from guest to host
|
||||
2. **Shared Memory Pages**: Zero-copy data transfer for tensors and parameters
|
||||
|
||||
#### Shared Memory Layout
|
||||
|
||||
Each connection uses two shared memory buffers:
|
||||
|
||||
- **Data Buffer** (24 MiB): For command/response data and tensor transfers
|
||||
- **Reply Buffer** (16 KiB): For command replies and status information
|
||||
- **Data Buffers**: Dynamically allocated host-guest shared buffers
|
||||
served as GGML buffers.
|
||||
|
||||
### APIR Protocol
|
||||
|
||||
The Virglrender API Remoting protocol defines three command types:
|
||||
|
||||
- `HANDSHAKE`: Protocol version negotiation and capability discovery
|
||||
- `LOADLIBRARY`: Dynamic loading of backend libraries on the host
|
||||
- `FORWARD`: API function call forwarding
|
||||
|
||||
### Binary Serialization
|
||||
|
||||
Commands and data are serialized using a custom binary protocol with:
|
||||
|
||||
- Fixed-size encoding for basic types
|
||||
- Variable-length arrays with size prefixes
|
||||
- Buffer bounds checking
|
||||
- Error recovery mechanisms
|
||||
|
||||
## Supported Operations
|
||||
|
||||
### Device Operations
|
||||
- Device enumeration and capability queries
|
||||
- Memory information (total/free)
|
||||
- Backend type detection
|
||||
|
||||
### Buffer Operations
|
||||
- Buffer allocation and deallocation
|
||||
- Tensor data transfer (host ↔ guest)
|
||||
- Memory copying and clearing
|
||||
|
||||
### Computation Operations
|
||||
- Graph execution forwarding
|
||||
|
||||
## Build Requirements
|
||||
|
||||
### Guest-side Dependencies
|
||||
- `libdrm` for DRM/virtio-gpu communication
|
||||
- C++20 compatible compiler
|
||||
- CMake 3.14+
|
||||
|
||||
### Host-side Dependencies
|
||||
- virglrenderer with APIR support (pending upstream review)
|
||||
- Target backend libraries (libggml-metal, libggml-vulkan, etc.)
|
||||
|
||||
## Configuration
|
||||
|
||||
### Environment Variables
|
||||
|
||||
- `GGML_VIRTGPU_BACKEND_LIBRARY`: Path to the host-side backend library
|
||||
- `GGML_VIRTGPU_DEBUG`: Enable debug logging
|
||||
|
||||
### Build Options
|
||||
|
||||
- `GGML_VIRTGPU`: Enable the VirtGPU backend (`ON` or `OFF`, default: `OFF`)
|
||||
- `GGML_VIRTGPU_BACKEND`: Build the host-side backend component (`ON`, `OFF` or `ONLY`, default: `OFF`)
|
||||
|
||||
### System Requirements
|
||||
|
||||
- VM with virtio-gpu support
|
||||
- VirglRenderer with APIR patches
|
||||
- Compatible backend libraries on host
|
||||
|
||||
## Limitations
|
||||
|
||||
- **VM-specific**: Only works in virtual machines with virtio-gpu support
|
||||
- **Host dependency**: Requires properly configured host-side backend
|
||||
- **Latency**: Small overhead from VM escaping for each operation
|
||||
- **Shared-memory size**: with the `libkrun` hypervisor, the RAM + VRAM
|
||||
addressable memory is limited to 64 GB. So the maximum GPU memory
|
||||
will be `64GB - RAM`, regardless of the hardware VRAM size.
|
||||
|
||||
* This work is pending upstream changes in the VirglRenderer
|
||||
project.
|
||||
* The backend can be tested with Virglrenderer compiled from source
|
||||
using this PR:
|
||||
https://gitlab.freedesktop.org/virgl/virglrenderer/-/merge_requests/1590
|
||||
* This work is pending changes in the VMM/hypervisor running the
|
||||
virtual machine, which need to know how to route the newly
|
||||
introduced APIR capset.
|
||||
* The environment variable `VIRGL_ROUTE_VENUS_TO_APIR=1` allows
|
||||
using the Venus capset, until the relevant hypervisors have been
|
||||
patched. However, setting this flag breaks the Vulkan/Venus normal
|
||||
behavior.
|
||||
* The environment variable `GGML_REMOTING_USE_APIR_CAPSET` tells the
|
||||
`ggml-virtgpu` backend to use the APIR capset. This will become
|
||||
the default when the relevant hypervisors have been patched.
|
||||
|
||||
* This work focused on improving the performance of llama.cpp running
|
||||
on MacOS containers, and is mainly tested on this platform. The
|
||||
linux support (via `krun`) is in progress.
|
||||
|
||||
## See Also
|
||||
|
||||
- [Development and Testing](VirtGPU/development.md)
|
||||
- [Backend configuration](VirtGPU/configuration.md)
|
||||
174
docs/backend/VirtGPU/configuration.md
Normal file
174
docs/backend/VirtGPU/configuration.md
Normal file
@@ -0,0 +1,174 @@
|
||||
# GGML-VirtGPU Backend Configuration
|
||||
|
||||
This document describes the environment variables used by the ggml-virtgpu backend system, covering both the frontend (guest-side) and backend (host-side) components.
|
||||
|
||||
## Environment Variables Overview
|
||||
|
||||
The ggml-virtgpu backend uses environment variables for configuration across three main components:
|
||||
- **Frontend (Guest)**: GGML applications running in VMs
|
||||
- **Hypervisor**: Virglrenderer/APIR system
|
||||
- **Backend (Host)**: Host-side GGML backend integration
|
||||
|
||||
## Frontend (Guest-side) Configuration
|
||||
|
||||
### GGML_REMOTING_USE_APIR_CAPSET
|
||||
- **Location**: `ggml/src/ggml-virtgpu/virtgpu.cpp`
|
||||
- **Type**: Boolean flag (presence-based)
|
||||
- **Purpose**: Controls which virtio-gpu capability set to use for communication
|
||||
- **Values**:
|
||||
- Set (any value): Use the APIR capset (long-term setup)
|
||||
- Unset: Use the Venus capset (easier for testing with an unmodified hypervisor)
|
||||
- **Default**: Unset (Venus capset)
|
||||
- **Usage**:
|
||||
```bash
|
||||
export GGML_REMOTING_USE_APIR_CAPSET=1 # Use APIR capset
|
||||
# or leave unset for Venus capset
|
||||
```
|
||||
|
||||
## Hypervisor (Virglrenderer/APIR) Configuration
|
||||
|
||||
These environment variables are used during the transition phase for
|
||||
running with an unmodified hypervisor (not supporting the
|
||||
VirglRenderer APIR component). They will be removed in the future, and
|
||||
the hypervisor will instead configure VirglRenderer with the APIR
|
||||
_Configuration Key_.
|
||||
|
||||
### VIRGL_APIR_BACKEND_LIBRARY
|
||||
- **Location**: `virglrenderer/src/apir/apir-context.c`
|
||||
- **Configuration Key**: `apir.load_library.path`
|
||||
- **Type**: File path string
|
||||
- **Purpose**: Path to the APIR backend library that virglrenderer should dynamically load
|
||||
- **Required**: Yes
|
||||
- **Example**:
|
||||
```bash
|
||||
export VIRGL_APIR_BACKEND_LIBRARY="/path/to/libggml-remotingbackend.so"
|
||||
```
|
||||
|
||||
### VIRGL_ROUTE_VENUS_TO_APIR
|
||||
- **Location**: `virglrenderer/src/apir/apir-renderer.h`
|
||||
- **Type**: Boolean flag (presence-based)
|
||||
- **Purpose**: Temporary workaround to route Venus capset calls to APIR during hypervisor transition period
|
||||
- **Status**: will be removed once hypervisors support APIR natively
|
||||
- **Warning**: Breaks normal Vulkan/Venus functionality
|
||||
- **Usage**:
|
||||
```bash
|
||||
export VIRGL_ROUTE_VENUS_TO_APIR=1 # For testing with an unmodified hypervisor
|
||||
```
|
||||
|
||||
### VIRGL_APIR_LOG_TO_FILE
|
||||
- **Location**: `virglrenderer/src/apir/apir-renderer.c`
|
||||
- **Environment Variable**: `VIRGL_APIR_LOG_TO_FILE`
|
||||
- **Type**: File path string
|
||||
- **Purpose**: Enable debug logging from the VirglRenderer APIR component to specified file
|
||||
- **Required**: No (optional debugging)
|
||||
- **Default**: Logging to `stderr`
|
||||
- **Usage**:
|
||||
```bash
|
||||
export VIRGL_APIR_LOG_TO_FILE="/tmp/apir-debug.log"
|
||||
```
|
||||
|
||||
## Backend (Host-side) Configuration
|
||||
|
||||
These environment variables are used during the transition phase for
|
||||
running with an unmodified hypervisor (not supporting the
|
||||
VirglRenderer APIR component). They will be removed in the future, and
|
||||
the hypervisor will instead configure VirglRenderer with the APIR
|
||||
_Configuration Key_.
|
||||
|
||||
### APIR_LLAMA_CPP_GGML_LIBRARY_PATH
|
||||
- **Location**: `ggml/src/ggml-virtgpu/backend/backend.cpp`
|
||||
- **Environment Variable**: `APIR_LLAMA_CPP_GGML_LIBRARY_PATH`
|
||||
- **Configuration Key**: `ggml.library.path`
|
||||
- **Type**: File path string
|
||||
- **Purpose**: Path to the actual GGML backend library (Metal, CUDA, Vulkan, etc.)
|
||||
- **Required**: **Yes** - backend initialization fails without this
|
||||
- **Examples**:
|
||||
```bash
|
||||
# macOS with Metal backend
|
||||
export APIR_LLAMA_CPP_GGML_LIBRARY_PATH="/opt/llama.cpp/lib/libggml-metal.dylib"
|
||||
|
||||
# Linux with CUDA backend
|
||||
export APIR_LLAMA_CPP_GGML_LIBRARY_PATH="/opt/llama.cpp/lib/libggml-cuda.so"
|
||||
|
||||
# macOS or Linux with Vulkan backend
|
||||
export APIR_LLAMA_CPP_GGML_LIBRARY_PATH="/opt/llama.cpp/lib/libggml-vulkan.so"
|
||||
```
|
||||
|
||||
### APIR_LLAMA_CPP_GGML_LIBRARY_REG
|
||||
- **Location**: `ggml/src/ggml-virtgpu/backend/backend.cpp`
|
||||
- **Environment Variable**: `APIR_LLAMA_CPP_GGML_LIBRARY_REG`
|
||||
- **Configuration Key**: `ggml.library.reg`
|
||||
- **Type**: Function symbol name string
|
||||
- **Purpose**: Name of the backend registration function to call after loading the library
|
||||
- **Required**: No (defaults to `ggml_backend_init`)
|
||||
- **Default**: `ggml_backend_init`
|
||||
- **Examples**:
|
||||
```bash
|
||||
# Metal backend
|
||||
export APIR_LLAMA_CPP_GGML_LIBRARY_REG="ggml_backend_metal_reg"
|
||||
|
||||
# CUDA backend
|
||||
export APIR_LLAMA_CPP_GGML_LIBRARY_REG="ggml_backend_cuda_reg"
|
||||
|
||||
# Vulkan backend
|
||||
export APIR_LLAMA_CPP_GGML_LIBRARY_REG="ggml_backend_vulkan_reg"
|
||||
|
||||
# Generic fallback (default)
|
||||
# export APIR_LLAMA_CPP_GGML_LIBRARY_REG="ggml_backend_init"
|
||||
```
|
||||
|
||||
### APIR_LLAMA_CPP_LOG_TO_FILE
|
||||
- **Location**: `ggml/src/ggml-virtgpu/backend/backend.cpp:62`
|
||||
- **Environment Variable**: `APIR_LLAMA_CPP_LOG_TO_FILE`
|
||||
- **Type**: File path string
|
||||
- **Purpose**: Enable debug logging from the GGML backend to specified file
|
||||
- **Required**: No (optional debugging)
|
||||
- **Usage**:
|
||||
```bash
|
||||
export APIR_LLAMA_CPP_LOG_TO_FILE="/tmp/ggml-backend-debug.log"
|
||||
```
|
||||
|
||||
## Configuration Flow
|
||||
|
||||
The configuration system works as follows:
|
||||
|
||||
1. **Hypervisor Setup**: Virglrenderer loads the APIR backend library specified by `VIRGL_APIR_BACKEND_LIBRARY`
|
||||
|
||||
2. **Context Creation**: When an APIR context is created, it populates a configuration table with environment variables:
|
||||
- `apir.load_library.path` ← `VIRGL_APIR_BACKEND_LIBRARY`
|
||||
- `ggml.library.path` ← `APIR_LLAMA_CPP_GGML_LIBRARY_PATH`
|
||||
- `ggml.library.reg` ← `APIR_LLAMA_CPP_GGML_LIBRARY_REG`
|
||||
- this step will eventually be performed by the hypervisor itself, with command-line arguments instead of environment variables.
|
||||
|
||||
3. **Backend Initialization**: The backend queries the configuration via callbacks:
|
||||
- `virgl_cbs->get_config(ctx_id, "ggml.library.path")` returns the library path
|
||||
- `virgl_cbs->get_config(ctx_id, "ggml.library.reg")` returns the registration function
|
||||
|
||||
4. **Library Loading**: The backend dynamically loads and initializes the specified GGML library
|
||||
|
||||
## Error Messages
|
||||
|
||||
Common error scenarios and their messages:
|
||||
|
||||
- **Missing library path**: `"cannot open the GGML library: env var 'APIR_LLAMA_CPP_GGML_LIBRARY_PATH' not defined"`
|
||||
- **Missing registration function**: `"cannot register the GGML library: env var 'APIR_LLAMA_CPP_GGML_LIBRARY_REG' not defined"`
|
||||
|
||||
## Example Complete Configuration
|
||||
|
||||
Here's an example configuration for a macOS host with Metal backend:
|
||||
|
||||
```bash
|
||||
# Hypervisor environment
|
||||
export VIRGL_APIR_BACKEND_LIBRARY="/opt/llama.cpp/lib/libggml-virtgpu-backend.dylib"
|
||||
|
||||
# Backend configuration
|
||||
export APIR_LLAMA_CPP_GGML_LIBRARY_PATH="/opt/llama.cpp/lib/libggml-metal.dylib"
|
||||
export APIR_LLAMA_CPP_GGML_LIBRARY_REG="ggml_backend_metal_reg"
|
||||
|
||||
# Optional logging
|
||||
export VIRGL_APIR_LOG_TO_FILE="/tmp/apir.log"
|
||||
export APIR_LLAMA_CPP_LOG_TO_FILE="/tmp/ggml.log"
|
||||
|
||||
# Guest configuration
|
||||
export GGML_REMOTING_USE_APIR_CAPSET=1
|
||||
```
|
||||
222
docs/backend/VirtGPU/development.md
Normal file
222
docs/backend/VirtGPU/development.md
Normal file
@@ -0,0 +1,222 @@
|
||||
# Development and Testing
|
||||
|
||||
## Development
|
||||
|
||||
### Code Generation
|
||||
|
||||
The backend uses code generation from YAML configuration:
|
||||
|
||||
```bash
|
||||
# Regenerate protocol code
|
||||
cd ggml-virtgpu/
|
||||
python regenerate_remoting.py
|
||||
```
|
||||
|
||||
### Adding New Operations
|
||||
|
||||
1. Add function definition to `ggmlremoting_functions.yaml`
|
||||
2. Regenerate code with `regenerate_remoting.py`
|
||||
3. Implement guest-side forwarding in `virtgpu-forward-*.cpp`
|
||||
4. Implement host-side handling in `backend-dispatched-*.cpp`
|
||||
|
||||
## Testing
|
||||
|
||||
This document provides instructions for building and testing the GGML-VirtGPU backend on macOS with containers.
|
||||
|
||||
### Prerequisites
|
||||
|
||||
The testing setup requires:
|
||||
|
||||
- macOS host system
|
||||
- Container runtime with `libkrun` provider (podman machine)
|
||||
- Access to development patchset for VirglRenderer
|
||||
|
||||
### Required Patchsets
|
||||
|
||||
The backend requires patches that are currently under review:
|
||||
|
||||
- **Virglrenderer APIR upstream PR**: https://gitlab.freedesktop.org/virgl/virglrenderer/-/merge_requests/1590 (for reference)
|
||||
- **MacOS Virglrenderer (for krunkit)**: https://gitlab.freedesktop.org/kpouget/virglrenderer/-/tree/main-macos
|
||||
- **Linux Virglrenderer (for krun)**: https://gitlab.freedesktop.org/kpouget/virglrenderer/-/tree/main-linux
|
||||
|
||||
### Build Instructions
|
||||
|
||||
#### 1. Build ggml-virtgpu-backend (Host-side, macOS)
|
||||
|
||||
```bash
|
||||
# Build the backend that runs natively on macOS
|
||||
mkdir llama.cpp
|
||||
cd llama.cpp
|
||||
git clone https://github.com/ggml-org/llama.cpp.git src
|
||||
cd src
|
||||
|
||||
LLAMA_MAC_BUILD=$PWD/build/ggml-virtgpu-backend
|
||||
|
||||
cmake -S . -B $LLAMA_MAC_BUILD \
|
||||
-DGGML_NATIVE=OFF \
|
||||
-DLLAMA_CURL=ON \
|
||||
-DGGML_VIRTGPU=ON \
|
||||
-DGGML_VIRTGPU_BACKEND=ONLY \
|
||||
-DGGML_METAL=ON
|
||||
|
||||
TARGETS="ggml-metal"
|
||||
cmake --build $LLAMA_MAC_BUILD --parallel 8 --target $TARGETS
|
||||
|
||||
# Build additional tools for native benchmarking
|
||||
EXTRA_TARGETS="llama-run llama-bench"
|
||||
cmake --build $LLAMA_MAC_BUILD --parallel 8 --target $EXTRA_TARGETS
|
||||
```
|
||||
|
||||
#### 2. Build virglrenderer (Host-side, macOS)
|
||||
|
||||
```bash
|
||||
# Build virglrenderer with APIR support
|
||||
mkdir virglrenderer
|
||||
cd virglrenderer
|
||||
git clone https://gitlab.freedesktop.org/kpouget/virglrenderer -b main-macos src
|
||||
cd src
|
||||
|
||||
VIRGL_BUILD_DIR=$PWD/build
|
||||
|
||||
# -Dvenus=true and VIRGL_ROUTE_VENUS_TO_APIR=1 route the APIR requests via the Venus backend, for easier testing without a patched hypervisor
|
||||
|
||||
meson setup $VIRGL_BUILD_DIR \
|
||||
-Dvenus=true \
|
||||
-Dapir=true
|
||||
|
||||
ninja -C $VIRGL_BUILD_DIR
|
||||
```
|
||||
|
||||
#### 3. Build ggml-virtgpu (Guest-side, Linux)
|
||||
|
||||
Option A: Build from a script:
|
||||
|
||||
```bash
|
||||
# Inside a Linux container
|
||||
mkdir llama.cpp
|
||||
git clone https://github.com/ggml-org/llama.cpp.git src
|
||||
cd src
|
||||
|
||||
LLAMA_LINUX_BUILD=$PWD/build-virtgpu
|
||||
|
||||
cmake -S . -B $LLAMA_LINUX_BUILD \
|
||||
-DGGML_VIRTGPU=ON
|
||||
|
||||
ninja -C $LLAMA_LINUX_BUILD
|
||||
```
|
||||
|
||||
Option B: Build container image with frontend:
|
||||
|
||||
```bash
|
||||
cat << EOF > remoting.containerfile
|
||||
FROM quay.io/fedora/fedora:43
|
||||
USER 0
|
||||
|
||||
WORKDIR /app/remoting
|
||||
|
||||
ARG LLAMA_CPP_REPO="https://github.com/ggml-org/llama.cpp.git"
|
||||
ARG LLAMA_CPP_VERSION="master"
|
||||
ARG LLAMA_CPP_CMAKE_FLAGS="-DGGML_VIRTGPU=ON"
|
||||
ARG LLAMA_CPP_CMAKE_BUILD_FLAGS="--parallel 4"
|
||||
|
||||
RUN dnf install -y git cmake gcc gcc-c++ libcurl-devel libdrm-devel
|
||||
|
||||
RUN git clone "\${LLAMA_CPP_REPO}" src \\
|
||||
&& git -C src fetch origin \${LLAMA_CPP_VERSION} \\
|
||||
&& git -C src reset --hard FETCH_HEAD
|
||||
|
||||
RUN mkdir -p build \\
|
||||
&& cd src \\
|
||||
&& set -o pipefail \\
|
||||
&& cmake -S . -B ../build \${LLAMA_CPP_CMAKE_FLAGS} \\
|
||||
&& cmake --build ../build/ \${LLAMA_CPP_CMAKE_BUILD_FLAGS}
|
||||
|
||||
ENTRYPOINT ["/app/remoting/src/build/bin/llama-server"]
|
||||
EOF
|
||||
|
||||
mkdir -p empty_dir
|
||||
podman build -f remoting.containerfile ./empty_dir -t localhost/llama-cpp.virtgpu
|
||||
```
|
||||
|
||||
### Environment Setup
|
||||
|
||||
#### Set krunkit Environment Variables
|
||||
|
||||
```bash
|
||||
# Define the base directories (adapt these paths to your system)
|
||||
VIRGL_BUILD_DIR=$HOME/remoting/virglrenderer/build
|
||||
LLAMA_MAC_BUILD=$HOME/remoting/llama.cpp/build-backend
|
||||
|
||||
# For krunkit to load the custom virglrenderer library
|
||||
export DYLD_LIBRARY_PATH=$VIRGL_BUILD_DIR/src
|
||||
|
||||
# For Virglrenderer to load the ggml-remotingbackend library
|
||||
export VIRGL_APIR_BACKEND_LIBRARY="$LLAMA_MAC_BUILD/bin/libggml-virtgpu-backend.dylib"
|
||||
|
||||
# For llama.cpp remotingbackend to load the ggml-metal backend
|
||||
export APIR_LLAMA_CPP_GGML_LIBRARY_PATH="$LLAMA_MAC_BUILD/bin/libggml-metal.dylib"
|
||||
export APIR_LLAMA_CPP_GGML_LIBRARY_REG=ggml_backend_metal_reg
|
||||
```
|
||||
|
||||
#### Launch Container Environment
|
||||
|
||||
```bash
|
||||
# Set container provider to libkrun
|
||||
export CONTAINERS_MACHINE_PROVIDER=libkrun
|
||||
podman machine start
|
||||
```
|
||||
|
||||
#### Verify Environment
|
||||
|
||||
Confirm that krunkit is using the correct virglrenderer library:
|
||||
|
||||
```bash
|
||||
lsof -c krunkit | grep virglrenderer
|
||||
# Expected output:
|
||||
# krunkit 50574 user txt REG 1,14 2273912 10849442 ($VIRGL_BUILD_DIR/src)/libvirglrenderer.1.dylib
|
||||
```
|
||||
|
||||
### Running Tests
|
||||
|
||||
#### Launch Test Container
|
||||
|
||||
```bash
|
||||
# Optional model caching
|
||||
mkdir -p models
|
||||
PODMAN_CACHE_ARGS="-v models:/models --user root:root --cgroupns host --security-opt label=disable -w /models"
|
||||
|
||||
podman run $PODMAN_CACHE_ARGS -it --rm --device /dev/dri localhost/llama-cpp.virtgpu
|
||||
```
|
||||
|
||||
#### Test llama.cpp in Container
|
||||
|
||||
```bash
|
||||
|
||||
# Run performance benchmark
|
||||
/app/remoting/build/bin/llama-bench -m ./llama3.2
|
||||
```
|
||||
|
||||
Expected output (performance may vary):
|
||||
```
|
||||
| model | size | params | backend | ngl | test | t/s |
|
||||
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
|
||||
| llama 3B Q4_K - Medium | 1.87 GiB | 3.21 B | ggml-virtgpu | 99 | pp512 | 991.30 ± 0.66 |
|
||||
| llama 3B Q4_K - Medium | 1.87 GiB | 3.21 B | ggml-virtgpu | 99 | tg128 | 85.71 ± 0.11 |
|
||||
```
|
||||
|
||||
### Troubleshooting
|
||||
|
||||
#### SSH Environment Variable Issues
|
||||
|
||||
⚠️ **Warning**: Setting `DYLD_LIBRARY_PATH` from SSH doesn't work on macOS. Here is a workaround:
|
||||
|
||||
**Workaround 1: Replace system library**
|
||||
```bash
|
||||
VIRGL_BUILD_DIR=$HOME/remoting/virglrenderer/build # ⚠️ adapt to your system
|
||||
BREW_VIRGL_DIR=/opt/homebrew/Cellar/virglrenderer/0.10.4d/lib
|
||||
VIRGL_LIB=libvirglrenderer.1.dylib
|
||||
|
||||
cd $BREW_VIRGL_DIR
|
||||
mv $VIRGL_LIB ${VIRGL_LIB}.orig
|
||||
ln -s $VIRGL_BUILD_DIR/src/$VIRGL_LIB
|
||||
```
|
||||
220
docs/backend/ZenDNN.md
Normal file
220
docs/backend/ZenDNN.md
Normal file
@@ -0,0 +1,220 @@
|
||||
# llama.cpp for AMD ZenDNN
|
||||
|
||||
> [!WARNING]
|
||||
> **Note:** ZenDNN is **not** the same as zDNN.
|
||||
> - **ZenDNN** (this page): AMD's deep learning library for AMD EPYC CPUs
|
||||
> - **zDNN**: IBM's Deep Neural Network acceleration library for IBM Z & LinuxONE Mainframes ([see zDNN documentation](zDNN.md))
|
||||
|
||||
- [Background](#background)
|
||||
- [OS](#os)
|
||||
- [Hardware](#hardware)
|
||||
- [Supported Operations](#supported-operations)
|
||||
- [DataType Supports](#datatype-supports)
|
||||
- [Linux](#linux)
|
||||
- [Environment Variable](#environment-variable)
|
||||
- [Performance Optimization](#performance-optimization)
|
||||
- [Known Issues](#known-issues)
|
||||
- [TODO](#todo)
|
||||
|
||||
## Background
|
||||
|
||||
**ZenDNN** (Zen Deep Neural Network Library) is AMD's high-performance deep learning inference library optimized for AMD EPYC™ CPUs. It provides optimized implementations of key deep learning primitives and operations, delivering significant performance improvements for neural network workloads on AMD Zen-based processor architectures.
|
||||
|
||||
**Llama.cpp + ZenDNN**
|
||||
|
||||
The llama.cpp ZenDNN backend leverages AMD's optimized matrix multiplication primitives to accelerate inference on AMD CPUs. It utilizes ZenDNN's **LowOHA (Low Overhead Hardware Accelerated)** MatMul operator for efficient GEMM operations with minimal execution overhead, built-in weight caching, and direct access to backend libraries (AOCL DLP, LibXSMM, OneDNN).
|
||||
|
||||
For more information about ZenDNN, visit: https://www.amd.com/en/developer/zendnn.html
|
||||
|
||||
## OS
|
||||
|
||||
| OS | Status | Verified |
|
||||
|:-------:|:-------:|:----------------------------------------------:|
|
||||
| Linux | Support | Ubuntu 20.04, 22.04, 24.04 |
|
||||
|
||||
For the latest list of supported operating systems, see the [ZenDNN Supported OS](https://github.com/amd/ZenDNN/blob/a18adf8c605fb5f5e52cefd7eda08a7b18febbaf/README.md#15-supported-os).
|
||||
|
||||
## Hardware
|
||||
|
||||
### AMD CPUs
|
||||
|
||||
**Recommended Processors**
|
||||
|
||||
ZenDNN is optimized for AMD EPYC™ processors and AMD Ryzen™ processors based on "Zen" microarchitecture and newer.
|
||||
|
||||
| CPU Family | Status | Notes |
|
||||
|:-----------------------------:|:-------:|:----------------------------------:|
|
||||
| AMD EPYC™ 9005 Series (Turin) | Support | 5th Gen - Zen 5 architecture |
|
||||
| AMD EPYC™ 9004 Series (Genoa) | Support | 4th Gen - Zen 4 architecture |
|
||||
| AMD EPYC™ 7003 Series (Milan) | Support | 3rd Gen - Zen 3 architecture |
|
||||
| AMD Ryzen™ AI MAX (Strix Halo)| Support | High-performance mobile processors |
|
||||
|
||||
*Notes:*
|
||||
|
||||
- Best performance is achieved on AMD EPYC™ processors with high core counts (e.g., EPYC 9005 series).
|
||||
- ZenDNN leverages AMD's advanced CPU features including AVX2 and AVX-512 instruction sets.
|
||||
- For optimal performance, ensure your system has sufficient memory bandwidth.
|
||||
|
||||
## Supported Operations
|
||||
|
||||
The ZenDNN backend accelerates **matrix multiplication (MUL_MAT)** and **expert-based matrix multiplication (MUL_MAT_ID)** operations. Other operations are handled by the standard CPU backend.
|
||||
|
||||
| Operation | Status | Notes |
|
||||
|:-------------|:-------:|:----------------------------------------------:|
|
||||
| MUL_MAT | Support | Accelerated via ZenDNN LowOHA MatMul |
|
||||
| MUL_MAT_ID | Support | Accelerated via ZenDNN LowOHA MatMul (MoE) |
|
||||
|
||||
*Note:* Since MUL_MAT and MUL_MAT_ID are accelerated, models will benefit most from ZenDNN when matrix multiplications dominate the computational workload (which is typical for transformer-based LLMs and Mixture-of-Experts models).
|
||||
|
||||
## DataType Supports
|
||||
|
||||
| DataType | Status | Notes |
|
||||
|:----------------------:|:-------:|:---------------------------------------------:|
|
||||
| FP32 | Support | Full precision floating point |
|
||||
| BF16 | Support | BFloat16 (best performance on Zen 4/Zen 5) |
|
||||
|
||||
*Notes:*
|
||||
|
||||
- **BF16** provides best performance on Zen 4 and Zen 5 EPYC™ processors (Genoa, Turin).
|
||||
|
||||
## Linux
|
||||
|
||||
### I. Setup Environment
|
||||
|
||||
You have two options to set up ZenDNN:
|
||||
|
||||
#### Option 1: Automatic Download and Build (Recommended)
|
||||
|
||||
CMake will automatically download and build ZenDNN for you:
|
||||
|
||||
```sh
|
||||
# Build llama.cpp - ZenDNN will be automatically downloaded and built
|
||||
cmake -B build -DGGML_ZENDNN=ON -DCMAKE_BUILD_TYPE=Release
|
||||
cmake --build build --config Release -j $(nproc)
|
||||
```
|
||||
|
||||
No manual ZenDNN installation required. CMake will handle everything automatically.
|
||||
|
||||
#### Option 2: Use Custom ZenDNN Installation
|
||||
|
||||
If you want to build ZenDNN yourself or use a specific version:
|
||||
|
||||
**Step 1: Build ZenDNN from source**
|
||||
|
||||
```sh
|
||||
# Clone ZenDNN repository
|
||||
git clone https://github.com/amd/ZenDNN.git
|
||||
cd ZenDNN
|
||||
|
||||
# Build and install (requires CMake >= 3.25)
|
||||
mkdir build && cd build
|
||||
cmake ..
|
||||
cmake --build . --target all
|
||||
```
|
||||
|
||||
Default installation path: `ZenDNN/build/install`
|
||||
|
||||
**For detailed build instructions**, refer to the [ZenDNN README](https://github.com/amd/ZenDNN/blob/a18adf8c605fb5f5e52cefd7eda08a7b18febbaf/README.md).
|
||||
|
||||
**Step 2: Build llama.cpp with custom ZenDNN path**
|
||||
|
||||
```sh
|
||||
# Using environment variable
|
||||
export ZENDNN_ROOT=/path/to/ZenDNN/build/install
|
||||
cmake -B build -DGGML_ZENDNN=ON -DCMAKE_BUILD_TYPE=Release
|
||||
cmake --build build --config Release -j $(nproc)
|
||||
|
||||
# OR specify path directly in CMake
|
||||
cmake -B build -DGGML_ZENDNN=ON -DZENDNN_ROOT=/path/to/ZenDNN/build/install -DCMAKE_BUILD_TYPE=Release
|
||||
cmake --build build --config Release -j $(nproc)
|
||||
```
|
||||
|
||||
### II. Run the Server
|
||||
|
||||
#### 1. Download Model
|
||||
|
||||
Download LLaMA 3.1 8B Instruct BF16 model:
|
||||
|
||||
```sh
|
||||
# Download from Hugging Face
|
||||
huggingface-cli download meta-llama/Llama-3.1-8B-Instruct-GGUF --local-dir models/
|
||||
```
|
||||
|
||||
#### 2. Start Server
|
||||
|
||||
Run llama.cpp server with ZenDNN acceleration:
|
||||
|
||||
```sh
|
||||
# Set optimal configuration
|
||||
export ZENDNNL_MATMUL_ALGO=1 # Blocked AOCL DLP algo for best performance
|
||||
|
||||
# Start server
|
||||
./build/bin/llama-server \
|
||||
-m models/Llama-3.1-8B-Instruct.BF16.gguf \
|
||||
--host 0.0.0.0 \
|
||||
--port 8080 \
|
||||
-t 64
|
||||
```
|
||||
|
||||
Access the server at `http://localhost:8080`.
|
||||
|
||||
**Performance tips**:
|
||||
- Use `ZENDNNL_MATMUL_ALGO=1` for optimal performance
|
||||
- For NUMA systems: `numactl --cpunodebind=0 --membind=0 ./build/bin/llama-server ...`
|
||||
|
||||
## Environment Variable
|
||||
|
||||
For environment variables related to ZenDNN, refer to the [ZenDNN Environment Variables Documentation](https://github.com/amd/ZenDNN/blob/a18adf8c605fb5f5e52cefd7eda08a7b18febbaf/docs/runtime_env.md).
|
||||
|
||||
### Performance Optimization
|
||||
|
||||
ZenDNN's LowOHA MatMul supports multiple backend algorithms. For **best performance**, use the **Blocked AOCL DLP** algorithm:
|
||||
|
||||
```sh
|
||||
export ZENDNNL_MATMUL_ALGO=1 # Blocked AOCL DLP algo (recommended)
|
||||
```
|
||||
|
||||
For more details on available algorithms, see the [ZenDNN MatMul Algorithm Documentation](https://github.com/amd/ZenDNN/blob/a18adf8c605fb5f5e52cefd7eda08a7b18febbaf/docs/runtime_env.md#algorithm-details).
|
||||
|
||||
### Profiling and Debugging
|
||||
|
||||
For detailed profiling and logging options, refer to the [ZenDNN Logging Documentation](https://github.com/amd/ZenDNN/blob/a18adf8c605fb5f5e52cefd7eda08a7b18febbaf/docs/logging.md).
|
||||
|
||||
## Known Issues
|
||||
|
||||
- **Limited operation support**: Currently matrix multiplication (MUL_MAT) and expert-based matrix multiplication (MUL_MAT_ID) are accelerated via ZenDNN. Other operations fall back to the standard CPU backend. Future updates may expand supported operations.
|
||||
- **BF16 support**: BF16 operations require AMD Zen 4 or Zen 5 architecture (EPYC 9004/9005 series). On older CPUs, operations will use FP32.
|
||||
- **NUMA awareness**: For multi-socket systems, manual NUMA binding may be required for optimal performance.
|
||||
|
||||
## Q&A
|
||||
|
||||
**Q: How do I verify that ZenDNN backend is being used?**
|
||||
|
||||
A: Check the log output when running llama.cpp. You should see messages indicating the ZenDNN backend is initialized. You can also check the backend name in the output.
|
||||
|
||||
**Q: What performance improvement can I expect?**
|
||||
|
||||
A: Performance gains vary depending on the model size, batch size, and CPU architecture. On AMD EPYC processors, you can typically expect 1.1x-2x speedup compared to standard CPU inference for matrix multiplication operations.
|
||||
|
||||
**Q: Can I use ZenDNN on non-AMD processors?**
|
||||
|
||||
A: ZenDNN is optimized specifically for AMD processors. While it may work on other x86-64 CPUs, performance benefits are only guaranteed on AMD Zen-based architectures.
|
||||
|
||||
**Q: Does ZenDNN support quantized models?**
|
||||
|
||||
A: Currently, ZenDNN primarily supports FP32 and BF16 data types. Quantized model support is not available at this time.
|
||||
|
||||
**Q: Why is my inference not faster with ZenDNN?**
|
||||
|
||||
A: Ensure:
|
||||
1. You're using an AMD EPYC or Ryzen processor (Zen 2 or newer)
|
||||
2. `ZENDNNL_MATMUL_ALGO=1` is set for best performance (Blocked AOCL DLP)
|
||||
3. You're using a sufficiently large model (small models may not benefit as much)
|
||||
4. Enable profiling to verify ZenDNN MatMul is being called
|
||||
|
||||
### **GitHub Contribution**:
|
||||
Please add the **[ZenDNN]** prefix/tag in issues/PRs titles to help the ZenDNN-team check/address them without delay.
|
||||
|
||||
## TODO
|
||||
|
||||
- Expand operation support beyond MUL_MAT and MUL_MAT_ID (attention operations, activations, etc.)
|
||||
90
docs/backend/snapdragon/CMakeUserPresets.json
Normal file
90
docs/backend/snapdragon/CMakeUserPresets.json
Normal file
@@ -0,0 +1,90 @@
|
||||
{
|
||||
"version": 5,
|
||||
"configurePresets": [
|
||||
{
|
||||
"name": "arm64-android-snapdragon",
|
||||
"hidden": true,
|
||||
"architecture": { "value": "arm64", "strategy": "external" },
|
||||
"toolset": { "value": "host=x86_64", "strategy": "external" },
|
||||
"cacheVariables": {
|
||||
"ANDROID_ABI": "arm64-v8a",
|
||||
"ANDROID_PLATFORM": "android-31",
|
||||
"CMAKE_TOOLCHAIN_FILE": "$env{ANDROID_NDK_ROOT}/build/cmake/android.toolchain.cmake",
|
||||
"CMAKE_C_FLAGS": "-march=armv8.7a+fp16 -fvectorize -ffp-model=fast -fno-finite-math-only -flto -D_GNU_SOURCE",
|
||||
"CMAKE_CXX_FLAGS": "-march=armv8.7a+fp16 -fvectorize -ffp-model=fast -fno-finite-math-only -flto -D_GNU_SOURCE",
|
||||
"CMAKE_C_FLAGS_RELEASE": "-O3 -DNDEBUG",
|
||||
"CMAKE_CXX_FLAGS_RELEASE": "-O3 -DNDEBUG",
|
||||
"CMAKE_C_FLAGS_RELWITHDEBINFO": "-O3 -DNDEBUG -g",
|
||||
"CMAKE_CXX_FLAGS_RELWITHDEBINFO": "-O3 -DNDEBUG -g",
|
||||
"CMAKE_PREFIX_PATH": "$env{OPENCL_SDK_ROOT}",
|
||||
"HEXAGON_SDK_ROOT": "$env{HEXAGON_SDK_ROOT}",
|
||||
"HEXAGON_TOOLS_ROOT": "$env{HEXAGON_TOOLS_ROOT}",
|
||||
"PREBUILT_LIB_DIR": "android_aarch64",
|
||||
"GGML_OPENMP": "OFF",
|
||||
"GGML_LLAMAFILE": "OFF",
|
||||
"GGML_OPENCL": "ON",
|
||||
"GGML_HEXAGON": "ON",
|
||||
"GGML_HEXAGON_FP32_QUANTIZE_GROUP_SIZE": "128",
|
||||
"LLAMA_OPENSSL": "OFF"
|
||||
}
|
||||
},
|
||||
|
||||
{
|
||||
"name": "arm64-windows-snapdragon",
|
||||
"inherits": [ "base", "arm64-windows-llvm" ],
|
||||
"cacheVariables": {
|
||||
"CMAKE_C_FLAGS": "-march=armv8.7a+fp16 -fvectorize -ffp-model=fast -flto -D_GNU_SOURCE",
|
||||
"CMAKE_CXX_FLAGS": "-march=armv8.7a+fp16 -fvectorize -ffp-model=fast -flto -D_GNU_SOURCE",
|
||||
"CMAKE_C_FLAGS_RELEASE": "-O3 -DNDEBUG",
|
||||
"CMAKE_CXX_FLAGS_RELEASE": "-O3 -DNDEBUG",
|
||||
"CMAKE_C_FLAGS_RELWITHDEBINFO": "-O3 -DNDEBUG -g",
|
||||
"CMAKE_CXX_FLAGS_RELWITHDEBINFO": "-O3 -DNDEBUG -g",
|
||||
"CMAKE_PREFIX_PATH": "$env{OPENCL_SDK_ROOT}",
|
||||
"HEXAGON_SDK_ROOT": "$env{HEXAGON_SDK_ROOT}",
|
||||
"HEXAGON_TOOLS_ROOT": "$env{HEXAGON_TOOLS_ROOT}",
|
||||
"PREBUILT_LIB_DIR": "windows_aarch64",
|
||||
"GGML_OPENMP": "OFF",
|
||||
"GGML_LLAMAFILE": "OFF",
|
||||
"GGML_OPENCL": "ON",
|
||||
"GGML_HEXAGON": "ON",
|
||||
"GGML_HEXAGON_FP32_QUANTIZE_GROUP_SIZE": "128",
|
||||
"LLAMA_OPENSSL": "OFF"
|
||||
}
|
||||
},
|
||||
|
||||
{
|
||||
"name": "arm64-linux-snapdragon",
|
||||
"hidden": true,
|
||||
"architecture": { "value": "arm64", "strategy": "external" },
|
||||
"toolset": { "value": "host=x86_64", "strategy": "external" },
|
||||
"cacheVariables": {
|
||||
"CMAKE_TOOLCHAIN_FILE": "cmake/arm64-linux-clang.cmake",
|
||||
"CMAKE_C_FLAGS": "-march=armv8 -fno-finite-math-only -flto -D_GNU_SOURCE",
|
||||
"CMAKE_CXX_FLAGS": "-march=armv8 -fno-finite-math-only -flto -D_GNU_SOURCE",
|
||||
"CMAKE_C_FLAGS_RELEASE": "-O3 -DNDEBUG",
|
||||
"CMAKE_CXX_FLAGS_RELEASE": "-O3 -DNDEBUG",
|
||||
"CMAKE_C_FLAGS_RELWITHDEBINFO": "-O3 -DNDEBUG -g",
|
||||
"CMAKE_CXX_FLAGS_RELWITHDEBINFO": "-O3 -DNDEBUG -g",
|
||||
"CMAKE_PREFIX_PATH": "$env{OPENCL_SDK_ROOT}",
|
||||
"HEXAGON_SDK_ROOT": "$env{HEXAGON_SDK_ROOT}",
|
||||
"HEXAGON_TOOLS_ROOT": "$env{HEXAGON_TOOLS_ROOT}",
|
||||
"PREBUILT_LIB_DIR": "linux_aarch64",
|
||||
"GGML_OPENMP": "OFF",
|
||||
"GGML_LLAMAFILE": "OFF",
|
||||
"GGML_OPENCL": "OFF",
|
||||
"GGML_HEXAGON": "ON",
|
||||
"GGML_HEXAGON_FP32_QUANTIZE_GROUP_SIZE": "128",
|
||||
"LLAMA_OPENSSL": "OFF"
|
||||
}
|
||||
},
|
||||
|
||||
{ "name": "arm64-android-snapdragon-debug" , "inherits": [ "base", "arm64-android-snapdragon", "debug" ] },
|
||||
{ "name": "arm64-android-snapdragon-release", "inherits": [ "base", "arm64-android-snapdragon", "release" ] },
|
||||
|
||||
{ "name": "arm64-windows-snapdragon-debug" , "inherits": [ "base", "arm64-windows-snapdragon", "debug" ] },
|
||||
{ "name": "arm64-windows-snapdragon-release", "inherits": [ "base", "arm64-windows-snapdragon", "release" ] },
|
||||
|
||||
{ "name": "arm64-linux-snapdragon-debug" , "inherits": [ "base", "arm64-linux-snapdragon", "debug" ] },
|
||||
{ "name": "arm64-linux-snapdragon-release", "inherits": [ "base", "arm64-linux-snapdragon", "release" ] }
|
||||
]
|
||||
}
|
||||
280
docs/backend/snapdragon/README.md
Normal file
280
docs/backend/snapdragon/README.md
Normal file
@@ -0,0 +1,280 @@
|
||||
# Snapdragon-based devices
|
||||
|
||||
## Setup
|
||||
|
||||
### Android
|
||||
|
||||
The easiest way to build llama.cpp for a Snapdragon-based Android device is using the toolchain Docker image (see github.com/snapdragon-toolchain).
|
||||
This image includes Android NDK, OpenCL SDK, Hexagon SDK, CMake, etc.
|
||||
|
||||
This method works on Linux, macOS, and Windows. macOS and Windows users should install Docker Desktop.
|
||||
|
||||
```
|
||||
~/src/llama.cpp$ docker run -it -u $(id -u):$(id -g) --volume $(pwd):/workspace --platform linux/amd64 ghcr.io/snapdragon-toolchain/arm64-android:v0.3
|
||||
[d]/> cd /workspace
|
||||
```
|
||||
|
||||
Note: The rest of the **Android** build process assumes that you're running inside the toolchain container.
|
||||
|
||||
### Windows On Snapdragon
|
||||
|
||||
Native Windows 11 arm64 builds has the following tools dependencies:
|
||||
- MS Visual Studio 2026 (Community Edition or Pro)
|
||||
- MSVC arm64 standard and runtime libraries
|
||||
- UCRT and Driver Kit
|
||||
- LLVM core libraries and Clang compiler (winget)
|
||||
- CMake, Git, Python (winget)
|
||||
- Hexagon SDK Community Edition 6.4 or later (see windows.md)
|
||||
- OpenCL SDK 2.3 or later (see windows.md)
|
||||
|
||||
Note: The rest of the **Windows** build process assumes that you're running natively in Powershell.
|
||||
Adapt below build commands accordingly.
|
||||
|
||||
## How to Build
|
||||
|
||||
Let's build llama.cpp with CPU, OpenCL, and Hexagon backends via CMake presets:
|
||||
|
||||
```
|
||||
[d]/workspace> cp docs/backend/snapdragon/CMakeUserPresets.json .
|
||||
|
||||
[d]/workspace> cmake --preset arm64-android-snapdragon-release -B build-snapdragon
|
||||
Preset CMake variables:
|
||||
ANDROID_ABI="arm64-v8a"
|
||||
...
|
||||
CMAKE_TOOLCHAIN_FILE="/opt/android-ndk-r28b/build/cmake/android.toolchain.cmake"
|
||||
GGML_HEXAGON="ON"
|
||||
GGML_OPENCL="ON"
|
||||
GGML_OPENMP="OFF"
|
||||
HEXAGON_SDK_ROOT="/opt/hexagon/6.4.0.2"
|
||||
...
|
||||
-- Including OpenCL backend
|
||||
-- Including Hexagon backend
|
||||
...
|
||||
-- Build files have been written to: /workspace/build-snapdragon
|
||||
|
||||
[d]/workspace> cmake --build build-snapdragon
|
||||
...
|
||||
[144/356] Performing build step for 'htp-v73'
|
||||
[1/16] Generating htp_iface_skel.c, htp_iface_stub.c, htp_iface.h
|
||||
[2/16] Building C object CMakeFiles/ggml-htp-v73.dir/hvx-sigmoid.c.obj
|
||||
[3/16] Building C object CMakeFiles/ggml-htp-v73.dir/htp-dma.c.obj
|
||||
[4/16] Building C object CMakeFiles/ggml-htp-v73.dir/worker-pool.c.obj
|
||||
...
|
||||
-- Installing: /workspace/build-snapdragon/ggml/src/ggml-hexagon/libggml-htp-v73.so
|
||||
-- Installing: /workspace/build-snapdragon/ggml/src/ggml-hexagon/libggml-htp-v75.so
|
||||
...
|
||||
```
|
||||
|
||||
To generate an installable "package" simply use cmake --install:
|
||||
|
||||
```
|
||||
[d]/workspace> cmake --install build-snapdragon --prefix pkg-snapdragon/llama.cpp
|
||||
-- Install configuration: "Release"
|
||||
-- Installing: /workspace/pkg-snapdragon/llama.cpp/lib/libggml-cpu.so
|
||||
-- Installing: /workspace/pkg-snapdragon/llama.cpp/lib/libggml-opencl.so
|
||||
-- Installing: /workspace/pkg-snapdragon/llama.cpp/lib/libggml-hexagon.so
|
||||
-- Installing: /workspace/pkg-snapdragon/llama.cpp/lib/libggml-htp-v73.so
|
||||
-- Installing: /workspace/pkg-snapdragon/llama.cpp/lib/libggml-htp-v75.so
|
||||
-- Installing: /workspace/pkg-snapdragon/llama.cpp/lib/libggml-htp-v79.so
|
||||
-- Installing: /workspace/pkg-snapdragon/llama.cpp/lib/libggml-htp-v81.so
|
||||
-- Installing: /workspace/pkg-snapdragon/llama.cpp/lib/libggml.so
|
||||
...
|
||||
-- Installing: /workspace/pkg-snapdragon/llama.cpp/bin/llama-bench
|
||||
-- Installing: /workspace/pkg-snapdragon/llama.cpp/bin/llama-cli
|
||||
...
|
||||
```
|
||||
|
||||
## How to Install
|
||||
|
||||
### Android
|
||||
|
||||
For this step, your device needs to be configured for on-device development.
|
||||
Please see https://developer.android.com/studio/debug/dev-options for details.
|
||||
|
||||
Once ADB is enabled, use `adb push` to install `pkg-snapdragon` on the device.
|
||||
**Note that the toolchain Docker image doesn't have ADB and doesn't set up the ADB bridge. Please use native ADB on the host.**
|
||||
|
||||
```
|
||||
~/src/llama.cpp$ adb push pkg-snapdragon/llama.cpp /data/local/tmp/
|
||||
pkg-snapdragon/llama.cpp/bin/: 67 files pushed, 0 skipped. 190.2 MB/s (919095042 bytes in 4.607s)
|
||||
pkg-snapdragon/llama.cpp/include/: 19 files pushed, 0 skipped. 20.5 MB/s (255173 bytes in 0.012s)
|
||||
pkg-snapdragon/llama.cpp/lib/: 16 files pushed, 0 skipped. 144.4 MB/s (43801382 bytes in 0.289s)
|
||||
102 files pushed, 0 skipped. 186.9 MB/s (963151597 bytes in 4.914s)
|
||||
```
|
||||
|
||||
At this point, you should also install some models:
|
||||
|
||||
```
|
||||
~/src/llama.cpp$ wget https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q4_0.gguf
|
||||
...
|
||||
2025-10-11 12:04:52 (10.7 MB/s) - ‘Llama-3.2-1B-Instruct-Q4_0.gguf’ saved [773025920/773025920]
|
||||
|
||||
~/src/llama.cpp$ adb push Llama-3.2-1B-Instruct-Q4_0.gguf /data/local/tmp/gguf
|
||||
Llama-3.2-1B-Instruct-Q4_0.gguf: 1 file pushed, 0 skipped. 38.3 MB/s (773025920 bytes in 19.250s)
|
||||
```
|
||||
|
||||
### Windows
|
||||
|
||||
All artifacts are already installed in the `pkg-snapdragon` folder.
|
||||
To run, adapt below instructions to use Powershell scripts in `scripts/snapdragon/windows`.
|
||||
|
||||
## How to Run
|
||||
|
||||
The easiest way to run llama.cpp cli tools is using provided wrapper scripts that properly set up all required environment variables.
|
||||
|
||||
llama.cpp supports three backends on Snapdragon-based devices: CPU, Adreno GPU (GPUOpenCL), and Hexagon NPU (HTP0-4).
|
||||
You can select which backend to run the model on using the `D=` variable, which maps to the `--device` option.
|
||||
|
||||
Hexagon NPU behaves as a "GPU" device when it comes to `-ngl` and other offload-related options.
|
||||
|
||||
Here are some examples of running various llama.cpp tools via ADB.
|
||||
|
||||
Simple question for Llama-3.2-1B
|
||||
|
||||
```
|
||||
~/src/llama.cpp$ M=Llama-3.2-1B-Instruct-Q4_0.gguf D=HTP0 ./scripts/snapdragon/adb/run-completion.sh -p "what is the most popular cookie in the world?"
|
||||
...
|
||||
ggml-hex: Hexagon backend (experimental) : allocating new registry : ndev 1
|
||||
ggml-hex: Hexagon Arch version v79
|
||||
ggml-hex: allocating new session: HTP0
|
||||
ggml-hex: new session: HTP0 : session-id 0 domain-id 3 uri file:///libggml-htp-v79.so?htp_iface_skel_handle_invoke&_modver=1.0&_dom=cdsp&_session=0 handle 0xb4000072c7955e50
|
||||
...
|
||||
load_tensors: offloading output layer to GPU
|
||||
load_tensors: offloaded 17/17 layers to GPU
|
||||
load_tensors: CPU model buffer size = 225.49 MiB
|
||||
load_tensors: HTP0 model buffer size = 0.26 MiB
|
||||
load_tensors: HTP0-REPACK model buffer size = 504.00 MiB
|
||||
...
|
||||
I hope this helps you understand the world's most popular cookies! [end of text]
|
||||
...
|
||||
llama_perf_sampler_print: sampling time = 30.08 ms / 487 runs ( 0.06 ms per token, 16191.77 tokens per second)
|
||||
llama_perf_context_print: load time = 617.94 ms
|
||||
llama_perf_context_print: prompt eval time = 80.76 ms / 11 tokens ( 7.34 ms per token, 136.21 tokens per second)
|
||||
llama_perf_context_print: eval time = 9210.59 ms / 475 runs ( 19.39 ms per token, 51.57 tokens per second)
|
||||
llama_perf_context_print: total time = 9454.92 ms / 486 tokens
|
||||
llama_perf_context_print: graphs reused = 473
|
||||
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
||||
llama_memory_breakdown_print: | - HTP0 (Hexagon) | 2048 = 2048 + ( 0 = 0 + 0 + 0) + 0 |
|
||||
llama_memory_breakdown_print: | - Host | 439 = 225 + 136 + 77 |
|
||||
llama_memory_breakdown_print: | - HTP0-REPACK | 504 = 504 + 0 + 0 |
|
||||
```
|
||||
|
||||
Summary request for OLMoE-1B-7B. This is a large model that requires two HTP sessions/devices
|
||||
|
||||
```
|
||||
~/src/llama.cpp$ M=OLMoE-1B-7B-0125-Instruct-Q4_0.gguf NDEV=2 D=HTP0,HTP1 ./scripts/snapdragon/adb/run-completion.sh -f surfing.txt
|
||||
...
|
||||
ggml-hex: Hexagon backend (experimental) : allocating new registry : ndev 1
|
||||
ggml-hex: Hexagon Arch version v81
|
||||
ggml-hex: allocating new session: HTP0
|
||||
ggml-hex: allocating new session: HTP1
|
||||
...
|
||||
load_tensors: offloading output layer to GPU
|
||||
load_tensors: offloaded 17/17 layers to GPU
|
||||
load_tensors: CPU model buffer size = 143.86 MiB
|
||||
load_tensors: HTP1 model buffer size = 0.23 MiB
|
||||
load_tensors: HTP1-REPACK model buffer size = 1575.00 MiB
|
||||
load_tensors: HTP0 model buffer size = 0.28 MiB
|
||||
load_tensors: HTP0-REPACK model buffer size = 2025.00 MiB
|
||||
...
|
||||
llama_context: CPU output buffer size = 0.19 MiB
|
||||
llama_kv_cache: HTP1 KV buffer size = 238.00 MiB
|
||||
llama_kv_cache: HTP0 KV buffer size = 306.00 MiB
|
||||
llama_kv_cache: size = 544.00 MiB ( 8192 cells, 16 layers, 1/1 seqs), K (q8_0): 272.00 MiB, V (q8_0): 272.00 MiB
|
||||
llama_context: HTP0 compute buffer size = 15.00 MiB
|
||||
llama_context: HTP1 compute buffer size = 15.00 MiB
|
||||
llama_context: CPU compute buffer size = 24.56 MiB
|
||||
...
|
||||
llama_perf_context_print: prompt eval time = 1730.57 ms / 212 tokens ( 8.16 ms per token, 122.50 tokens per second)
|
||||
llama_perf_context_print: eval time = 5624.75 ms / 257 runs ( 21.89 ms per token, 45.69 tokens per second)
|
||||
llama_perf_context_print: total time = 7377.33 ms / 469 tokens
|
||||
llama_perf_context_print: graphs reused = 255
|
||||
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
||||
llama_memory_breakdown_print: | - HTP0 (Hexagon) | 2048 = 2048 + ( 0 = 0 + 0 + 0) + 0 |
|
||||
llama_memory_breakdown_print: | - HTP1 (Hexagon) | 2048 = 2048 + ( 0 = 0 + 0 + 0) + 0 |
|
||||
llama_memory_breakdown_print: | - Host | 742 = 144 + 544 + 54 |
|
||||
llama_memory_breakdown_print: | - HTP1-REPACK | 1575 = 1575 + 0 + 0 |
|
||||
llama_memory_breakdown_print: | - HTP0-REPACK | 2025 = 2025 + 0 + 0 |
|
||||
```
|
||||
|
||||
Op test for MUL_MAT
|
||||
|
||||
```
|
||||
~/src/llama.cpp$ HB=0 ./scripts/snapdragon/adb/run-tool.sh test-backend-ops -b HTP0 -o MUL_MAT
|
||||
...
|
||||
Backend 2/3: HTP0
|
||||
Device description: Hexagon
|
||||
Device memory: 2048 MB (2048 MB free)
|
||||
MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
|
||||
MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=2,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
|
||||
MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=3,k=256,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=0,o=1): OK
|
||||
|
||||
~/src/llama.cpp-hexagon$ M=Llama-3.2-1B-Instruct-Q4_0.gguf ./scripts/snapdragon/adb/run-bench.sh -p 128 -n 64
|
||||
...
|
||||
ggml-hex: Hexagon backend (experimental) : allocating new registry : ndev 1
|
||||
ggml-hex: Hexagon Arch version v79
|
||||
ggml-hex: allocating new session: HTP0
|
||||
ggml-hex: new session: HTP0 : session-id 0 domain-id 3 uri file:///libggml-htp-v79.so?htp_iface_skel_handle_invoke&_modver=1.0&_dom=cdsp&_session=0 handle 0xb400007d4b231090
|
||||
| model | size | params | backend | ngl | threads | n_batch | mmap | test | t/s |
|
||||
| ---------------| ---------: | -----: | ---------- | --: | ------: | ------: | ---: | ----: | ------------: |
|
||||
| llama 1B Q4_0 | 729.75 MiB | 1.24 B | HTP | 99 | 4 | 128 | 0 | pp128 | 169.42 ± 1.75 |
|
||||
| llama 1B Q4_0 | 729.75 MiB | 1.24 B | HTP | 99 | 4 | 128 | 0 | tg64 | 51.54 ± 1.13 |
|
||||
|
||||
build: 6a8cf8914 (6733)
|
||||
```
|
||||
|
||||
## Environment variables
|
||||
|
||||
- `GGML_HEXAGON_NDEV=1`
|
||||
Controls the number of devices/sessions to allocate. The default is 1.
|
||||
Most quantized models under 4B fit into a single session; an 8B model needs two, and a 20B model needs four.
|
||||
|
||||
- `GGML_HEXAGON_NHVX=0`
|
||||
Controls the number of HVX hardware threads to use. The default is all (actual number varies depending on the hardware version).
|
||||
|
||||
- `GGML_HEXAGON_HOSTBUF=1`
|
||||
Controls whether the Hexagon backend allocates host buffers. By default, all buffers except for REPACK are host buffers.
|
||||
This option is required for testing Ops that require REPACK buffers (MUL_MAT and MUL_MAT_ID).
|
||||
|
||||
- `GGML_HEXAGON_VERBOSE=1`
|
||||
Enables verbose logging of Ops from the backend. Example output:
|
||||
|
||||
```
|
||||
ggml-hex: HTP0 graph-compute n_nodes 2
|
||||
ggml-hex: HTP0 matmul : blk.27.ffn_up.weight x ffn_norm-27 -> ffn_up-27 : 3072:8192 x 3072:1 -> 8192:1 : q4_0 x f32 -> f32 : HTP0 x HTP0 -> HTP0 : flags 0x1
|
||||
ggml-hex: HTP0 matmul : blk.27.ffn_gate.weight x ffn_norm-27 -> ffn_gate-27 : 3072:8192 x 3072:1 -> 8192:1 : q4_0 x f32 -> f32 : HTP0 x HTP0 -> HTP0 : flags 0x3
|
||||
ggml-hex: HTP0 graph-compute n_nodes 1
|
||||
ggml-hex: HTP0 matmul : blk.27.ffn_down.weight x ffn_gate_par-27 -> ffn_out-27 : 8192:3072 x 8192:1 -> 3072:1 : q4_0 x f32 -> f32 : HTP0 x HTP0 -> HTP0 : flags 0x0
|
||||
ggml-hex: HTP0 get-tensor result_output : data 0x7592487000 offset 0 size 513024
|
||||
```
|
||||
|
||||
- `GGML_HEXAGON_PROFILE=1`
|
||||
Enables Op profiling:
|
||||
|
||||
- `1` Basic profile with per-op `usecs` and `cycles` counters
|
||||
- `2` Extended profile with per-op `usecs`, `cycles` and default PMU counter data
|
||||
- `0x1,...,0x8` Extended profile with per-op `usecs`, `cycles` and custom PMU counter data
|
||||
|
||||
The logging output can be either saved into a file for post-processing or it can be piped directly into the post-processing tool to generate the report.
|
||||
Examples:
|
||||
|
||||
`GGML_HEXAGON_PROFILE=1 llama-completion ... |& ./scripts/snapdragon/ggml-hexagon-profile.py -`
|
||||
|
||||
- `GGML_HEXAGON_OPSTAGE=0x0`
|
||||
Allows enabling specific stages of the Op processing pipeline:
|
||||
|
||||
- `0x1` Enable Op Queue (i.e., queuing Ops into NPU)
|
||||
- `0x2` Enable Op Compute (MUL_MAT, etc.)
|
||||
|
||||
Examples:
|
||||
|
||||
`GGML_HEXAGON_OPSTAGE=0x1 llama-completion ...` - Ops are enqueued to the NPU but dma & compute are disabled
|
||||
`GGML_HEXAGON_OPSTAGE=0x3 llama-completion ...` - Full queuing and processing of Ops (default)
|
||||
|
||||
- `GGML_HEXAGON_OPFILTER=regex`
|
||||
Allows filtering (disabling) Ops that match the regex pattern:
|
||||
|
||||
Examples:
|
||||
|
||||
`GGML_HEXAGON_OPFILTER="FLASH_ATTN_EXT" llama-completion ...` - Disable Flash Attention on Hexagon (falls back to CPU or GPU)
|
||||
`GGML_HEXAGON_OPFILTER="ADD\|SUB" llama-completion ...` - Disable ADD and SUB on Hexagon (fall back to CPU or GPU)
|
||||
109
docs/backend/snapdragon/developer.md
Normal file
109
docs/backend/snapdragon/developer.md
Normal file
@@ -0,0 +1,109 @@
|
||||
# Hexagon backend developer details
|
||||
|
||||
## Backend libraries
|
||||
|
||||
The Hexagon backend consist of two parts:
|
||||
|
||||
- `libggml-hexagon`
|
||||
This is the regular CPU-side GGML backend library, either shared or statically linked
|
||||
|
||||
- `libggml-htp-vNN`
|
||||
This is the NPU-side (HTP stands for Hexagon Tensor Processor) shared library that contains the Op dispatcher and kernels.
|
||||
The correct library is selected automatically at runtime based on the HW version.
|
||||
|
||||
Here is an example of the build artifacts
|
||||
|
||||
```
|
||||
~/src/llama.cpp$ ls -l pkg-adb/llama.cpp/lib/libggml*
|
||||
pkg-adb/llama.cpp/lib/libggml-base.so
|
||||
pkg-adb/llama.cpp/lib/libggml-cpu.so
|
||||
pkg-adb/llama.cpp/lib/libggml-hexagon.so <<< CPU library
|
||||
pkg-adb/llama.cpp/lib/libggml-htp-v73.so <<< HTP op/kernels for Hexagon v73
|
||||
pkg-adb/llama.cpp/lib/libggml-htp-v75.so
|
||||
pkg-adb/llama.cpp/lib/libggml-htp-v79.so
|
||||
pkg-adb/llama.cpp/lib/libggml-htp-v81.so
|
||||
```
|
||||
|
||||
## Memory buffers
|
||||
|
||||
Hexagon NPU backend takes advantage of the Snapdragon's unified memory model where all buffers are fully accessible by the CPU and GPU.
|
||||
The NPU does have a dedicated tightly-coupled memory called VTCM but that memory is used only for intermediate data (e.g. dynamically
|
||||
quantized tensors) or temporary data (chunks of the weight tensors fetched via DMA).
|
||||
|
||||
Please note that currently the Hexagon backend does not implement SET/GET_ROWS Ops because there is no advantage in offloading those
|
||||
to the NPU at this point.
|
||||
|
||||
The backend does allocates non-host buffers for the tensors with datatypes that require repacking: Q4_0, Q8_0, MXFP4.
|
||||
From the MMU perspective these buffers are still regular buffers (normal access by the CPU) they are marked as non-host simply to force
|
||||
the repacking.
|
||||
|
||||
## Large model handling
|
||||
|
||||
Hexagon NPU session (aka Process Domain (PD) in the Hexagon docs) is limited to a memory mapping of around 3.5GB.
|
||||
In llama.cpp/GGML the Hexagon session is mapped to a single GGML backend device (HTP0, HTP1, etc).
|
||||
|
||||
In order to map models larger than 3.5GB we need to allocate multiple devices and split the model.
|
||||
For this we're taking advantage of the llama.cpp/GGML multi-GPU layer-splitting support.
|
||||
Each Hexagon device behaves like a GPU from the offload and model splitting perspective.
|
||||
|
||||
Here is an example of running GPT-OSS-20B model on a newer Snapdragon device with 16GB of DDR.
|
||||
|
||||
```
|
||||
M=gpt-oss-20b-Q4_0.gguf NDEV=4 D=HTP0,HTP1,HTP2,HTP3 P=surfing.txt scripts/snapdragon/adb/run-completion.sh -f surfing.txt -n 32
|
||||
...
|
||||
LD_LIBRARY_PATH=/data/local/tmp/llama.cpp/lib
|
||||
ADSP_LIBRARY_PATH=/data/local/tmp/llama.cpp/lib
|
||||
GGML_HEXAGON_NDEV=4 ./bin/llama-cli --no-mmap -m /data/local/tmp/llama.cpp/../gguf/gpt-oss-20b-Q4_0.gguf
|
||||
-t 4 --ctx-size 8192 --batch-size 128 -ctk q8_0 -ctv q8_0 -fa on -ngl 99 --device HTP0,HTP1,HTP2,HTP3 -no-cnv -f surfing.txt
|
||||
...
|
||||
llama_model_loader: - type f32: 289 tensors
|
||||
llama_model_loader: - type q4_0: 96 tensors
|
||||
llama_model_loader: - type q8_0: 2 tensors
|
||||
llama_model_loader: - type mxfp4: 72 tensors
|
||||
...
|
||||
load_tensors: offloaded 25/25 layers to GPU
|
||||
load_tensors: CPU model buffer size = 1182.09 MiB
|
||||
load_tensors: HTP1 model buffer size = 6.64 MiB
|
||||
load_tensors: HTP1-REPACK model buffer size = 2505.94 MiB
|
||||
load_tensors: HTP3 model buffer size = 5.55 MiB
|
||||
load_tensors: HTP3-REPACK model buffer size = 2088.28 MiB
|
||||
load_tensors: HTP0 model buffer size = 7.75 MiB
|
||||
load_tensors: HTP0-REPACK model buffer size = 2923.59 MiB
|
||||
load_tensors: HTP2 model buffer size = 6.64 MiB
|
||||
load_tensors: HTP2-REPACK model buffer size = 2505.94 MiB
|
||||
...
|
||||
llama_context: n_ctx_per_seq (8192) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
|
||||
llama_context: CPU output buffer size = 0.77 MiB
|
||||
llama_kv_cache_iswa: creating non-SWA KV cache, size = 8192 cells
|
||||
llama_kv_cache: HTP1 KV buffer size = 25.50 MiB
|
||||
llama_kv_cache: HTP3 KV buffer size = 25.50 MiB
|
||||
llama_kv_cache: HTP0 KV buffer size = 25.50 MiB
|
||||
llama_kv_cache: HTP2 KV buffer size = 25.50 MiB
|
||||
llama_kv_cache: size = 102.00 MiB ( 8192 cells, 12 layers, 1/1 seqs), K (q8_0): 51.00 MiB, V (q8_0): 51.00 MiB
|
||||
llama_kv_cache_iswa: creating SWA KV cache, size = 256 cells
|
||||
llama_kv_cache: HTP1 KV buffer size = 0.80 MiB
|
||||
llama_kv_cache: HTP3 KV buffer size = 0.53 MiB
|
||||
llama_kv_cache: HTP0 KV buffer size = 1.06 MiB
|
||||
llama_kv_cache: HTP2 KV buffer size = 0.80 MiB
|
||||
llama_kv_cache: size = 3.19 MiB ( 256 cells, 12 layers, 1/1 seqs), K (q8_0): 1.59 MiB, V (q8_0): 1.59 MiB
|
||||
llama_context: HTP0 compute buffer size = 16.06 MiB
|
||||
llama_context: HTP1 compute buffer size = 16.06 MiB
|
||||
llama_context: HTP2 compute buffer size = 16.06 MiB
|
||||
llama_context: HTP3 compute buffer size = 16.06 MiB
|
||||
llama_context: CPU compute buffer size = 98.19 MiB
|
||||
...
|
||||
llama_perf_context_print: prompt eval time = 3843.67 ms / 197 tokens ( 19.51 ms per token, 51.25 tokens per second)
|
||||
llama_perf_context_print: eval time = 1686.13 ms / 31 runs ( 54.39 ms per token, 18.39 tokens per second)
|
||||
llama_perf_context_print: total time = 6266.30 ms / 228 tokens
|
||||
llama_perf_context_print: graphs reused = 30
|
||||
llama_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
|
||||
llama_memory_breakdown_print: | - HTP0 (Hexagon) | 2048 = 2048 + ( 0 = 0 + 0 + 0) + 0 |
|
||||
llama_memory_breakdown_print: | - HTP1 (Hexagon) | 2048 = 2048 + ( 0 = 0 + 0 + 0) + 0 |
|
||||
llama_memory_breakdown_print: | - HTP2 (Hexagon) | 2048 = 2048 + ( 0 = 0 + 0 + 0) + 0 |
|
||||
llama_memory_breakdown_print: | - HTP3 (Hexagon) | 2048 = 2048 + ( 0 = 0 + 0 + 0) + 0 |
|
||||
llama_memory_breakdown_print: | - Host | 1476 = 1208 + 105 + 162 |
|
||||
llama_memory_breakdown_print: | - HTP1-REPACK | 2505 = 2505 + 0 + 0 |
|
||||
llama_memory_breakdown_print: | - HTP3-REPACK | 2088 = 2088 + 0 + 0 |
|
||||
llama_memory_breakdown_print: | - HTP0-REPACK | 2923 = 2923 + 0 + 0 |
|
||||
llama_memory_breakdown_print: | - HTP2-REPACK | 2505 = 2505 + 0 + 0 |
|
||||
```
|
||||
58
docs/backend/snapdragon/linux.md
Normal file
58
docs/backend/snapdragon/linux.md
Normal file
@@ -0,0 +1,58 @@
|
||||
# Snapdragon-based Linux devices
|
||||
|
||||
## Docker Setup
|
||||
|
||||
The easiest way to build llama.cpp for a Snapdragon-based Linux device is using the toolchain Docker image (see [github.com/snapdragon-toolchain](https://github.com/snapdragon-toolchain)).
|
||||
This image includes OpenCL SDK, Hexagon SDK, CMake, and the ARM64 Linux cross-compilation toolchain.
|
||||
|
||||
Cross-compilation is supported on **Linux X86** hosts. The resulting binaries are deployed to and run on the target **Qualcomm Snapdragon ARM64 Linux** device.
|
||||
|
||||
```
|
||||
~/src/llama.cpp$ docker run -it -u $(id -u):$(id -g) --volume $(pwd):/workspace --platform linux/amd64 ghcr.io/snapdragon-toolchain/arm64-linux:v0.1
|
||||
[d]/> cd /workspace
|
||||
```
|
||||
|
||||
Note: The rest of the **Linux** build process assumes that you're running inside the toolchain container.
|
||||
|
||||
|
||||
## How to Build
|
||||
|
||||
Let's build llama.cpp with CPU, OpenCL, and Hexagon backends via CMake presets:
|
||||
|
||||
```
|
||||
[d]/workspace> cp docs/backend/snapdragon/CMakeUserPresets.json .
|
||||
|
||||
[d]/workspace> cmake --preset arm64-linux-snapdragon-release -B build-snapdragon
|
||||
|
||||
[d]/workspace> cmake --build build-snapdragon -j $(nproc)
|
||||
```
|
||||
|
||||
To generate an installable "package" simply use cmake --install, then zip it:
|
||||
|
||||
```
|
||||
[d]/workspace> cmake --install build-snapdragon --prefix pkg-snapdragon
|
||||
[d]/workspace> zip -r pkg-snapdragon.zip pkg-snapdragon
|
||||
```
|
||||
|
||||
## How to Install
|
||||
|
||||
For this step, you will deploy the built binaries and libraries to the target Linux device. Transfer `pkg-snapdragon.zip` to the target device, then unzip it and set up the environment variables:
|
||||
|
||||
```
|
||||
$ unzip pkg-snapdragon.zip
|
||||
$ cd pkg-snapdragon
|
||||
$ export LD_LIBRARY_PATH=./lib
|
||||
$ export ADSP_LIBRARY_PATH=./lib
|
||||
```
|
||||
|
||||
At this point, you should also download some models onto the device:
|
||||
|
||||
```
|
||||
$ wget https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct-Q4_0.gguf
|
||||
```
|
||||
|
||||
## How to Run
|
||||
Next, since we have setup the environment variables, we can run the llama-cli with the Hexagon backends:
|
||||
```
|
||||
$ ./bin/llama-cli -m Llama-3.2-3B-Instruct-Q4_0.gguf --device HTP0 -ngl 99 -p "what is the most popular cookie in the world?"
|
||||
```
|
||||
161
docs/backend/snapdragon/windows.md
Normal file
161
docs/backend/snapdragon/windows.md
Normal file
@@ -0,0 +1,161 @@
|
||||
## Overview
|
||||
|
||||
The document covers procedures for installing the latest GPU and NPU drivers, and OpenCL and Hexagon SDKs.
|
||||
|
||||
|
||||
In order to use Hexagon NPU on Snapdragon Windows devices the underlying HTP Ops libraries (e.g libggml-htp-v73.so)
|
||||
must be included in the .cat file digitally signed with a trusted certificate.
|
||||
|
||||
This document covers details on how to generate personal certificate files (.pfx) and how to configure the system
|
||||
to allow for test signatures (aka test-signing).
|
||||
|
||||
## Install the latest Adreno OpenCL SDK
|
||||
|
||||
Either use the trimmed down version (optimized for CI) from
|
||||
|
||||
https://github.com/snapdragon-toolchain/opencl-sdk/releases/download/v2.3.2/adreno-opencl-sdk-v2.3.2-arm64-wos.tar.xz
|
||||
|
||||
Or download the complete official version from
|
||||
|
||||
https://softwarecenter.qualcomm.com/catalog/item/Adreno_OpenCL_SDK?version=2.3.2
|
||||
|
||||
Unzip/untar the archive into
|
||||
```
|
||||
c:\Qualcomm\OpenCL_SDK\2.3.2
|
||||
```
|
||||
|
||||
## Install the latest Hexagon SDK Community Edition
|
||||
|
||||
Either use the trimmed down version (optimized for CI) from
|
||||
|
||||
https://github.com/snapdragon-toolchain/hexagon-sdk/releases/download/v6.4.0.2/hexagon-sdk-v6.4.0.2-arm64-wos.tar.xz
|
||||
|
||||
Or download the complete official version from
|
||||
|
||||
https://softwarecenter.qualcomm.com/catalog/item/Hexagon_SDK?version=6.4.0.2
|
||||
|
||||
Unzip/untar the archive into
|
||||
```
|
||||
c:\Qualcomm\Hexagon_SDK\6.4.0.2
|
||||
```
|
||||
|
||||
## Install the latest Adreno GPU driver
|
||||
|
||||
Download the driver from
|
||||
|
||||
https://softwarecenter.qualcomm.com/catalog/item/Windows_Graphics_Driver
|
||||
|
||||
After the automated installation and reboot please make sure that the GPU device shows up in the `Device Manager` (under 'Display Adapters`)
|
||||
|
||||
## Install the latest Qualcomm NPU driver
|
||||
|
||||
Download the driver from
|
||||
|
||||
https://softwarecenter.qualcomm.com/catalog/item/Qualcomm_HND
|
||||
|
||||
After the automated installation and reboot please make sure that the Hexagon NPU device shows up in the `Device Manager` (under `Neural Processors`).
|
||||
|
||||
If the device is not available you can try installing all components (`qcnspmcdm8380`, `qcnspmcdm8380_ext`) manually.
|
||||
The components are extracted into
|
||||
```
|
||||
c:\QCDrivers\qcnspmcdm...
|
||||
```
|
||||
|
||||
## Enable NPU driver test signatures
|
||||
|
||||
Please note that the following steps are required only for the Hexagon NPU.
|
||||
Adreno GPU backend does not require test signatures.
|
||||
|
||||
### Enable testsigning
|
||||
|
||||
Use `bcdedit` to enable test-signing
|
||||
```
|
||||
> bcdedit /set TESTSIGNING ON
|
||||
```
|
||||
(Secure Boot may need to be disabled for this to work)
|
||||
|
||||
Make sure test-signing is enabled after reboot
|
||||
```
|
||||
> bcdedit /enum
|
||||
...
|
||||
testsigning Yes
|
||||
...
|
||||
```
|
||||
For additional details see Microsoft guide at
|
||||
|
||||
https://learn.microsoft.com/en-us/windows-hardware/drivers/install/the-testsigning-boot-configuration-option
|
||||
|
||||
### Create personal certificate
|
||||
|
||||
The tools required for this procedure are available as part of Windows SDK and Windows Driver Kit which should be
|
||||
installed as part of the MS Visual Studio.
|
||||
They are typically located at
|
||||
```
|
||||
c:\Program Files (x86)\Windows Kits\10\bin\10.0.26100.0
|
||||
```
|
||||
(replace 10.0.26100.0 with correct version).
|
||||
|
||||
To create personal self-signed certificate run the following commands (either from cmd or power-shell):
|
||||
```
|
||||
> cd c:\Users\MyUser
|
||||
> mkdir Certs
|
||||
> cd Certs
|
||||
> makecert -r -pe -ss PrivateCertStore -n CN=GGML.HTP.v1 -eku 1.3.6.1.5.5.7.3.3 -sv ggml-htp-v1.pvk ggml-htp-v1.cer
|
||||
> pvk2pfx.exe -pvk ggml-htp-v1.pvk -spc ggml-htp-v1.cer -pfx ggml-htp-v1.pfx
|
||||
```
|
||||
(replace `MyUser` with your username).
|
||||
|
||||
Add this certificate to `Trusted Root Certification Authorities` and `Trusted Publishers` stores.
|
||||
This can be done using `certlm` Certificate Manager tool.
|
||||
Right click on the certificate store, select `All Tasks -> Import` and follow the prompts to import the certificate from the
|
||||
PFX file you created above.
|
||||
|
||||
For additional details see Microsoft guide at
|
||||
|
||||
https://learn.microsoft.com/en-us/windows-hardware/drivers/install/introduction-to-test-signing
|
||||
|
||||
Make sure to save the PFX file, you will need it for the build procedures.
|
||||
Please note that the same certificate can be used for signing any number of builds.
|
||||
|
||||
## Build Hexagon backend with signed HTP ops libraries
|
||||
|
||||
The overall Hexagon backend build procedure for Windows on Snapdragon is the same as for other platforms.
|
||||
However, additional settings are required for generating and signing HTP Ops libraries.
|
||||
```
|
||||
> $env:OPENCL_SDK_ROOT="C:\Qualcomm\OpenCL_SDK\2.3.2"
|
||||
> $env:HEXAGON_SDK_ROOT="C:\Qualcomm\Hexagon_SDK\6.4.0.2"
|
||||
> $env:HEXAGON_TOOLS_ROOT="C:\Qualcomm\Hexagon_SDK\6.4.0.2\tools\HEXAGON_Tools\19.0.04"
|
||||
> $env:HEXAGON_HTP_CERT="c:\Users\MyUsers\Certs\ggml-htp-v1.pfx"
|
||||
> $env:WINDOWS_SDK_BIN="C:\Program Files (x86)\Windows Kits\10\bin\10.0.26100.0\arm64"
|
||||
|
||||
> cmake --preset arm64-windows-snapdragon-release -B build-wos
|
||||
...
|
||||
> cmake --install build-wos --prefix pkg-snapdragon
|
||||
```
|
||||
|
||||
Once the build is complete HTP ops libraries will be installed like this
|
||||
```
|
||||
> dir pkg-snapdragon/lib
|
||||
...
|
||||
-a---- 1/22/2026 6:01 PM 187656 libggml-htp-v73.so
|
||||
-a---- 1/22/2026 6:01 PM 191752 libggml-htp-v75.so
|
||||
-a---- 1/22/2026 6:01 PM 187656 libggml-htp-v79.so
|
||||
-a---- 1/22/2026 6:01 PM 187656 libggml-htp-v81.so
|
||||
-a---- 1/22/2026 6:01 PM 4139 libggml-htp.cat
|
||||
```
|
||||
|
||||
The .cat file, the signature and proper certificate installation can be verified with
|
||||
|
||||
```
|
||||
> signtool.exe verify /v /pa .\pkg-snapdragon\lib\libggml-htp.cat
|
||||
Verifying: .\pkg-snapdragon\lib\libggml-htp.cat
|
||||
|
||||
Signature Index: 0 (Primary Signature)
|
||||
Hash of file (sha256): 9820C664DA59D5EAE31DBB664127FCDAEF59CDC31502496BC567544EC2F401CF
|
||||
|
||||
Signing Certificate Chain:
|
||||
Issued to: GGML.HTP.v1
|
||||
...
|
||||
Successfully verified: .\pkg-snapdragon\lib\libggml-htp.cat
|
||||
...
|
||||
```
|
||||
66
docs/backend/zDNN.md
Normal file
66
docs/backend/zDNN.md
Normal file
@@ -0,0 +1,66 @@
|
||||
# llama.cpp for IBM zDNN Accelerator
|
||||
|
||||
> [!WARNING]
|
||||
> **Note:** zDNN is **not** the same as ZenDNN.
|
||||
> - **zDNN** (this page): IBM's Deep Neural Network acceleration library for IBM Z & LinuxONE Mainframes
|
||||
> - **ZenDNN**: AMD's deep learning library for AMD EPYC CPUs ([see ZenDNN documentation](ZenDNN.md))
|
||||
|
||||
## Background
|
||||
|
||||
IBM zDNN (Z Deep Neural Network) is a hardware acceleration library designed specifically to leverage the IBM NNPA (Neural Network Processor Assist) accelerator located within IBM Telum I and II processors. It provides significant performance improvements for neural network inference operations.
|
||||
|
||||
### Llama.cpp + IBM zDNN
|
||||
|
||||
The llama.cpp zDNN backend is designed to enable llama.cpp on IBM z17 and later systems via the IBM zDNN hardware acceleration library.
|
||||
|
||||
## Software & Hardware Support
|
||||
|
||||
| Hardware Level | Status | Verified |
|
||||
| -------------------- | ------------- | -------------------------- |
|
||||
| IBM z17 / LinuxONE 5 | Supported | RHEL 9.6, IBM z17, 40 IFLs |
|
||||
| IBM z16 / LinuxONE 4 | Not Supported | |
|
||||
|
||||
## Data Types Supported
|
||||
|
||||
| Data Type | Status |
|
||||
| --------- | --------- |
|
||||
| F32 | Supported |
|
||||
| F16 | Supported |
|
||||
| BF16 | Supported |
|
||||
|
||||
## CMake Options
|
||||
|
||||
The IBM zDNN backend has the following CMake options that control the behaviour of the backend.
|
||||
|
||||
| CMake Option | Default Value | Description |
|
||||
| ------------ | ------------- | ----------------------------------- |
|
||||
| `GGML_ZDNN` | `OFF` | Compile llama.cpp with zDNN support |
|
||||
| `ZDNN_ROOT` | `""` | Override zDNN library lookup |
|
||||
|
||||
## 1. Install zDNN Library
|
||||
|
||||
Note: Using the zDNN library provided via `apt` or `yum` may not work correctly as reported in [#15772](https://github.com/ggml-org/llama.cpp/issues/15772). It is preferred that you compile from source.
|
||||
|
||||
```sh
|
||||
git clone --recurse-submodules https://github.com/IBM/zDNN
|
||||
cd zDNN
|
||||
|
||||
autoreconf .
|
||||
./configure --prefix=/opt/zdnn-libs
|
||||
|
||||
make build
|
||||
sudo make install
|
||||
```
|
||||
|
||||
## 2. Build llama.cpp
|
||||
|
||||
```sh
|
||||
git clone https://github.com/ggml-org/llama.cpp
|
||||
cd llama.cpp
|
||||
|
||||
cmake -S . -G Ninja -B build \
|
||||
-DCMAKE_BUILD_TYPE=Release \
|
||||
-DGGML_ZDNN=ON \
|
||||
-DZDNN_ROOT=/opt/zdnn-libs
|
||||
cmake --build build --config Release -j$(nproc)
|
||||
```
|
||||
Reference in New Issue
Block a user