diff --git a/docs_v1.0/API_V1.0.0/DEPLOY/GEM4_LLM_DEPLOY_PLAN_V1.0.0.md b/docs_v1.0/API_V1.0.0/DEPLOY/GEM4_LLM_DEPLOY_PLAN_V1.0.0.md
index dcdc309..610618f 100644
--- a/docs_v1.0/API_V1.0.0/DEPLOY/GEM4_LLM_DEPLOY_PLAN_V1.0.0.md
+++ b/docs_v1.0/API_V1.0.0/DEPLOY/GEM4_LLM_DEPLOY_PLAN_V1.0.0.md
@@ -1,159 +1,316 @@
 ---
-document_type: "deployment_plan"
+document_type: "deployment_record"
 service: "MOMENTRY_CORE"
-title: "Gemma 4 LLM Deployment Plan: M5 Max MacBook Pro"
+title: "Gemma 4 31B: M5 Max Deployment Record"
 date: "2026-05-06"
-version: "V1.0"
-status: "draft"
+version: "V1.1"
+status: "active"
 owner: "Warren"
 created_by: "OpenCode"
 ---
 
-# Gemma 4 LLM Deployment Plan: M5 Max
+# Gemma 4 31B: M5 Max Deployment Record
 
 ## 1. Environment
 
-| Item | Spec |
-|------|------|
-| Machine | MacBook Pro M5 Max |
-| Unified memory | 48 GB |
-| Architecture | arm64 (Apple Silicon) |
-| SSH | `accusys@10.10.10.10` |
-| Internet | ❌ None (files must be copied over with scp from the local machine) |
-| Local machine | M4 Mac with internet access and llama.cpp already installed |
+| Item | M4 (dev machine) | M5 Max (LLM server) |
+|------|------------|-------------------|
+| Machine | MacBook Pro M4 | MacBook Pro M5 Max |
+| Memory | 16 GB | **48 GB** |
+| Architecture | arm64 | arm64 |
+| OS | macOS 26.x | macOS 26.4.1 |
+| IP (initial) | n/a | 10.10.10.10 |
+| IP (final) | n/a | **192.168.110.201** |
+| Internet | Yes | None at first, available later (after joining the 192.168.110.x subnet) |
+| Homebrew | Yes | No (the user is not an admin and cannot run sudo brew) |
+| Xcode CLT | Yes | No (install_name_tool and codesign are unavailable) |
+| Rust | Yes | rustup installed (1.95.0) |
+| Project directory | `/Users/accusys/momentry_core_0.1/` | `~/momentry_core_0.1/` (cloned) |
 
-## 2. Model Specifications
 
-| Version | Parameters | Q5_K_M size | Estimated speed | Notes |
-|------|------|------------|---------|------|
-| **Gemma 4 31B-it** | 33B | ~20 GB | 15-25 tok/s | Multimodal, can process images |
-| Gemma 4 26B-A4B-it | 27B MoE | ~15 GB | 25-40 tok/s | MoE, faster |
-| Gemma 4 E4B-it | 8B | ~5 GB | 60+ tok/s | Fastest, lower quality |
+| Attribute | Value |
+|------|-----|
+| Model | **Gemma 4 31B-it** (Image-Text-to-Text) |
+| Parameters | 30.7B (30,697,345,596) |
+| Quantization | Q5_K_M |
+| GGUF size | **20.17 GiB** (`21658399744 bytes`) |
+| Embedding dim | 5376 |
+| Vocabulary | 262144 |
+| Context | 4096 (trained at 262144) |
+| Source | `unsloth/gemma-4-31B-it-GGUF` |
+| HF downloads | 1,685,377 |
+| HF access | Gated (requires `huggingface-cli login`) |
+| License | Gemma (Apache 2.0 derived) |
 
-**Recommended**: Gemma 4 31B-it (Q5_K_M). 48 GB of memory is more than enough.
+## 3. Binary and Dependencies
 
-## 3. Deployment Steps
+### 3.1 Build Method
 
-### Step 1: Download the model on the local machine
+llama.cpp is built from source rather than installed through Homebrew, because the Homebrew binary references its dylibs by **absolute path** and cannot be relocated to the M5.
 
 ```bash
-# Log in to HuggingFace (access token required)
-huggingface-cli login
-
-# Download the Gemma 4 31B GGUF
-huggingface-cli download bartowski/gemma-4-31B-it-GGUF \
-  gemma-4-31b-it-Q5_K_M.gguf \
-  --local-dir ~/llama.cpp/models/
+# Run on the M4
+cd /tmp
+git clone https://github.com/ggerganov/llama.cpp.git
+cd llama.cpp
+cmake -B build -DGGML_METAL=ON
+cmake --build build -j10 --target llama-server
 ```
 
-### Step 2: Prepare the llama.cpp binary
+### 3.2 Binary Dependencies
+
+The llama-server binary depends on the following dylibs (26 files in total):
+
+| Category | Files | Source |
+|------|------|------|
+| Core GGML | `libggml.0.dylib`, `libggml.dylib` | `build/bin/` |
+| Core GGML | `libggml-base.0.dylib`, `libggml-base.dylib` | `build/bin/` |
+| Metal GPU | `libggml-metal.0.dylib`, `libggml-metal.dylib` | `build/bin/` |
+| CPU | `libggml-cpu.0.dylib`, `libggml-cpu.dylib` | `build/bin/` |
+| BLAS | `libggml-blas.0.dylib`, `libggml-blas.dylib` | `build/bin/` |
+| LLama | `libllama.0.dylib`, `libllama.dylib` | `build/bin/` |
+| LLamaCommon | `libllama-common.0.dylib`, `libllama-common.dylib` | `build/bin/` |
+| MTMD | `libmtmd.0.dylib`, `libmtmd.dylib` | `build/bin/` |
+| OpenSSL | `libssl.3.dylib`, `libcrypto.3.dylib` | `/opt/homebrew/opt/openssl@3/lib/` |
+
+### 3.3 Fixing @rpath
+
+The @rpath embedded at build time points to `/tmp/llama.cpp/build/bin/` and must be changed to `@executable_path/../lib`.
+
+Run on the **M4** (where Xcode CLT is available):
 
 ```bash
-# llama.cpp is already installed locally at /opt/homebrew/bin/llama-server
-# Collect the dylibs it depends on
-mkdir -p /tmp/llama_bundle/bin /tmp/llama_bundle/lib
-cp /opt/homebrew/bin/llama-server /tmp/llama_bundle/bin/
-cp /opt/homebrew/lib/libggml*.dylib /tmp/llama_bundle/lib/
-cp /opt/homebrew/lib/libllama*.dylib /tmp/llama_bundle/lib/
+cp build/bin/llama-server /tmp/llama_final
+chmod +w /tmp/llama_final
+
+# Fix the absolute OpenSSL paths
+install_name_tool -change /opt/homebrew/opt/openssl@3/lib/libssl.3.dylib @rpath/libssl.3.dylib /tmp/llama_final
+install_name_tool -change /opt/homebrew/opt/openssl@3/lib/libcrypto.3.dylib @rpath/libcrypto.3.dylib /tmp/llama_final
+
+# Fix the absolute GGML paths (needed only for a Homebrew build, not a source build)
+install_name_tool -change /opt/homebrew/opt/ggml/lib/libggml.0.dylib @rpath/libggml.0.dylib /tmp/llama_final
+install_name_tool -change /opt/homebrew/opt/ggml/lib/libggml-base.0.dylib @rpath/libggml-base.0.dylib /tmp/llama_final
+
+# Fix @rpath
+install_name_tool -delete_rpath /tmp/llama.cpp/build/bin /tmp/llama_final
+install_name_tool -add_rpath @executable_path/../lib /tmp/llama_final
+
+# Re-sign (install_name_tool invalidates the code signature)
+codesign --force --sign - /tmp/llama_final
 ```
 
-### Step 3: scp to the M5 Max
+### 3.4 libssl.3.dylib Itself Also Needs Fixing
+
+libssl.3.dylib itself also references `/opt/homebrew/Cellar/openssl@3/3.6.1/lib/libcrypto.3.dylib`:
 
 ```bash
-# Transfer the binary
-scp -r /tmp/llama_bundle/* accusys@10.10.10.10:~/bin/
-
-# Transfer the model
-scp ~/llama.cpp/models/gemma-4-31b-it-Q5_K_M.gguf \
-  accusys@10.10.10.10:~/models/
-
-# Transfer the models (for very large files, split the transfer or use rsync)
-rsync -avz --progress ~/llama.cpp/models/ accusys@10.10.10.10:~/models/
+cp /opt/homebrew/opt/openssl@3/lib/libssl.3.dylib /tmp/libssl_fixed.dylib
+cp /opt/homebrew/opt/openssl@3/lib/libcrypto.3.dylib /tmp/libcrypto_fixed.dylib
+chmod +w /tmp/libssl_fixed.dylib /tmp/libcrypto_fixed.dylib
+install_name_tool -change /opt/homebrew/Cellar/openssl@3/3.6.1/lib/libcrypto.3.dylib @loader_path/libcrypto.3.dylib /tmp/libssl_fixed.dylib
+codesign --force --sign - /tmp/libssl_fixed.dylib /tmp/libcrypto_fixed.dylib
 ```
 
-### Step 4: Start on the M5 Max
+### 3.5 Transfer Everything to the M5
 
 ```bash
-ssh accusys@10.10.10.10
+# Model (20 GB)
+scp ~/llama.cpp/models/gemma-4-31B-it-Q5_K_M.gguf \
+  accusys@192.168.110.201:~/models/
 
-# Set the library path
-export DYLD_LIBRARY_PATH=~/bin/lib:$DYLD_LIBRARY_PATH
-
-# Start llama-server
-~/bin/bin/llama-server \
-  -m ~/models/gemma-4-31b-it-Q5_K_M.gguf \
-  --host 0.0.0.0 \
-  --port 8081 \
-  --n-gpu-layers 999 \
-  --ctx-size 8192 \
-  --threads 10 \
-  --parallel 2 \
-  --mlock
+# Binary + all dylibs
+ssh accusys@192.168.110.201 'rm -rf ~/llama && mkdir -p ~/llama/bin ~/llama/lib'
+scp /tmp/llama_final accusys@192.168.110.201:~/llama/bin/llama-server
+scp /tmp/llama.cpp/build/bin/*.dylib accusys@192.168.110.201:~/llama/lib/
+scp /tmp/libssl_fixed.dylib accusys@192.168.110.201:~/llama/lib/libssl.3.dylib
+scp /tmp/libcrypto_fixed.dylib accusys@192.168.110.201:~/llama/lib/libcrypto.3.dylib
 ```
 
-## 4. Memory Allocation
+## 4. Startup and Verification
 
-```
-48 GB total
-  ├─ 20 GB  Gemma 4 31B Q5_K_M
-  ├─ 4 GB   PostgreSQL
-  ├─ 1 GB   Redis
-  ├─ 1 GB   MongoDB + Qdrant
-  ├─ 2 GB   swift_face / face_processor (burst)
-  ├─ 3 GB   llama-server overhead
-  └─ 17 GB  remaining (OS + buffer)
-```
-
-## 5. Momentry Integration
-
-Update `.env` or the config:
+### 4.1 One-Off Manual Start
 
 ```bash
-MOMENTRY_LLM_ENDPOINT=http://10.10.10.10:8081/v1
-MOMENTRY_LLM_MODEL=gemma-4-31b-it
+ssh accusys@192.168.110.201
+export DYLD_LIBRARY_PATH=$HOME/llama/lib
+codesign --force --sign - ~/llama/bin/llama-server
+codesign --force --sign - ~/llama/lib/*.dylib
+nohup ~/llama/bin/llama-server \
+  -m ~/models/gemma-4-31B-it-Q5_K_M.gguf \
+  --host 0.0.0.0 --port 8081 \
+  --n-gpu-layers 999 --ctx-size 4096 \
+  --threads 10 --mlock \
+  --reasoning off \
+  > ~/llama.log 2>&1 &
 ```
 
-Switch the agent endpoints to the LLM:
-- `POST /api/v1/agents/translate` → llama.cpp server
-- `POST /api/v1/agents/identity/suggest` → llama.cpp server
-- `POST /api/v1/agents/5w1h/analyze` → llama.cpp server
-- `POST /api/v1/agents/suggest/merge` → llama.cpp server
+### 4.2 Startup Script
 
-## 6. Testing and Verification
+`~/start_llm.sh` (already created):
 
 ```bash
-# Health check
-curl http://10.10.10.10:8081/health
+#!/bin/bash
+export DYLD_LIBRARY_PATH=$HOME/llama/lib
+pkill -9 -f llama-server 2>/dev/null
+sleep 1
+nohup $HOME/llama/bin/llama-server \
+  -m $HOME/models/gemma-4-31B-it-Q5_K_M.gguf \
+  --host 0.0.0.0 --port 8081 \
+  --n-gpu-layers 999 --ctx-size 4096 \
+  --threads 10 --mlock \
+  --reasoning off \
+  > $HOME/llama.log 2>&1 &
+echo "llama-server PID: $!"
+```
 
-# Inference test
-curl http://10.10.10.10:8081/v1/chat/completions \
+### 4.3 Parameters
+
+| Parameter | Value | Description |
+|------|-----|------|
+| `-m` | `~/models/gemma-4-31B-it-Q5_K_M.gguf` | Model path |
+| `--host` | `0.0.0.0` | Bind to all network interfaces |
+| `--port` | `8081` | HTTP API port |
+| `--n-gpu-layers` | `999` | Offload all layers to the GPU (Metal) |
+| `--ctx-size` | `4096` | Context length |
+| `--threads` | `10` | Number of M5 Max P-cores |
+| `--mlock` | (none) | Lock memory to prevent swapping |
+| `--reasoning` | `off` | Disable thinking; otherwise the output lands in `reasoning_content` |
+| `DYLD_LIBRARY_PATH` | `~/llama/lib` | dylib search path |
+
+### 4.4 Problems Encountered During Startup
+
+| # | Problem | Cause | Fix |
+|---|------|------|------|
+| 1 | `Library not loaded: libmtmd.0.dylib` | Required dylibs (including MTMD) had not been copied | Copy all 26 dylibs from the build |
+| 2 | `Library not loaded: /opt/homebrew/.../libssl.3.dylib` | The binary contains absolute OpenSSL paths | `install_name_tool -change → @rpath` |
+| 3 | `Killed: 9` (exit 137) | The code signature was invalidated | `codesign --force --sign -` |
+| 4 | `Library not loaded: /opt/homebrew/Cellar/.../libcrypto.3.dylib` | libssl.3.dylib itself also contains an absolute path | Fix libssl with `install_name_tool` |
+| 5 | `no backends are loaded` | The Metal GPU backend is missing | Build from source with `-DGGML_METAL=ON` |
+| 6 | `couldn't bind HTTP server socket` | The previous process had not fully released the port | Run `pkill -9 -f llama-server` first |
+| 7 | **All content ends up in reasoning_content** | Gemma4 is a thinking model by default | `--reasoning off` |
+
+## 5. API Verification
+
+### 5.1 Health Check
+
+```bash
+curl -s http://192.168.110.201:8081/health
+# → {"status":"ok"}
+```
+
+### 5.2 Inference Test (with --reasoning off)
+
+```bash
+curl -s http://192.168.110.201:8081/v1/chat/completions \
   -H "Content-Type: application/json" \
   -d '{
-    "model": "gemma-4-31b-it",
+    "model": "gemma-4-31B-it-Q5_K_M.gguf",
     "messages": [{"role": "user", "content": "Hello"}],
     "max_tokens": 100
   }'
 ```
 
-## 7. Startup Script (on the M5 Max)
+Response (OpenAI-compatible):
 
 ```bash
-#!/bin/bash
-# ~/start_llm.sh
-export DYLD_LIBRARY_PATH=~/bin/lib:$DYLD_LIBRARY_PATH
-exec ~/bin/bin/llama-server \
-  -m ~/models/gemma-4-31b-it-Q5_K_M.gguf \
-  --host 0.0.0.0 --port 8081 \
-  --n-gpu-layers 999 --ctx-size 8192 \
-  --threads 10 --parallel 2 --mlock \
-  >> ~/llama.log 2>&1
+```json
+{
+  "choices": [{
+    "finish_reason": "stop",
+    "message": {
+      "role": "assistant",
+      "content": "Hello! How can I help you today?",
+      "reasoning_content": ""
+    }
+  }],
+  "usage": {
+    "completion_tokens": 100,
+    "prompt_tokens": 18,
+    "total_tokens": 118
+  },
+  "model": "gemma-4-31B-it-Q5_K_M.gguf",
+  "object": "chat.completion"
+}
 ```
 
-## 8. Risks and Fallbacks
+### 5.3 Performance
 
-| Risk | Fallback |
+| Metric | Measured |
 |------|------|
-| GGUF download fails (HF gated) | Use ollama pull + ollama export to GGUF |
-| Metal is incompatible on the M5 Max | Fall back to CPU only (`--n-gpu-layers 0`) |
-| 31B is too large and too slow | Switch to 26B-A4B (MoE, faster) |
-| scp transfer is interrupted | Resume with rsync --partial |
+| Prompt speed | 60.8 tok/s |
+| Generation speed | **25.8 tok/s** |
+| Prompt latency | 296 ms (18 tokens) |
+| Generation latency | 387 ms (10 tokens) |
+
+## 6. OpenCode Integration
+
+Add a provider to `~/.config/opencode/config.json`:
+
+```json
+{
+  "m5-gemma4": {
+    "npm": "@ai-sdk/openai-compatible",
+    "name": "M5 Max Gemma 4",
+    "options": { "baseURL": "http://192.168.110.201:8081/v1" },
+    "models": {
+      "gemma-4-31B-it-Q5_K_M.gguf": { "name": "Gemma 4 31B" }
+    }
+  }
+}
+```
+
+Set the default model to `"m5-gemma4/gemma-4-31B-it-Q5_K_M.gguf"`. Confirm it appears in the provider list:
+
+```bash
+opencode models m5-gemma4
+# → m5-gemma4/gemma-4-31B-it-Q5_K_M.gguf
+```
+
+## 7. M5 Network Change Log
+
+| When | IP | Network | Reason |
+|------|-----|------|------|
+| Initial | `10.10.10.10` | bridge (Thunderbolt) | No internet access; traffic had to go through NAT on the M4 |
+| After the switch | `192.168.110.201` | en0 (WiFi/Ethernet) | Moved onto the same subnet, with internet access |
+
+## 8. Rust Installation (for Momentry dev)
+
+```bash
+curl --proto "=https" --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
+source $HOME/.cargo/env
+```
+
+- rustc 1.95.0
+- cargo 1.95.0
+- No sudo required
+
+## 9. Memory Usage
+
+```
+48 GB total
+  ├─ 20 GB  Gemma 4 31B Q5_K_M (process RSS ~28 GB)
+  ├─ 4 GB   macOS + system
+  └─ 24 GB  remaining
+```
+
+Measured RSS after startup: `28,325,600 KB` (~28 GB), so actual headroom is closer to 16 GB than the 24 GB shown above.
+
+## 10. Maintenance Commands
+
+| Operation | Command |
+|------|------|
+| Start | `ssh accusys@192.168.110.201 '~/start_llm.sh'` |
+| Stop | `ssh accusys@192.168.110.201 'pkill -9 -f llama-server'` |
+| View logs | `ssh accusys@192.168.110.201 'tail -50 ~/llama.log'` |
+| Health check | `curl http://192.168.110.201:8081/health` |
+| Model file | `~/models/gemma-4-31B-it-Q5_K_M.gguf (20G)` |
+| Binary and libs | `~/llama/bin/llama-server`, `~/llama/lib/*.dylib` |
+| config | `~/.config/opencode/config.json` |
+| Monitor | `htop -p $(pgrep llama-server)` |
+| Memory | `ps -o rss= -p $(pgrep llama-server)` |
+
+## 11. Known Limitations
+
+- **Thinking model**: Gemma4 is a thinking model (`content` is populated normally once thinking is disabled with `--reasoning off`, but some scenarios may need reasoning)
+- **No Homebrew**: The account is not an admin, so `brew install` is unavailable. Other Momentry services (PostgreSQL, Redis, MongoDB) must be installed manually from portable binaries
+- **No Xcode CLT**: `install_name_tool` and `codesign` are not available on the M5. Binary fixes must be completed on the M4 and then copied over with scp