From f492a960773878bda96ceff7c422c28bc0a82e2d Mon Sep 17 00:00:00 2001
From: Warren
Date: Thu, 25 Jun 2026 00:43:57 +0800
Subject: [PATCH] Distributed storage research: Ceph (shelved) + MinIO guide + DedupS3 design

---
 AGENTS.md                         |  90 +++++
 docs/CEPH_INTEGRATION_ANALYSIS.md | 328 +++++++++++++++++
 docs/DEDUP_S3_COMBINATION.md      | 563 ++++++++++++++++++++++++++++++
 docs/MINIO_INTEGRATION.md         | 382 ++++++++++++++++++++
 4 files changed, 1363 insertions(+)
 create mode 100644 docs/CEPH_INTEGRATION_ANALYSIS.md
 create mode 100644 docs/DEDUP_S3_COMBINATION.md
 create mode 100644 docs/MINIO_INTEGRATION.md

diff --git a/AGENTS.md b/AGENTS.md
index 910c852..6b2c63a 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -4861,3 +4861,93 @@ let signing_key = match dialect {
 **MS-SMB2 specification**:
 - §3.1.4.1: Signing key derivation per dialect
 - §3.1.4.2: SP 800-108 KDF for SMB 3.x
+
+---
+
+## Distributed Storage Research (2026-06-25)
+
+### Ceph RADOS Integration (shelved)
+
+**Document**: `docs/CEPH_INTEGRATION_ANALYSIS.md` (328 lines)
+
+**Conclusion**: Does not fit MarkBase's macOS cross-platform positioning
+- ❌ Linux-only (librados.so FFI)
+- ⚠️ Requires a full Ceph cluster (Monitor + OSD + MGR)
+- ⚠️ Deployment complexity exceeds the Lightweight positioning
+
+**Recommended alternatives**:
+1. MinIO: S3-compatible, already supported by S3Vfs
+2. 
Built-in distributed storage: DedupFs + S3Vfs combination
+
+---
+
+### MinIO Integration (recommended ⭐⭐⭐⭐⭐)
+
+**Document**: `docs/MINIO_INTEGRATION.md` (382 lines)
+
+**Advantages**:
+- ✅ S3-compatible API (S3Vfs already exists; no code changes needed)
+- ✅ Cross-platform (macOS/Linux/Windows)
+- ✅ Lightweight deployment (a single node is enough)
+- ✅ High performance (erasure coding + distributed scaling)
+
+**Deployment**:
+```bash
+# macOS single node
+brew install minio/stable/minio
+minio server /data --console-address ":9001"
+
+# Docker
+docker run minio/minio server /data --console-address ":9001"
+```
+
+**Integration**: environment variables + CLI flags
+```bash
+export MB_S3_ENDPOINT=http://localhost:9000
+export MB_S3_BUCKET=markbase
+export MB_S3_ACCESS_KEY=minioadmin
+export MB_S3_SECRET_KEY=minioadmin
+```
+
+**Supported features**:
+- Versioning (substitute for Snapshot)
+- Bucket Policy (ACL management)
+- Lifecycle Rules (backup cleanup)
+- Erasure Coding (automatic fault tolerance)
+
+---
+
+### DedupFs + S3Vfs Combination (design ⭐⭐⭐⭐)
+
+**Document**: `docs/DEDUP_S3_COMBINATION.md` (563 lines)
+
+**Goal**: Distributed dedup storage (dedup blocks shared across nodes)
+
+**Architecture**:
+```
+MarkBase Node A → DedupS3Store → MinIO Cluster
+                                      ↓
+MarkBase Node B → DedupS3Store → shares Node A's blocks
+```
+
+**Core design**:
+- Blocks stored as S3 objects (SHA-256 hash as key)
+- Reference counts stored in S3 object metadata
+- Manifests stored as S3 objects (JSON)
+
+**Implementation effort**: ~1000 lines (7 phases)
+
+**Key technical challenges**:
+- Reference count updates are not atomic (needs versioning or a distributed lock)
+- Network latency overhead (~5-10ms per block vs ~0.1ms local)
+- Dedup ratio varies by file type (VM images ~80%, photos ~5%)
+
+**Next steps**:
+1. Phase 1: DedupS3Store struct + basic I/O (~300 lines)
+2. Phase 2: CLI integration (~100 lines)
+3. 
Phase 3: Performance benchmark
+
+---
+
+**Last updated**: 2026-06-25
+**Version**: 1.61 (distributed storage research complete)
diff --git a/docs/CEPH_INTEGRATION_ANALYSIS.md b/docs/CEPH_INTEGRATION_ANALYSIS.md
new file mode 100644
index 0000000..3be09f4
--- /dev/null
+++ b/docs/CEPH_INTEGRATION_ANALYSIS.md
@@ -0,0 +1,328 @@
+# Ceph RADOS Integration Analysis for MarkBase
+
+**Date**: 2026-06-25
+**Status**: Shelved (does not fit the macOS cross-platform positioning)
+**Library**: ceph-async (4.0.5)
+**Constraint**: Linux-only (requires librados.so symlink)
+
+---
+
+## Executive Summary
+
+### Goal
+Add Ceph RADOS as a VfsBackend option for distributed, highly scalable storage.
+
+### Key Findings
+| Aspect | Finding |
+|--------|---------|
+| **Platform** | ❌ Linux-only (librados.so FFI, macOS needs Docker/VM) |
+| **Deployment** | ⚠️ Requires full cluster (Monitor + OSD + MGR) |
+| **Complexity** | ⚠️⚠️⚠️⚠️⚠️ High (exceeds the Lightweight positioning) |
+| **Positioning** | ❌ Does not fit MarkBase's macOS cross-platform positioning |
+
+### Recommendation
+**Shelved for now**. Prioritize instead:
+1. **MinIO**: S3-compatible, already supported by S3Vfs, cross-platform
+2. 
**Built-in distributed storage**: DedupFs + S3Vfs combination, lightweight
+
+---
+
+## Architecture Overview
+
+```
+┌────────────────────────────────────────────────────────────┐
+│ MarkBase Application Layer                                 │
+│ ├── SMB Server (Port 4445)                                 │
+│ ├── SFTP Server (Port 2024)                                │
+│ └── WebDAV Server (Port 11438)                             │
+└────────────────────────────────────────────────────────────┘
+                              ↓
+┌────────────────────────────────────────────────────────────┐
+│ VFS Abstraction Layer (VfsBackend trait)                   │
+│ ├── LocalFs: POSIX local filesystem                        │
+│ ├── S3Vfs: S3-compatible storage (HTTP API)                │
+│ ├── SmbVfs: SMB client backend                             │
+│ ├── CephVfs: Ceph RADOS backend (shelved)                  │
+│ ├── EncryptedFs: Encryption layer                          │
+│ ├── Compression: ZSTD/LZ4 compression layer                │
+│ ├── DedupFs: Block deduplication layer                     │
+│ └── RaidFs: RAID-Z emulation layer                         │
+└────────────────────────────────────────────────────────────┘
+                              ↓
+┌────────────────────────────────────────────────────────────┐
+│ Ceph Storage Cluster (RADOS)                               │
+│ ├── Monitor (MON): Cluster map, authentication             │
+│ ├── OSD Daemons: Object storage (data replication)         │
+│ ├── Manager (MGR): Dashboard, telemetry                    │
+│ ├── MDS (optional): CephFS metadata server                 │
+│ └── RGW (optional): S3/Swift gateway                       │
+└────────────────────────────────────────────────────────────┘
+```
+
+---
+
+## Library Analysis
+
+### Rust Ceph Crates
+
+| Crate | Version | Description | Platform |
+|-------|---------|-------------|----------|
+| `ceph` | 3.2.5 | Official librados FFI (sync) | Linux-only |
+| `ceph-async` | 4.0.5 | Async librados FFI (futures 0.3) | Linux-only |
+| `ceph-rbd` | 0.3.2 | RADOS Block Device bindings | Linux-only |
+
+### ceph-async Module Structure
+
+```
+ceph_async::
+├── CephClient  # Admin operations (OSD/Pool/Mon commands)
+├── rados::  # Low-level FFI bindings (100+ functions)
+│   ├── rados_read/write/stat/remove  # Object I/O
+│   ├── rados_pool_create/delete/lookup  # Pool management
+│   ├── 
rados_ioctx_*  # I/O context (pool handle)
+│   ├── rados_snap_*  # Snapshot management
+│   ├── rados_lock_*  # Distributed locking
+│   ├── rados_aio_*  # Async I/O
+│   ├── rados_omap_*  # Key-value store per object
+│   └── rados_write_op_* / rados_read_op_*  # Compound operations
+├── completion::  # Async completion handling
+├── read_stream::  # Async read stream
+├── write_sink::  # Async write sink
+└── list_stream::  # Async object listing
+```
+
+### CephClient API
+
+```rust
+let client = CephClient::new("admin", "/etc/ceph/ceph.conf")?;
+
+// OSD operations
+client.osd_tree()?;                    // Get OSD tree (CRUSH map)
+client.osd_out(osd_id)?;               // Mark OSD out
+client.osd_crush_remove(osd_id)?;      // Remove from CRUSH map
+
+// Pool operations
+client.osd_pool_get(pool, option)?;    // Get pool config
+client.osd_pool_set(pool, key, val)?;  // Set pool config
+client.osd_pool_quota_get(pool)?;      // Get pool quota
+
+// Cluster status
+client.status()?;                      // Cluster health
+client.mon_dump()?;                    // Monitor list
+client.version()?;                     // Ceph version
+```
+
+---
+
+## Implementation Phases
+
+| Phase | Task | Code Lines | Priority | Risk | Dependencies |
+|-------|------|------------|----------|------|--------------|
+| **Phase 1** | CephVfs struct + basic I/O | ~400 | P0 | Medium ⚠️⚠️⚠️ | ceph-async crate |
+| **Phase 2** | Pool management CLI | ~150 | P1 | Low ⚠️ | Phase 1 |
+| **Phase 3** | Snapshot support | ~200 | P2 | Medium ⚠️⚠️⚠️ | librados snap API |
+| **Phase 4** | Distributed locking | ~100 | P2 | Medium ⚠️⚠️⚠️ | librados lock API |
+| **Phase 5** | OMAP key-value | ~150 | P3 | Low ⚠️ | librados omap API |
+| **Phase 6** | Async integration | ~300 | P1 | High ⚠️⚠️⚠️⚠️ | async-vfs feature |
+| **Phase 7** | Docker test environment | ~50 | P0 | Low ⚠️ | Docker compose |
+| **Phase 8** | Performance benchmark | ~100 | P2 | Low ⚠️ | Benchmark scripts |
+| **Total** | | **~1450** | | | |
+
+---
+
+## Phase 1: CephVfs Core Implementation
+
+### Key Design Decisions
+
+**1. 
Object vs File mapping**:
+- RADOS is object storage (no directories)
+- Path `/foo/bar.txt` → Object `foo/bar.txt` in pool
+- Directories are simulated via zero-byte objects with a `/` suffix (like S3)
+
+**2. Pool-per-share vs single pool**:
+- Option A: Single pool + path prefix (simpler, less isolation)
+- Option B: Pool-per-share (better isolation, quota per pool)
+- **Recommend**: Option B (pool-per-share) for enterprise use
+
+**3. I/O context caching**:
+- Each pool requires a separate `rados_ioctx_t`
+- Cache the ioctx per share to avoid recreation overhead
+
+### CephVfs Struct (Draft)
+
+```rust
+pub struct CephVfs {
+    cluster: rados_t,        // RADOS cluster handle
+    pool_name: String,       // Pool name for this share
+    ioctx: rados_ioctx_t,    // I/O context (cached)
+    root_prefix: String,     // Path prefix within pool
+}
+
+pub struct CephVfsFile {
+    ioctx: rados_ioctx_t,
+    object_id: String,       // Object name in pool
+    position: u64,
+    write_buffer: Vec<u8>,   // Buffer for writes (flush on close)
+    size: u64,
+}
+```
+
+### VfsBackend Method Mapping
+
+| Method | RADOS equivalent | Complexity |
+|--------|-----------------|------------|
+| `read_dir()` | `rados_nobjects_list_*` | High (pagination) |
+| `open_file()` | Custom (object ops) | Medium |
+| `stat()` | `rados_stat()` | Low |
+| `create_dir()` | `rados_write_full(0-byte)` | Low |
+| `remove_dir()` | `rados_remove()` | Low |
+| `remove_file()` | `rados_remove()` | Low |
+| `rename()` | Custom (copy + delete) | Medium |
+| `exists()` | `rados_stat()` | Low |
+| `copy()` | `rados_clone_range()` | Low |
+| `hard_link()` | `rados_clone_range()` | Low |
+| `read_link()` | Unsupported | N/A |
+| `create_symlink()` | Unsupported | N/A |
+
+---
+
+## Risk Assessment
+
+| Risk | Level | Mitigation |
+|------|-------|------------|
+| **Linux-only** | ⚠️⚠️⚠️⚠️⚠️ Critical | Docker/VM for macOS; does not fit the cross-platform positioning |
+| **librados.so symlink** | ⚠️⚠️⚠️ Medium | Document setup; CI check |
+| **Pool-level snapshots** | ⚠️⚠️ Low | Document limitation; 
consider RGW |
+| **Async overhead** | ⚠️⚠️⚠️ Medium | Benchmark; spawn_blocking wrapper |
+| **Cluster complexity** | ⚠️⚠️⚠️⚠️⚠️ Critical | Exceeds the Lightweight positioning; Docker compose |
+| **SMB Oplocks integration** | ⚠️⚠️⚠️ Medium | RADOS locking API; careful design |
+
+---
+
+## Alternatives (recommended options)
+
+### Option comparison
+
+| Option | Cross-platform | Deployment complexity | Positioning fit | Status |
+|------|--------|-----------|---------|------|
+| **Ceph RADOS** | ❌ Linux-only | ⚠️⚠️⚠️⚠️⚠️ Very high | ❌ No match | Shelved |
+| **Ceph RGW (S3)** | ✅ HTTP API | ⚠️⚠️⚠️⚠️ High | ⭐⭐⭐ Medium | S3Vfs exists |
+| **MinIO** | ✅ All platforms | ⚠️⚠️ Low | ⭐⭐⭐⭐⭐ Full match | S3Vfs exists |
+| **GlusterFS** | ✅ POSIX | ⚠️⚠️⚠️ Medium | ⭐⭐⭐⭐ High | To be researched |
+| **Built-in distributed** | ✅ All platforms | ⚠️⚠️ Low | ⭐⭐⭐⭐⭐ Full match | Foundation exists |
+
+### Option 1: MinIO (recommended)
+
+**Advantages**:
+- ✅ S3-compatible API (S3Vfs already exists; no new code)
+- ✅ Single-node deployment (lightweight)
+- ✅ Cross-platform (macOS/Linux/Windows)
+- ✅ High performance (erasure coding)
+- ✅ Open source + enterprise edition
+
+**Deployment**:
+```bash
+# macOS single node
+minio server /data --console-address ":9001"
+
+# MarkBase configuration
+MB_S3_ENDPOINT=http://localhost:9000
+MB_S3_BUCKET=markbase
+```
+
+**Integration**: No code changes needed; S3Vfs already supports it.
+
+---
+
+### Option 2: Built-in distributed storage
+
+**Existing foundation**:
+| Feature | File | Distributed potential |
+|------|------|-----------|
+| DedupFs | dedup.rs | ✅ SHA-256 block store can be shared across nodes |
+| RaidFs | raid.rs | ⚠️ Single-node RAID-Z |
+| Send-Receive | send_receive.rs | ⚠️ Similar to ZFS send/receive |
+| Checksum | checksum.rs | ✅ Data integrity verification |
+| Compression | compression.rs | ✅ ZSTD compression |
+
+**Extension directions**:
+1. DedupFs + S3Vfs: store dedup blocks in MinIO/S3 (shared across nodes)
+2. Checksum + Replication: add cross-node replication
+3. 
Send-Receive + Remote: add remote replication
+
+---
+
+## Technical Details
+
+### librados API Functions
+
+**Object I/O**:
+- `rados_read(ioctx, oid, buf, len, offset)`: Read at offset
+- `rados_write(ioctx, oid, buf, len, offset)`: Write at offset
+- `rados_write_full(ioctx, oid, buf, len)`: Write entire object
+- `rados_append(ioctx, oid, buf, len)`: Append to object
+- `rados_stat(ioctx, oid, psize, pmtime)`: Get object size/mtime
+- `rados_remove(ioctx, oid)`: Delete object
+
+**Pool Operations**:
+- `rados_pool_create(cluster, pool_name)`: Create pool
+- `rados_pool_delete(cluster, pool_name)`: Delete pool
+- `rados_pool_lookup(cluster, pool_name)`: Find pool ID
+- `rados_ioctx_create(cluster, pool_name, ioctx)`: Create I/O context
+
+**Snapshots**:
+- `rados_ioctx_snap_create(ioctx, snap_name)`: Create pool snapshot
+- `rados_ioctx_snap_list(ioctx, snaps)`: List snapshots
+- `rados_ioctx_snap_remove(ioctx, snap_id)`: Delete snapshot
+- `rados_ioctx_snap_rollback(ioctx, oid, snap_id)`: Rollback object
+
+**Locking**:
+- `rados_lock_exclusive(ioctx, oid, name, cookie, desc, duration, flags)`: Exclusive lock
+- `rados_lock_shared(ioctx, oid, name, cookie, tag, desc, duration, flags)`: Shared lock
+- `rados_unlock(ioctx, oid, name, cookie)`: Release lock
+- `rados_list_lockers(ioctx, oid, name, ...)`: List lock holders
+
+**OMAP (Key-Value)**:
+- `rados_omap_set(ioctx, oid, map)`: Set key-value pairs
+- `rados_omap_get(ioctx, oid, ...)`: Get values by keys
+- `rados_omap_get_keys(ioctx, oid, ...)`: List keys
+- `rados_omap_rm_keys(ioctx, oid, keys)`: Delete keys
+
+**Async I/O**:
+- `rados_aio_read(ioctx, oid, completion, buf, len, offset)`: Async read
+- `rados_aio_write(ioctx, oid, completion, buf, len, offset)`: Async write
+- `rados_aio_flush(ioctx)`: Flush pending async ops
+- `rados_aio_wait_for_complete(completion)`: Wait for completion
+
+---
+
+## Open Questions
+
+1. **Deployment target**: Linux-only production vs macOS development?
+2. 
**Backend choice**: RADOS (librados) vs RGW (S3 API)?
+3. **Pool strategy**: Pool-per-share vs single pool + path prefix?
+4. **SMB Oplocks**: Should CephVfs support SMB Oplocks via RADOS locking?
+5. **Priority**: Start with basic I/O or full async integration first?
+
+---
+
+## Conclusion
+
+**Ceph RADOS integration is shelved for now**, because:
+1. ❌ The Linux-only constraint does not fit the macOS cross-platform positioning
+2. ⚠️ Deployment complexity exceeds the Lightweight positioning
+3. ⚠️ It requires a full Ceph cluster (Monitor + OSD + MGR)
+
+**Recommended alternatives**:
+1. ⭐⭐⭐⭐⭐ **MinIO**: S3-compatible, S3Vfs already exists, lightweight
+2. ⭐⭐⭐⭐⭐ **Built-in distributed storage**: DedupFs + S3Vfs combination
+
+**Follow-up actions**:
+- MinIO integration document (0 lines of code)
+- DedupFs + S3Vfs combination research (~100 lines)
+- Built-in Replication feature (~400 lines)
+
+---
+
+**Document created**: 2026-06-25
+**Last updated**: 2026-06-25
\ No newline at end of file
diff --git a/docs/DEDUP_S3_COMBINATION.md b/docs/DEDUP_S3_COMBINATION.md
new file mode 100644
index 0000000..564e25b
--- /dev/null
+++ b/docs/DEDUP_S3_COMBINATION.md
@@ -0,0 +1,563 @@
+# DedupFs + S3Vfs Combination Design
+
+**Date**: 2026-06-25
+**Status**: Design proposal
+**Goal**: Distributed deduplication storage via MinIO/S3 backend
+
+---
+
+## Executive Summary
+
+### Current State
+
+**DedupStore** (`dedup.rs`, 224 lines):
+- Dedup storage backed by the **local filesystem**
+- SHA-256 block hashes + reference counts
+- Blocks stored in a local directory (`store_path/.dedup/`)
+
+**Problems**:
+- ❌ Dedup blocks cannot be shared across nodes
+- ❌ No distributed fault tolerance
+- ❌ Limited to single-node storage
+
+### Proposed Solution
+
+**DedupS3Store**:
+- Blocks stored as **MinIO/S3** objects (shared across nodes)
+- Reference counts stored in S3 object metadata
+- Manifests stored as S3 objects (JSON)
+
+**Advantages**:
+- ✅ Cross-node dedup sharing (MinIO distributed mode)
+- ✅ Automatic fault tolerance (MinIO erasure coding)
+- ✅ No single-node limit (MinIO scales out)
+- ✅ Integrates with the existing S3Vfs (no new HTTP API)
+
+---
+
+## Architecture
+
+```
+┌────────────────────────────────────────────────────────────┐
+│ MarkBase Node A                                            │
+│ └── DedupS3Store                                           │
+│     ├── store_block() → S3 PUT                             │
+│     ├── get_block() → S3 GET                               │
+│     └── dedup_file() → chunking + S3 PUT + manifest        │
+└────────────────────────────────────────────────────────────┘
+                              ↓
+┌────────────────────────────────────────────────────────────┐
+│ MinIO Cluster 
(S3-compatible)                              │
+│ ├── Bucket: markbase-dedup                                 │
+│ │   ├── Objects: blocks/<sha256> (dedup blocks)            │
+│ │   ├── Metadata: x-amz-meta-ref-count (reference count)   │
+│ │   └── Manifests: manifests/<file-id>.json                │
+│ │                                                          │
+│ ├── Erasure Coding: EC:2 (automatic fault tolerance)       │
+│ └── Replication: Node A → Node B (DR)                      │
+└────────────────────────────────────────────────────────────┘
+                              ↓
+┌────────────────────────────────────────────────────────────┐
+│ MarkBase Node B                                            │
+│ └── DedupS3Store                                           │
+│     ├── get_block() → S3 GET (shares Node A's blocks)      │
+│     └── restore_file() → S3 GET manifest + S3 GET blocks   │
+└────────────────────────────────────────────────────────────┘
+```
+
+---
+
+## Implementation Design
+
+### DedupS3Store Struct
+
+```rust
+pub struct DedupS3Store {
+    s3vfs: S3Vfs,               // S3 backend
+    bucket: String,             // Bucket name (markbase-dedup)
+    block_prefix: String,       // Object key prefix (blocks/)
+    manifest_prefix: String,    // Manifest prefix (manifests/)
+    config: VfsDedupConfig,     // block_size, min_file_size
+}
+
+pub struct DedupManifest {
+    original_size: usize,
+    block_hashes: Vec<String>,
+    dedup_ratio: f64,
+    file_id: String,            // UUID for manifest storage
+}
+```
+
+### Core Methods
+
+| Method | Current (LocalFs) | Proposed (S3Vfs) |
+|--------|------------------|------------------|
+| `store_block(data)` | `std::fs::write(store_path/hash, data)` | `S3Vfs.put_object(blocks/hash, data)` |
+| `get_block(hash)` | `std::fs::read(store_path/hash)` | `S3Vfs.get_object(blocks/hash)` |
+| `increment_ref(hash)` | `std::fs::write(hash.ref, count)` | `S3Vfs.put_object(blocks/hash, data) + metadata update` |
+| `decrement_ref(hash)` | `std::fs::write/remove` | `S3Vfs.delete_object + metadata check` |
+| `dedup_file(source)` | Local file read + block store | Local file read + S3 PUT blocks |
+| `restore_file(manifest)` | Local file write + block read | Local file write + S3 GET blocks |
+| `get_ref_count(hash)` | `std::fs::read(hash.ref)` | `S3Vfs.head_object(blocks/hash) → metadata` |
+
+---
+
+## S3 Object 
Layout
+
+```
+Bucket: markbase-dedup
+├── blocks/
+│   ├── <sha256-hash-1>  # Dedup block (4KB)
+│   │   └── Metadata: x-amz-meta-ref-count: 5
+│   ├── <sha256-hash-2>
+│   │   └── Metadata: x-amz-meta-ref-count: 2
+│   └── ...
+│
+├── manifests/
+│   ├── <file-id-1>.json  # Manifest JSON
+│   │   └── Content: {"original_size": 1024, "block_hashes": [...], ...}
+│   ├── <file-id-2>.json
+│   └── ...
+│
+└── stats.json  # DedupStats (optional)
+```
+
+---
+
+## Reference Count Management
+
+### Challenge
+
+S3 objects do not support atomic increment/decrement operations.
+
+### Solution 1: Metadata Update (recommended ⭐⭐⭐⭐⭐)
+
+**Flow**:
+```rust
+fn increment_ref(&self, hash: &str) -> Result<(), VfsError> {
+    // 1. HEAD to read the current metadata
+    let head = self.s3vfs.head_object(&format!("blocks/{}", hash))?;
+    let current_ref = head.metadata.get("x-amz-meta-ref-count")
+        .and_then(|v| v.parse::<u32>().ok())
+        .unwrap_or(0);
+
+    // 2. GET the body, then PUT it back with updated metadata
+    let block_data = self.s3vfs.get_object(&format!("blocks/{}", hash))?;
+    self.s3vfs.put_object_with_metadata(
+        &format!("blocks/{}", hash),
+        &block_data,
+        [("x-amz-meta-ref-count", (current_ref + 1).to_string())]
+    )?;
+
+    Ok(())
+}
+```
+
+**Advantages**:
+- ✅ Simple to implement
+- ✅ Compatible with standard S3
+
+**Disadvantages**:
+- ⚠️ Needs several requests per update (HEAD + GET + PUT)
+- ⚠️ Not atomic (concurrency problem)
+- ⚠️ Must read the block data (PUT needs a body)
+
+---
+
+### Solution 2: Separate Ref Count Object
+
+**Flow**:
+```rust
+fn increment_ref(&self, hash: &str) -> Result<(), VfsError> {
+    // 1. GET ref count object (missing or unparsable counts as 0)
+    let ref_key = format!("refs/{}/count", hash);
+    let current = self.s3vfs.get_object(&ref_key).ok()
+        .and_then(|data| String::from_utf8(data).ok()?.trim().parse::<u32>().ok())
+        .unwrap_or(0);
+
+    // 2. PUT updated ref count
+    self.s3vfs.put_object(&ref_key, (current + 1).to_string().as_bytes())?;
+    Ok(())
+}
+```
+
+**Advantages**:
+- ✅ No need to read the block data
+- ✅ Smaller objects (just a number)
+
+**Disadvantages**:
+- ⚠️ Needs extra objects
+- ⚠️ Not atomic (concurrency problem)
+
+---
+
+### Solution 3: MinIO Extended API (enterprise edition)
+
+The MinIO enterprise edition provides `mc admin bucket policy` and an object locking API.
+
+**Advantages**:
+- ✅ May provide an atomic operation
+
+**Disadvantages**:
+- ⚠️ MinIO enterprise edition only
+- ⚠️ The specific API still needs research
+
+---
+
+## Concurrency Problem
+
+### Scenario
+
+Node A and Node B dedup the same file at the same time:
+1. 
Node A: `increment_ref(hash-abc)` → GET count=2 → PUT count=3
+2. Node B: `increment_ref(hash-abc)` → GET count=2 → PUT count=3
+3. Result: count=3 (wrong; it should be count=4)
+
+### Solution 1: Optimistic Locking
+
+Use S3 versioning to detect conflicts (versioning alone does not reject a stale PUT, so `put_object_if_version` below is a proposed helper that would need a conditional write, e.g. an `If-Match` precondition, where the backend supports one):
+```rust
+fn increment_ref(&self, hash: &str) -> Result<(), VfsError> {
+    loop {
+        // 1. GET current version + metadata
+        let (version_id, current_ref) = self.get_ref_with_version(hash)?;
+        let block_data = self.get_block(hash)?;
+        // 2. PUT with version check
+        let result = self.s3vfs.put_object_if_version(
+            &format!("blocks/{}", hash),
+            &block_data,
+            current_ref + 1,
+            version_id  // Only succeed if version unchanged
+        );
+
+        if result.is_ok() {
+            break;
+        }
+        // Retry if version mismatch
+    }
+    Ok(())
+}
+```
+
+**Requirement**: MinIO versioning enabled.
+
+---
+
+### Solution 2: Distributed Lock Service
+
+Use an external distributed lock (e.g. Redis/ZooKeeper):
+```rust
+fn increment_ref(&self, hash: &str) -> Result<(), VfsError> {
+    // 1. Acquire distributed lock
+    let lock = self.lock_service.acquire(&format!("lock:{}", hash))?;
+
+    // 2. Increment ref count
+    self.update_ref_count(hash)?;
+
+    // 3. 
Release lock
+    lock.release();
+    Ok(())
+}
+```
+
+**Disadvantage**: requires an extra service (Redis).
+
+---
+
+### Solution 3: Accept Non-Atomic (simplified approach)
+
+Given MarkBase's Lightweight positioning:
+- ⚠️ Accept the risk of non-atomic updates
+- ⚠️ Ref counts may occasionally be inaccurate (blocks stay intact as long as deletion waits for the scrub job; deleting on an undercounted ref would drop a live block)
+- ⚠️ Periodic repair (scrub job)
+
+**Recommendation**: Phase 1 uses the Metadata Update approach (Solution 1 under Reference Count Management); Phase 2 investigates MinIO versioning.
+
+---
+
+## Implementation Phases
+
+| Phase | Task | Code Lines | Priority | Risk |
+|-------|------|------------|----------|------|
+| **Phase 1** | DedupS3Store struct + basic I/O | ~300 | P0 | Medium |
+| **Phase 2** | Reference count metadata | ~100 | P0 | Medium |
+| **Phase 3** | Manifest storage to S3 | ~50 | P1 | Low |
+| **Phase 4** | CLI integration | ~100 | P1 | Low |
+| **Phase 5** | Async version (DedupAsyncS3Store) | ~200 | P2 | High |
+| **Phase 6** | Concurrency fix (versioning) | ~150 | P2 | High |
+| **Phase 7** | Performance benchmark | ~100 | P2 | Low |
+| **Total** | | **~1000** | | |
+
+---
+
+## DedupS3Store Implementation (Phase 1 Draft)
+
+```rust
+use super::s3_fs::S3Vfs;
+use super::{VfsDedupConfig, VfsError};
+use sha2::{Sha256, Digest};
+use std::{io::{Read, Write}, path::Path};
+
+pub struct DedupS3Store {
+    s3vfs: S3Vfs,
+    bucket: String,
+    block_prefix: String,
+    manifest_prefix: String,
+    config: VfsDedupConfig,
+}
+
+impl DedupS3Store {
+    pub fn new(
+        endpoint: &str,
+        region: &str,
+        bucket: &str,
+        access_key: &str,
+        secret_key: &str,
+        config: VfsDedupConfig,
+    ) -> Result<Self, VfsError> {
+        let s3vfs = S3Vfs::new(endpoint, region, bucket, access_key, secret_key)?;
+        Ok(Self {
+            s3vfs,
+            bucket: bucket.to_string(),
+            block_prefix: "blocks/".to_string(),
+            manifest_prefix: "manifests/".to_string(),
+            config,
+        })
+    }
+
+    pub fn store_block(&self, data: &[u8]) -> Result<String, VfsError> {
+        if data.len() > self.config.block_size {
+            return Err(VfsError::Io("Block size exceeds limit".to_string()));
+        }
+
+        let hash = Self::hash_block(data);
+        let key = format!("{}{}", self.block_prefix, hash);
+
+        // Check if block exists
+        if !self.s3vfs.object_exists(&key)? 
{
+            // PUT with initial ref count = 1
+            self.s3vfs.put_object_with_metadata(
+                &key,
+                data,
+                [("x-amz-meta-ref-count", "1".to_string())]
+            )?;
+        } else {
+            // Increment ref count
+            self.increment_ref(&hash)?;
+        }
+
+        Ok(hash)
+    }
+
+    pub fn get_block(&self, hash: &str) -> Result<Vec<u8>, VfsError> {
+        let key = format!("{}{}", self.block_prefix, hash);
+        self.s3vfs.get_object(&key)
+    }
+
+    pub fn increment_ref(&self, hash: &str) -> Result<(), VfsError> {
+        let key = format!("{}{}", self.block_prefix, hash);
+        let head = self.s3vfs.head_object(&key)?;
+
+        let current_ref = head.metadata
+            .get("x-amz-meta-ref-count")
+            .and_then(|v| v.parse::<u32>().ok())
+            .unwrap_or(1);
+
+        // Need to GET block data + PUT with new metadata
+        let block_data = self.get_block(hash)?;
+        self.s3vfs.put_object_with_metadata(
+            &key,
+            &block_data,
+            [("x-amz-meta-ref-count", (current_ref + 1).to_string())]
+        )?;
+
+        Ok(())
+    }
+
+    pub fn dedup_file(&self, source: &Path) -> Result<DedupManifest, VfsError> {
+        let mut file = std::fs::File::open(source)?;
+        let mut manifest = DedupManifest::new();
+        let mut buffer = vec![0u8; self.config.block_size];
+
+        loop {
+            let n = file.read(&mut buffer)?;
+            if n == 0 { break; }
+
+            manifest.original_size += n;
+            let hash = self.store_block(&buffer[..n])?;
+            manifest.block_hashes.push(hash);
+        }
+
+        // Store manifest to S3
+        let file_id = uuid::Uuid::new_v4().to_string();
+        manifest.file_id = file_id.clone();
+        let manifest_key = format!("{}{}.json", self.manifest_prefix, file_id);
+        let manifest_json = serde_json::to_string(&manifest)?;
+        self.s3vfs.put_object(&manifest_key, manifest_json.as_bytes())?;
+
+        Ok(manifest)
+    }
+
+    pub fn restore_file(&self, manifest_id: &str, target: &Path) -> Result<(), VfsError> {
+        let manifest_key = format!("{}{}.json", self.manifest_prefix, manifest_id);
+        let manifest_json = self.s3vfs.get_object(&manifest_key)?;
+        let manifest: DedupManifest = serde_json::from_slice(&manifest_json)?;
+
+        let mut file = std::fs::File::create(target)?;
+        for hash in 
&manifest.block_hashes {
+            let block = self.get_block(hash)?;
+            file.write_all(&block)?;
+        }
+
+        Ok(())
+    }
+
+    fn hash_block(data: &[u8]) -> String {
+        let mut hasher = Sha256::new();
+        hasher.update(data);
+        hex::encode(hasher.finalize())
+    }
+}
+```
+
+---
+
+## Integration with MarkBase VFS
+
+### Option 1: Standalone DedupS3Store
+
+The user drives DedupS3Store manually:
+```bash
+# CLI tool
+markbase dedup-upload --s3 --s3-endpoint http://localhost:9000 --file /data/large.iso
+markbase dedup-download --s3 --manifest-id <manifest-id> --output /data/restored.iso
+```
+
+---
+
+### Option 2: DedupVfsBackend (VfsBackend trait)
+
+Create a VfsBackend wrapper that dedups automatically:
+```rust
+pub struct DedupS3Backend {
+    dedup_store: DedupS3Store,
+    manifest_dir: PathBuf,  // Local cache for manifests
+}
+
+impl VfsBackend for DedupS3Backend {
+    fn open_file(&self, path: &Path, flags: &OpenFlags) -> Result<Box<dyn VfsFile>, VfsError> { // `VfsFile`: assumed trait name
+        // 1. Read manifest from S3
+        let manifest = self.load_manifest(path)?;
+
+        // 2. DedupS3File (read blocks from S3)
+        Ok(Box::new(DedupS3File::new(self.dedup_store.clone(), manifest)))
+    }
+
+    fn stat(&self, path: &Path) -> Result<VfsStat, VfsError> {
+        // Read from manifest metadata
+        let manifest = self.load_manifest(path)?;
+        Ok(VfsStat {
+            size: manifest.original_size,
+            mtime: manifest.mtime, // needs an `mtime` field added to DedupManifest
+            ... 
+        })
+    }
+
+    fn read_dir(&self, path: &Path) -> Result<Vec<VfsDirEntry>, VfsError> { // `VfsDirEntry`: assumed entry type name
+        // List manifests from S3
+        self.dedup_store.s3vfs.list_objects(&self.dedup_store.manifest_prefix)
+    }
+}
+```
+
+**Advantages**:
+- ✅ Transparent dedup (no user involvement)
+- ✅ Seamless integration with SMB/WebDAV/SFTP
+
+---
+
+### Option 3: Hybrid (LocalFs + DedupS3Store)
+
+```rust
+pub struct HybridDedupBackend {
+    local: LocalFs,          // Small files (<1MB) stay local
+    dedup_s3: DedupS3Store,  // Large files (>=1MB) dedup to S3
+}
+
+impl VfsBackend for HybridDedupBackend {
+    fn open_file(&self, path: &Path, flags: &OpenFlags) -> Result<Box<dyn VfsFile>, VfsError> { // `VfsFile`: assumed trait name
+        // Check file size
+        let stat = self.local.stat(path)?;
+
+        if stat.size < self.dedup_s3.config.min_file_size {
+            // Small file: direct LocalFs
+            self.local.open_file(path, flags)
+        } else {
+            // Large file: dedup to S3
+            self.dedup_s3.dedup_file(path)?;
+            self.dedup_s3.open_file_from_manifest(path)
+        }
+    }
+}
+```
+
+**Recommendation**: Option 1 (Phase 1), Option 3 (Phase 2).
+
+---
+
+## Performance Considerations
+
+### Network Latency
+
+| Operation | LocalFs | S3Vfs | Overhead |
+|-----------|---------|-------|----------|
+| store_block (4KB) | ~0.1ms | ~5-10ms (HTTP) | ~50-100x |
+| get_block (4KB) | ~0.1ms | ~5-10ms (HTTP) | ~50-100x |
+| dedup_file (100MB) | ~2s (50MB/s) | ~10s (10MB/s) | ~5x |
+
+**Mitigations**:
+- ✅ Async concurrent upload (4-8 requests in flight)
+- ✅ ReadCache (64MB cache)
+- ✅ Local cache for hot blocks
+
+---
+
+### Dedup Ratio Impact
+
+| File Type | Dedup Ratio | Network Traffic Saved |
+|-----------|-------------|----------------------|
+| VM images (similar OS) | ~80% | -80% upload bandwidth |
+| Log files (daily) | ~60% | -60% upload bandwidth |
+| Unique files (photos) | ~5% | -5% upload bandwidth |
+
+---
+
+## Next Steps
+
+1. **Phase 1 Implementation** (~300 lines)
+   - `DedupS3Store` struct
+   - `store_block()` / `get_block()` via S3Vfs
+   - `increment_ref()` with metadata update
+
+2. **Phase 2 CLI Integration** (~100 lines)
+   - `markbase dedup-upload --s3`
+   - `markbase dedup-download --manifest-id`
+
+3. 
**Phase 3 Performance Test**
+   - Benchmark dedup_file (100MB)
+   - Compare LocalFs vs S3Vfs
+
+---
+
+## Open Questions
+
+1. **Concurrency**: Accept non-atomic ref count vs implement versioning?
+2. **Backend choice**: Standalone CLI vs VfsBackend integration?
+3. **MinIO versioning**: Should we require MinIO versioning enabled?
+4. **Ref count object**: Metadata vs separate object?
+5. **Block cache**: Should we cache blocks locally?
+
+---
+
+**Document created**: 2026-06-25
+**Last updated**: 2026-06-25
\ No newline at end of file
diff --git a/docs/MINIO_INTEGRATION.md b/docs/MINIO_INTEGRATION.md
new file mode 100644
index 0000000..a40f6c4
--- /dev/null
+++ b/docs/MINIO_INTEGRATION.md
@@ -0,0 +1,382 @@
+# MinIO Integration Guide for MarkBase
+
+**Date**: 2026-06-25
+**Status**: Ready for deployment
+**Backend**: S3Vfs (already implemented; no code changes needed)
+
+---
+
+## Executive Summary
+
+MinIO is a high-performance, S3-compatible object storage service that fits MarkBase's positioning well:
+- ✅ Cross-platform support (macOS/Linux/Windows)
+- ✅ Lightweight deployment (a single node is enough)
+- ✅ Already supported by S3Vfs (no code changes needed)
+- ✅ High performance (erasure coding + distributed scaling)
+
+---
+
+## MinIO vs Ceph RADOS Comparison
+
+| Aspect | MinIO | Ceph RADOS |
+|--------|-------|------------|
+| **Platform** | ✅ All platforms | ❌ Linux-only |
+| **Deployment** | ⚠️⚠️ A single node is enough | ⚠️⚠️⚠️⚠️⚠️ Needs a full cluster |
+| **API** | ✅ S3-compatible HTTP | ❌ librados FFI |
+| **Code change** | ✅ 0 lines (S3Vfs exists) | ❌ ~1450 lines |
+| **Positioning** | ⭐⭐⭐⭐⭐ Full match | ❌ Does not fit the Lightweight positioning |
+
+---
+
+## MinIO Deployment
+
+### macOS single-node deployment
+
+```bash
+# Install MinIO
+brew install minio/stable/minio
+
+# Start the MinIO server
+minio server /path/to/data --console-address ":9001"
+
+# Output:
+# Endpoint: http://192.168.1.100:9000 http://127.0.0.1:9000
+# Console: http://192.168.1.100:9001 http://127.0.0.1:9001
+# AccessKey: minioadmin
+# SecretKey: minioadmin
+```
+
+### Linux production deployment
+
+```bash
+# Docker single node
+docker run -d \
+  --name minio \
+  -p 9000:9000 \
+  -p 9001:9001 \
+  -v /data/minio:/data \
+  minio/minio server /data --console-address ":9001"
+
+# Distributed cluster (4 nodes)
+docker run -d \
+  --name minio \
+  -p 9000:9000 \
+  -p 
9001:9001 \
+  -v /data1:/data1 \
+  -v /data2:/data2 \
+  minio/minio server http://node1/data1 http://node2/data2 http://node3/data1 http://node4/data2 --console-address ":9001"
+```
+
+### Kubernetes deployment (recommended for production)
+
+```yaml
+# minio-deployment.yaml (illustrative; stable pod hostnames and persistent volumes need a StatefulSet with PVCs)
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: minio
+spec:
+  replicas: 4
+  selector:
+    matchLabels:
+      app: minio
+  template:
+    metadata:
+      labels:
+        app: minio
+    spec:
+      containers:
+      - name: minio
+        image: minio/minio:latest
+        args:
+        - server
+        - http://minio-{0...3}/data
+        - --console-address
+        - ":9001"
+        ports:
+        - containerPort: 9000
+        - containerPort: 9001
+        volumeMounts:
+        - name: data
+          mountPath: /data
+      volumes:
+      - name: data
+        emptyDir: {}
+```
+
+---
+
+## MarkBase S3Vfs Integration
+
+### Configuration
+
+**Environment variables**:
+```bash
+export MB_S3_ENDPOINT=http://localhost:9000
+export MB_S3_REGION=us-east-1
+export MB_S3_BUCKET=markbase
+export MB_S3_ACCESS_KEY=minioadmin
+export MB_S3_SECRET_KEY=minioadmin
+```
+
+**Config file** (`config/s3.toml`):
+```toml
+[s3]
+enabled = true
+endpoint = "http://localhost:9000"
+region = "us-east-1"
+bucket = "markbase"
+access_key = "minioadmin"
+secret_key = "minioadmin"
+
+[s3.webdav]
+# WebDAV uses the S3 backend
+enabled = true
+user = "demo"
+root_prefix = "webdav/"
+```
+
+### S3Vfs usage examples
+
+**WebDAV + MinIO**:
+```bash
+# Start the WebDAV server (MinIO backend)
+cargo run -- webdav-start \
+  --user demo \
+  --port 8002 \
+  --s3 \
+  --s3-endpoint http://localhost:9000 \
+  --s3-bucket markbase \
+  --s3-access-key minioadmin \
+  --s3-secret-key minioadmin \
+  --s3-region us-east-1 \
+  --root webdav/
+```
+
+**SMB + MinIO** (via the VFS backend):
+```bash
+# Start the SMB server (MinIO backend)
+cargo run --features smb-server -- smb-start \
+  --port 4445 \
+  --share-name files \
+  --s3 \
+  --s3-endpoint http://localhost:9000 \
+  --s3-bucket markbase \
+  --s3-access-key minioadmin \
+  --s3-secret-key minioadmin \
+  --s3-region us-east-1 \
+  --root smb/
+```
+
+---
+
+## MinIO Bucket Management
+
+### Create a bucket
+
+```bash
+# Using the MinIO client (mc)
+mc alias set myminio http://localhost:9000 minioadmin minioadmin
+mc mb myminio/markbase
+
+# Using the AWS CLI
+aws --endpoint-url http://localhost:9000 s3 mb s3://markbase
+```
+
+### Set a bucket policy
+
+```bash
+# Public-read policy (for public shares)
+mc anonymous set download myminio/markbase/public
+
+# Private policy (default)
+mc anonymous set none myminio/markbase/private
+```
+
+### Set a bucket quota
+
+```bash
+# Set quota (MinIO enterprise edition feature)
+mc admin bucket quota myminio/markbase 10GB
+```
+
+---
+
+## MinIO Features Relevant to MarkBase
+
+| Feature | Description | MarkBase Use Case |
+|---------|-------------|-------------------|
+| **Erasure Coding** | Data redundancy (default EC:2) | Automatic fault tolerance, similar to RAID |
+| **Versioning** | Object version control | Can substitute for the Snapshot feature |
+| **Bucket Policy** | ACL management | User permission control |
+| **Lifecycle Rules** | Automatic expiry | Cleaning up old backups |
+| **Object Lock** | WORM mode | Compliance backup protection |
+| **Replication** | Cross-site replication | Disaster recovery |
+
+### Versioning (substitute for Snapshot)
+
+```bash
+# Enable versioning
+mc version enable myminio/markbase
+
+# List object versions
+mc ls --versions myminio/markbase/file.txt
+
+# Restore an old version
+mc cp --version-id <version-id> myminio/markbase/file.txt myminio/markbase/file.txt
+```
+
+### Lifecycle Rules (backup cleanup)
+
+```bash
+# Delete automatically after 30 days
+mc ilm add myminio/markbase --expire-days 30
+```
+
+---
+
+## Performance Optimization
+
+### MinIO performance parameters
+
+```bash
+# High-performance configuration (flags unverified; check `minio server --help` for the installed release)
+minio server /data \
+  --console-address ":9001" \
+  --parallel 8 \
+  --cache /cache:1000
+```
+
+### S3Vfs performance optimization
+
+**Concurrent upload** (already implemented in S3Vfs):
+- Multipart upload (objects larger than 5MB are split automatically)
+- Parts uploaded concurrently (4 in flight by default)
+
+**Caching**:
+- ReadCache: 64MB, 64KB blocks, 5min TTL (already implemented in cache.rs)
+- WriteCache: 32MB (already implemented in cache.rs)
+
+---
+
+## Docker Compose Example
+
+```yaml
+version: '3'
+services:
+  minio:
+    image: minio/minio:latest
+    command: server /data --console-address ":9001"
+    ports:
+      - "9000:9000"
+      - "9001:9001"
+    volumes:
+      - minio-data:/data
+    environment:
+      - MINIO_ROOT_USER=minioadmin
+      - 
MINIO_ROOT_PASSWORD=minioadmin
+
+  markbase-webdav:
+    build: .
+    command: webdav-start --user demo --port 8002 --s3 --s3-endpoint http://minio:9000 --s3-bucket markbase --s3-access-key minioadmin --s3-secret-key minioadmin
+    ports:
+      - "8002:8002"
+    environment:
+      - MB_S3_ENDPOINT=http://minio:9000
+    depends_on:
+      - minio
+
+volumes:
+  minio-data:
+```
+
+---
+
+## Integration Checklist
+
+| Task | Status | Notes |
+|------|--------|-------|
+| **Deploy MinIO** | ⏳ User action | macOS/Linux/Docker |
+| **Create bucket** | ⏳ User action | `mc mb myminio/markbase` |
+| **S3Vfs configuration** | ✅ Supported | No code changes needed |
+| **WebDAV + S3** | ✅ Supported | CLI flags implemented |
+| **SMB + S3** | ✅ Supported | CLI flags implemented |
+| **SFTP + S3** | ⏳ Not yet implemented | Needs an SFTP S3 backend |
+| **Backup to S3** | ✅ Supported | BackupManifest + S3Vfs |
+
+---
+
+## Troubleshooting
+
+### MinIO connection problems
+
+```bash
+# Check MinIO status
+mc admin info myminio
+
+# Check endpoint connectivity
+curl -I http://localhost:9000/minio/health/live
+```
+
+### S3Vfs errors
+
+**Common errors**:
+- `VfsError::NotFound` → The bucket or object does not exist
+- `VfsError::PermissionDenied` → Wrong access key or secret key
+- `VfsError::Io("S3 PUT failed: 403")` → The bucket policy denies writes
+
+**Debugging**:
+```bash
+# View MinIO logs
+docker logs minio
+
+# Test with mc
+mc cp test.txt myminio/markbase/test.txt
+mc ls myminio/markbase/
+```
+
+---
+
+## MinIO vs S3Vfs Feature Mapping
+
+| VfsBackend Method | MinIO S3 API | Status |
+|-------------------|--------------|--------|
+| `read_dir()` | ListObjectsV2 | ✅ |
+| `open_file()` | GetObject / PutObject | ✅ |
+| `stat()` | HeadObject | ✅ |
+| `create_dir()` | PutObject (0-byte) | ✅ |
+| `remove_dir()` | DeleteObject | ✅ |
+| `remove_file()` | DeleteObject | ✅ |
+| `rename()` | CopyObject + DeleteObject | ✅ |
+| `exists()` | HeadObject | ✅ |
+| `copy()` | CopyObject | ✅ |
+| `hard_link()` | CopyObject | ✅ |
+| `create_snapshot()` | Versioning | ⚠️ Requires versioning to be enabled |
+| `list_snapshots()` | ListObjectVersions | ⚠️ Needs implementation |
+| `set_quota()` | Bucket quota | ⚠️ MinIO enterprise edition |
+| `set_acl()` | 
Bucket policy | ⚠️ Needs implementation |
+
+---
+
+## Next Steps
+
+1. **Deploy MinIO** (user action)
+   - macOS: `brew install minio/stable/minio && minio server /data`
+   - Docker: `docker run minio/minio server /data`
+
+2. **Create a bucket** (user action)
+   - `mc alias set myminio http://localhost:9000 minioadmin minioadmin`
+   - `mc mb myminio/markbase`
+
+3. **Configure MarkBase**
+   - Set the `MB_S3_*` environment variables
+   - Or use the CLI flags `--s3 --s3-endpoint ...`
+
+4. **Test the connection**
+   - WebDAV: `curl -X PROPFIND http://localhost:8002/webdav/`
+   - SMB: `smbclient -p 4445 -L localhost`
+
+---
+
+**Document created**: 2026-06-25
+**Last updated**: 2026-06-25
\ No newline at end of file