Distributed storage research: Ceph (shelved) + MinIO guide + DedupS3 design
328
docs/CEPH_INTEGRATION_ANALYSIS.md
Normal file
@@ -0,0 +1,328 @@
# Ceph RADOS Integration Analysis for MarkBase

**Date**: 2026-06-25
**Status**: Shelved (does not fit MarkBase's macOS cross-platform positioning)
**Library**: ceph-async (4.0.5)
**Constraint**: Linux-only (requires librados.so symlink)

---

## Executive Summary

### Goal
Add Ceph RADOS as a VfsBackend option for distributed, highly scalable storage.

### Key Findings
| Aspect | Finding |
|--------|---------|
| **Platform** | ❌ Linux-only (librados.so FFI; macOS needs Docker/VM) |
| **Deployment** | ⚠️ Requires a full cluster (Monitor + OSD + MGR) |
| **Complexity** | ⚠️⚠️⚠️⚠️⚠️ High (exceeds MarkBase's lightweight positioning) |
| **Positioning** | ❌ Does not fit MarkBase's macOS cross-platform positioning |

### Recommendation
**Shelved for now.** Prioritize instead:
1. **MinIO**: S3-compatible, already supported by S3Vfs, cross-platform
2. **Built-in distributed storage**: DedupFs + S3Vfs combination, lightweight

---

## Architecture Overview

```
┌──────────────────────────────────────────────────────────────┐
│ MarkBase Application Layer                                   │
│ ├── SMB Server    (Port 4445)                                │
│ ├── SFTP Server   (Port 2024)                                │
│ └── WebDAV Server (Port 11438)                               │
└──────────────────────────────────────────────────────────────┘
                               ↓
┌──────────────────────────────────────────────────────────────┐
│ VFS Abstraction Layer (VfsBackend trait)                     │
│ ├── LocalFs     - POSIX local filesystem                     │
│ ├── S3Vfs       - S3-compatible storage (HTTP API)           │
│ ├── SmbVfs      - SMB client backend                         │
│ ├── CephVfs     - Ceph RADOS backend (shelved)               │
│ ├── EncryptedFs - Encryption layer                           │
│ ├── Compression - ZSTD/LZ4 compression layer                 │
│ ├── DedupFs     - Block deduplication layer                  │
│ └── RaidFs      - RAID-Z emulation layer                     │
└──────────────────────────────────────────────────────────────┘
                               ↓
┌──────────────────────────────────────────────────────────────┐
│ Ceph Storage Cluster (RADOS)                                 │
│ ├── Monitor (MON)  - Cluster map, authentication             │
│ ├── OSD Daemons    - Object storage (data replication)       │
│ ├── Manager (MGR)  - Dashboard, telemetry                    │
│ ├── MDS (optional) - CephFS metadata server                  │
│ └── RGW (optional) - S3/Swift gateway                        │
└──────────────────────────────────────────────────────────────┘
```

---

## Library Analysis

### Rust Ceph Crates

| Crate | Version | Description | Platform |
|-------|---------|-------------|----------|
| `ceph` | 3.2.5 | Official librados FFI (sync) | Linux-only |
| `ceph-async` | 4.0.5 | Async librados FFI (futures 0.3) | Linux-only |
| `ceph-rbd` | 0.3.2 | RADOS Block Device bindings | Linux-only |

### ceph-async Module Structure

```
ceph_async::
├── CephClient      - Admin operations (OSD/Pool/Mon commands)
├── rados::         - Low-level FFI bindings (100+ functions)
│   ├── rados_read/write/stat/remove        - Object I/O
│   ├── rados_pool_create/delete/lookup     - Pool management
│   ├── rados_ioctx_*                       - I/O context (pool handle)
│   ├── rados_snap_*                        - Snapshot management
│   ├── rados_lock_*                        - Distributed locking
│   ├── rados_aio_*                         - Async I/O
│   ├── rados_omap_*                        - Key-value store per object
│   └── rados_write_op_* / rados_read_op_*  - Compound operations
├── completion::    - Async completion handling
├── read_stream::   - Async read stream
├── write_sink::    - Async write sink
└── list_stream::   - Async object listing
```

### CephClient API

```rust
let client = CephClient::new("admin", "/etc/ceph/ceph.conf")?;

// OSD operations
client.osd_tree()?;                    // Get OSD tree (CRUSH map)
client.osd_out(osd_id)?;               // Mark OSD out
client.osd_crush_remove(osd_id)?;      // Remove from CRUSH map

// Pool operations
client.osd_pool_get(pool, option)?;    // Get pool config
client.osd_pool_set(pool, key, val)?;  // Set pool config
client.osd_pool_quota_get(pool)?;      // Get pool quota

// Cluster status
client.status()?;                      // Cluster health
client.mon_dump()?;                    // Monitor list
client.version()?;                     // Ceph version
```

---

## Implementation Phases

| Phase | Task | Code Lines | Priority | Risk | Dependencies |
|-------|------|------------|----------|------|--------------|
| **Phase 1** | CephVfs struct + basic I/O | ~400 | P0 | Medium ⚠️⚠️⚠️ | ceph-async crate |
| **Phase 2** | Pool management CLI | ~150 | P1 | Low ⚠️ | Phase 1 |
| **Phase 3** | Snapshot support | ~200 | P2 | Medium ⚠️⚠️⚠️ | librados snap API |
| **Phase 4** | Distributed locking | ~100 | P2 | Medium ⚠️⚠️⚠️ | librados lock API |
| **Phase 5** | OMAP key-value | ~150 | P3 | Low ⚠️ | librados omap API |
| **Phase 6** | Async integration | ~300 | P1 | High ⚠️⚠️⚠️⚠️ | async-vfs feature |
| **Phase 7** | Docker test environment | ~50 | P0 | Low ⚠️ | Docker compose |
| **Phase 8** | Performance benchmark | ~100 | P2 | Low ⚠️ | Benchmark scripts |
| **Total** | | **~1450** | | | |

---

## Phase 1: CephVfs Core Implementation

### Key Design Decisions

**1. Object vs File mapping**:
- RADOS is object storage (no directories)
- Path `/foo/bar.txt` → Object `foo/bar.txt` in pool
- Directories simulated via zero-byte objects with `/` suffix (like S3)
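
The path-to-object mapping above can be sketched as a pair of pure functions. This is a minimal sketch under stated assumptions: `object_name` and `dir_marker` are hypothetical helper names, `root_prefix` mirrors the field of the same name in the draft `CephVfs` struct, and `..` segments are not resolved (a real backend would have to reject or normalize them).

```rust
/// Maps a VFS path such as `/foo/bar.txt` to a flat RADOS object name.
/// Hypothetical helper for illustration; not part of ceph-async.
pub fn object_name(root_prefix: &str, path: &str) -> String {
    // Drop empty and `.` segments so `//foo/./bar` and `/foo/bar` map alike.
    // NOTE: `..` is deliberately not handled in this sketch.
    let parts: Vec<&str> = root_prefix
        .split('/')
        .chain(path.split('/'))
        .filter(|s| !s.is_empty() && *s != ".")
        .collect();
    parts.join("/")
}

/// Name of the zero-byte marker object that stands in for a directory.
/// The trailing `/` follows the S3-style convention described above.
pub fn dir_marker(root_prefix: &str, path: &str) -> String {
    let mut name = object_name(root_prefix, path);
    name.push('/');
    name
}
```

With an empty prefix, `/foo/bar.txt` maps to `foo/bar.txt` and the directory `/foo` is represented by the marker object `foo/`.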

**2. Pool-per-share vs single pool**:
- Option A: Single pool + path prefix (simpler, less isolation)
- Option B: Pool-per-share (better isolation, quota per pool)
- **Recommend**: Option B (pool-per-share) for enterprise use

**3. I/O context caching**:
- Each pool requires separate `rados_ioctx_t`
- Cache ioctx per share to avoid recreation overhead
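
One way to picture the per-share I/O context cache is a map from pool name to a shared handle. The sketch below is illustrative only: `IoCtx` is a stand-in for `rados_ioctx_t`, `IoCtxCache` is a hypothetical type, and the place where a real `rados_ioctx_create` call would go is marked in a comment.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Stand-in for a `rados_ioctx_t`; hypothetical type for illustration only.
#[derive(Debug)]
pub struct IoCtx {
    pub pool: String,
}

/// Caches one I/O context per pool so repeated opens reuse the same handle.
#[derive(Default)]
pub struct IoCtxCache {
    entries: Mutex<HashMap<String, Arc<IoCtx>>>,
    created: Mutex<usize>, // counts how many handles were actually created
}

impl IoCtxCache {
    pub fn get_or_create(&self, pool: &str) -> Arc<IoCtx> {
        let mut entries = self.entries.lock().unwrap();
        if let Some(ctx) = entries.get(pool) {
            return Arc::clone(ctx);
        }
        // Cache miss: a real backend would call rados_ioctx_create here.
        *self.created.lock().unwrap() += 1;
        let ctx = Arc::new(IoCtx { pool: pool.to_string() });
        entries.insert(pool.to_string(), Arc::clone(&ctx));
        ctx
    }

    pub fn created_count(&self) -> usize {
        *self.created.lock().unwrap()
    }
}
```

Two lookups for the same pool return the same `Arc`, so the creation cost is paid once per share rather than once per file open.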

### CephVfs Struct (Draft)

```rust
pub struct CephVfs {
    cluster: rados_t,          // RADOS cluster handle
    pool_name: String,         // Pool name for this share
    ioctx: rados_ioctx_t,      // I/O context (cached)
    root_prefix: String,       // Path prefix within pool
}

pub struct CephVfsFile {
    ioctx: rados_ioctx_t,
    object_id: String,         // Object name in pool
    position: u64,
    write_buffer: Vec<u8>,     // Buffer for writes (flush on close)
    size: u64,
}
```
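
The "flush on close" behavior of `write_buffer` in the draft `CephVfsFile` can be illustrated without a cluster. This is a rough sketch, assuming a hypothetical `ObjectStore` trait whose `write_full` plays the role of `rados_write_full`; `MemStore` and `BufferedFile` are invented names used only to exercise the idea.

```rust
use std::collections::HashMap;

/// Minimal object-store interface. Hypothetical trait, not a ceph-async API.
pub trait ObjectStore {
    fn write_full(&mut self, oid: &str, data: &[u8]);
}

/// In-memory store used only to exercise the sketch.
#[derive(Default)]
pub struct MemStore {
    pub objects: HashMap<String, Vec<u8>>,
    pub writes: usize, // number of full-object writes issued
}

impl ObjectStore for MemStore {
    fn write_full(&mut self, oid: &str, data: &[u8]) {
        self.objects.insert(oid.to_string(), data.to_vec());
        self.writes += 1;
    }
}

/// Mirrors the draft file handle: writes land in `write_buffer` and reach
/// the store only when `close` is called.
pub struct BufferedFile {
    object_id: String,
    position: u64,
    write_buffer: Vec<u8>,
}

impl BufferedFile {
    pub fn create(object_id: &str) -> Self {
        Self { object_id: object_id.to_string(), position: 0, write_buffer: Vec::new() }
    }

    /// Writes at the current position, zero-filling any gap left by a seek.
    pub fn write(&mut self, data: &[u8]) {
        let start = self.position as usize;
        let end = start + data.len();
        if self.write_buffer.len() < end {
            self.write_buffer.resize(end, 0);
        }
        self.write_buffer[start..end].copy_from_slice(data);
        self.position = end as u64;
    }

    pub fn seek(&mut self, pos: u64) {
        self.position = pos;
    }

    /// Flushes the whole buffer as a single full-object write.
    pub fn close(self, store: &mut dyn ObjectStore) {
        store.write_full(&self.object_id, &self.write_buffer);
    }
}
```

The trade-off in this design is that many small writes cost one round trip, but the whole object lives in memory until close, which is unlikely to suit very large files.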

### VfsBackend Method Mapping

| Method | RADOS equivalent | Complexity |
|--------|-----------------|------------|
| `read_dir()` | `rados_nobjects_list_*` | High (pagination) |
| `open_file()` | Custom (object ops) | Medium |
| `stat()` | `rados_stat()` | Low |
| `create_dir()` | `rados_write_full(0-byte)` | Low |
| `remove_dir()` | `rados_remove()` | Low |
| `remove_file()` | `rados_remove()` | Low |
| `rename()` | Custom (copy + delete) | Medium |
| `exists()` | `rados_stat()` | Low |
| `copy()` | `rados_clone_range()` | Low |
| `hard_link()` | `rados_clone_range()` | Low |
| `read_link()` | Unsupported | N/A |
| `create_symlink()` | Unsupported | N/A |
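
Because RADOS has a flat namespace, `read_dir()` has to be derived from an object listing. The sketch below shows one possible filtering step, assuming the object names have already been fetched (pagination is left out) and that directories use the `/`-suffixed marker convention from the design decisions. `list_dir` is a hypothetical helper, not a ceph-async function.

```rust
use std::collections::BTreeSet;

/// Lists the immediate children of `dir` given a flat set of object names.
/// Subdirectories are reported once, with a trailing `/`; results are sorted.
pub fn list_dir<'a, I>(objects: I, dir: &str) -> Vec<String>
where
    I: IntoIterator<Item = &'a str>,
{
    // Normalize "foo", "/foo" and "foo/" to the prefix "foo/"; root is "".
    let trimmed = dir.trim_matches('/');
    let prefix = if trimmed.is_empty() {
        String::new()
    } else {
        format!("{}/", trimmed)
    };

    let mut children = BTreeSet::new();
    for name in objects {
        let rest = match name.strip_prefix(prefix.as_str()) {
            Some(rest) => rest,
            None => continue, // object lives outside this directory
        };
        if rest.is_empty() {
            continue; // the directory's own marker object
        }
        match rest.find('/') {
            // Deeper path or marker: report only the first segment as a dir.
            Some(idx) => children.insert(format!("{}/", &rest[..idx])),
            None => children.insert(rest.to_string()),
        };
    }
    children.into_iter().collect()
}
```

Note that this scans every object under the prefix, which is part of why the table rates `read_dir()` as high complexity.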

---

## Risk Assessment

| Risk | Level | Mitigation |
|------|-------|------------|
| **Linux-only** | ⚠️⚠️⚠️⚠️⚠️ Critical | Docker/VM for macOS; does not fit the cross-platform positioning |
| **librados.so symlink** | ⚠️⚠️⚠️ Medium | Document setup; CI check |
| **Pool-level snapshots** | ⚠️⚠️ Low | Document limitation; consider RGW |
| **Async overhead** | ⚠️⚠️⚠️ Medium | Benchmark; spawn_blocking wrapper |
| **Cluster complexity** | ⚠️⚠️⚠️⚠️⚠️ Critical | Exceeds the lightweight positioning; Docker compose |
| **SMB Oplocks integration** | ⚠️⚠️⚠️ Medium | RADOS locking API; careful design |

---

## Alternatives (Recommended Options)

### Option Comparison

| Option | Cross-platform | Deployment complexity | Positioning fit | Status |
|--------|----------------|-----------------------|-----------------|--------|
| **Ceph RADOS** | ❌ Linux-only | ⚠️⚠️⚠️⚠️⚠️ Very high | ❌ No match | Shelved |
| **Ceph RGW (S3)** | ✅ HTTP API | ⚠️⚠️⚠️⚠️ High | ⭐⭐⭐ Medium | S3Vfs already exists |
| **MinIO** | ✅ All platforms | ⚠️⚠️ Low | ⭐⭐⭐⭐⭐ Full match | S3Vfs already exists |
| **GlusterFS** | ✅ POSIX | ⚠️⚠️⚠️ Medium | ⭐⭐⭐⭐ High | To be investigated |
| **Built-in distributed** | ✅ All platforms | ⚠️⚠️ Low | ⭐⭐⭐⭐⭐ Full match | Foundation exists |

### Option 1: MinIO (Recommended)

**Advantages**:
- ✅ S3-compatible API (S3Vfs already exists, no new code needed)
- ✅ Single-node deployment (lightweight)
- ✅ Cross-platform (macOS/Linux/Windows)
- ✅ High performance (erasure coding)
- ✅ Open source + enterprise edition

**Deployment**:
```bash
# Single node on macOS
minio server /data --console-address ":9001"

# MarkBase configuration
MB_S3_ENDPOINT=http://localhost:9000
MB_S3_BUCKET=markbase
```

**Integration**: No code changes needed; S3Vfs already supports it.
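
For readers wiring this up, the two variables from the deployment snippet could be validated at startup along the following lines. This is only a sketch: `S3Config`, `from_lookup`, and `from_env` are hypothetical names, and how MarkBase actually loads its S3 settings is not shown in this document. Only `MB_S3_ENDPOINT` and `MB_S3_BUCKET` come from the source.

```rust
use std::env;

/// Connection settings for an S3-compatible backend (hypothetical struct).
#[derive(Debug, PartialEq)]
pub struct S3Config {
    pub endpoint: String,
    pub bucket: String,
}

impl S3Config {
    /// Builds the config from any key lookup, which keeps it testable
    /// without touching the real process environment.
    pub fn from_lookup<F>(lookup: F) -> Result<Self, String>
    where
        F: Fn(&str) -> Option<String>,
    {
        // Treat unset and empty values the same way.
        let get = |key: &str| {
            lookup(key)
                .filter(|v| !v.is_empty())
                .ok_or_else(|| format!("missing {}", key))
        };
        let endpoint = get("MB_S3_ENDPOINT")?;
        if !(endpoint.starts_with("http://") || endpoint.starts_with("https://")) {
            return Err(format!("MB_S3_ENDPOINT must be an http(s) URL, got {}", endpoint));
        }
        let bucket = get("MB_S3_BUCKET")?;
        Ok(Self { endpoint, bucket })
    }

    /// Reads the process environment.
    pub fn from_env() -> Result<Self, String> {
        Self::from_lookup(|key| env::var(key).ok())
    }
}
```

Failing fast on a missing or malformed endpoint tends to give a clearer error than a later connection failure against a local MinIO instance.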

---

### Option 2: Built-in Distributed Storage

**Existing foundation**:
| Feature | File | Distributed potential |
|---------|------|-----------------------|
| DedupFs | dedup.rs | ✅ SHA-256 block store can be shared across nodes |
| RaidFs | raid.rs | ⚠️ Single-node RAID-Z |
| Send-Receive | send_receive.rs | ⚠️ Similar to ZFS send/receive |
| Checksum | checksum.rs | ✅ Data integrity verification |
| Compression | compression.rs | ✅ ZSTD compression |

**Extension directions**:
1. DedupFs + S3Vfs: Store dedup blocks in MinIO/S3 (shared across nodes)
2. Checksum + Replication: Add cross-node replication
3. Send-Receive + Remote: Add remote replication
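
The first direction, dedup blocks stored in a shared bucket, amounts to a content-addressed block store. The sketch below illustrates the idea under loud assumptions: `BlockStore` is an in-memory stand-in for a MinIO/S3 bucket, all names are hypothetical, and `DefaultHasher` replaces SHA-256 purely to avoid an external crate. A 64-bit non-cryptographic hash is not safe for real deduplication.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Content key for a block.
/// ASSUMPTION: DefaultHasher is a stand-in; DedupFs is described as SHA-256.
fn block_key(block: &[u8]) -> String {
    let mut h = DefaultHasher::new();
    block.hash(&mut h);
    format!("blocks/{:016x}", h.finish())
}

/// Flat key-value store standing in for an S3/MinIO bucket.
#[derive(Default)]
pub struct BlockStore {
    pub objects: HashMap<String, Vec<u8>>,
}

impl BlockStore {
    /// Splits `data` into fixed-size blocks, stores only unseen blocks, and
    /// returns the ordered list of block keys that reconstructs the file.
    pub fn put_file(&mut self, data: &[u8], block_size: usize) -> Vec<String> {
        assert!(block_size > 0, "block_size must be non-zero");
        data.chunks(block_size)
            .map(|block| {
                let key = block_key(block);
                // An existing key means the block is already stored: skip it.
                self.objects.entry(key.clone()).or_insert_with(|| block.to_vec());
                key
            })
            .collect()
    }

    /// Reassembles a file from its block keys; `None` if a block is missing.
    pub fn get_file(&self, keys: &[String]) -> Option<Vec<u8>> {
        let mut out = Vec::new();
        for key in keys {
            out.extend_from_slice(self.objects.get(key)?);
        }
        Some(out)
    }
}
```

Because keys depend only on block content, two nodes writing the same block would target the same object name, which is what makes the bucket shareable across nodes.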

---

## Technical Details

### librados API Functions

**Object I/O**:
- `rados_read(ioctx, oid, buf, len, offset)`: Read at offset
- `rados_write(ioctx, oid, buf, len, offset)`: Write at offset
- `rados_write_full(ioctx, oid, buf, len)`: Write entire object
- `rados_append(ioctx, oid, buf, len)`: Append to object
- `rados_stat(ioctx, oid, psize, pmtime)`: Get object size/mtime
- `rados_remove(ioctx, oid)`: Delete object

**Pool Operations**:
- `rados_pool_create(cluster, pool_name)`: Create pool
- `rados_pool_delete(cluster, pool_name)`: Delete pool
- `rados_pool_lookup(cluster, pool_name)`: Find pool ID
- `rados_ioctx_create(cluster, pool_name, ioctx)`: Create I/O context

**Snapshots**:
- `rados_ioctx_snap_create(ioctx, snap_name)`: Create pool snapshot
- `rados_ioctx_snap_list(ioctx, snaps)`: List snapshots
- `rados_ioctx_snap_remove(ioctx, snap_name)`: Delete snapshot
- `rados_ioctx_snap_rollback(ioctx, oid, snap_name)`: Roll back object

**Locking**:
- `rados_lock_exclusive(ioctx, oid, name, cookie, desc, duration, flags)`: Exclusive lock
- `rados_lock_shared(ioctx, oid, name, cookie, tag, desc, duration, flags)`: Shared lock
- `rados_unlock(ioctx, oid, name, cookie)`: Release lock
- `rados_list_lockers(ioctx, oid, name, ...)`: List lock holders
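
To reason about how these calls might back SMB oplocks, it can help to model the exclusive/shared semantics in isolation. The following is a toy in-memory model, not librados behavior: `LockTable` and `LockError` are hypothetical, and the `tag`, `desc`, `duration`, and `flags` parameters from the real signatures are left out entirely.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
pub enum LockError {
    Busy,    // another holder blocks this request
    NotHeld, // unlock by a cookie that holds nothing
}

#[derive(Default)]
struct LockState {
    exclusive: Option<String>, // cookie of the exclusive holder
    shared: Vec<String>,       // cookies of shared holders
}

/// Toy advisory lock table keyed by (object id, lock name).
#[derive(Default)]
pub struct LockTable {
    locks: HashMap<(String, String), LockState>,
}

impl LockTable {
    fn state(&mut self, oid: &str, name: &str) -> &mut LockState {
        self.locks.entry((oid.to_string(), name.to_string())).or_default()
    }

    /// Succeeds only when nobody holds the lock in any mode.
    pub fn lock_exclusive(&mut self, oid: &str, name: &str, cookie: &str) -> Result<(), LockError> {
        let st = self.state(oid, name);
        if st.exclusive.is_some() || !st.shared.is_empty() {
            return Err(LockError::Busy);
        }
        st.exclusive = Some(cookie.to_string());
        Ok(())
    }

    /// Succeeds unless an exclusive holder exists; re-locking is idempotent.
    pub fn lock_shared(&mut self, oid: &str, name: &str, cookie: &str) -> Result<(), LockError> {
        let st = self.state(oid, name);
        if st.exclusive.is_some() {
            return Err(LockError::Busy);
        }
        if !st.shared.iter().any(|c| c == cookie) {
            st.shared.push(cookie.to_string());
        }
        Ok(())
    }

    /// Releases whichever lock the cookie holds.
    pub fn unlock(&mut self, oid: &str, name: &str, cookie: &str) -> Result<(), LockError> {
        let st = self.state(oid, name);
        if st.exclusive.as_deref() == Some(cookie) {
            st.exclusive = None;
            return Ok(());
        }
        let before = st.shared.len();
        st.shared.retain(|c| c != cookie);
        if st.shared.len() < before { Ok(()) } else { Err(LockError::NotHeld) }
    }
}
```

In this model an exclusive request fails while any shared holder remains, which is roughly the situation where an SMB server would need to break existing oplocks first.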

**OMAP (Key-Value)**:
- `rados_omap_set(ioctx, oid, map)`: Set key-value pairs
- `rados_omap_get(ioctx, oid, ...)`: Get values by keys
- `rados_omap_get_keys(ioctx, oid, ...)`: List keys
- `rados_omap_rm_keys(ioctx, oid, keys)`: Delete keys

**Async I/O**:
- `rados_aio_read(ioctx, oid, completion, buf, len, offset)`: Async read
- `rados_aio_write(ioctx, oid, completion, buf, len, offset)`: Async write
- `rados_aio_flush(ioctx)`: Flush pending async ops
- `rados_aio_wait_for_complete(completion)`: Wait for completion

---

## Open Questions

1. **Deployment target**: Linux-only production vs macOS development?
2. **Backend choice**: RADOS (librados) vs RGW (S3 API)?
3. **Pool strategy**: Pool-per-share vs single pool + path prefix?
4. **SMB Oplocks**: Should CephVfs support SMB Oplocks via RADOS locking?
5. **Priority**: Start with basic I/O or full async integration first?

---

## Conclusion

**Ceph RADOS integration is shelved for now**, for these reasons:
1. ❌ The Linux-only constraint does not fit the macOS cross-platform positioning
2. ⚠️ Deployment complexity exceeds the lightweight positioning
3. ⚠️ A full Ceph cluster is required (Monitor + OSD + MGR)

**Recommended alternatives**:
1. ⭐⭐⭐⭐⭐ **MinIO**: S3-compatible, S3Vfs already exists, lightweight
2. ⭐⭐⭐⭐⭐ **Built-in distributed storage**: DedupFs + S3Vfs combination

**Next steps**:
- MinIO integration documentation (0 lines of code)
- DedupFs + S3Vfs combination study (~100 lines)
- Built-in replication feature (~400 lines)

---

**Document created**: 2026-06-25
**Last updated**: 2026-06-25