
Ceph RADOS Integration Analysis for MarkBase

Date: 2026-06-25
Status: Shelved (does not fit MarkBase's macOS cross-platform positioning)
Library: ceph-async (4.0.5)
Constraint: Linux-only (requires librados.so symlink)


Executive Summary

Goal

Add Ceph RADOS as a VfsBackend option for distributed, highly scalable storage.

Key Findings

| Aspect | Finding |
|---|---|
| Platform | Linux-only (librados.so FFI; macOS needs Docker/VM) |
| Deployment | ⚠️ Requires a full cluster (Monitor + OSD + MGR) |
| Complexity | ⚠️⚠️⚠️⚠️⚠️ High (beyond MarkBase's lightweight positioning) |
| Positioning | Does not fit MarkBase's macOS cross-platform positioning |

Recommendation

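Shelved for now. Options to prioritize instead:

  1. MinIO: S3-compatible, already supported through S3Vfs, cross-platform
  2. Built-in distributed storage: a DedupFs + S3Vfs combination, lightweight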

Architecture Overview

┌──────────────────────────────────────────────────────────────────────┐
│ MarkBase Application Layer                                           │
│ ├── SMB Server (Port 4445)                                           │
│ ├── SFTP Server (Port 2024)                                          │
│ └── WebDAV Server (Port 11438)                                       │
└──────────────────────────────────────────────────────────────────────┘
                    ↓
┌──────────────────────────────────────────────────────────────────────┐
│ VFS Abstraction Layer (VfsBackend trait)                             │
│ ├── LocalFs      - POSIX local filesystem                            │
│ ├── S3Vfs        - S3-compatible storage (HTTP API)                  │
│ ├── SmbVfs       - SMB client backend                                │
│ ├── CephVfs      - Ceph RADOS backend (shelved)                      │
│ ├── EncryptedFs  - Encryption layer                                  │
│ ├── Compression  - ZSTD/LZ4 compression layer                        │
│ ├── DedupFs      - Block deduplication layer                         │
│ └── RaidFs       - RAID-Z emulation layer                            │
└──────────────────────────────────────────────────────────────────────┘
                    ↓
┌──────────────────────────────────────────────────────────────────────┐
│ Ceph Storage Cluster (RADOS)                                         │
│ ├── Monitor (MON)    - Cluster map, authentication                   │
│ ├── OSD Daemons      - Object storage (data replication)             │
│ ├── Manager (MGR)    - Dashboard, telemetry                          │
│ ├── MDS (optional)   - CephFS metadata server                        │
│ └── RGW (optional)   - S3/Swift gateway                              │
└──────────────────────────────────────────────────────────────────────┘

Library Analysis

Rust Ceph Crates

| Crate | Version | Description | Platform |
|---|---|---|---|
| ceph | 3.2.5 | Official librados FFI (sync) | Linux-only |
| ceph-async | 4.0.5 | Async librados FFI (futures 0.3) | Linux-only |
| ceph-rbd | 0.3.2 | RADOS Block Device bindings | Linux-only |

ceph-async Module Structure

ceph_async::
├── CephClient         — Admin operations (OSD/Pool/Mon commands)
├── rados::            — Low-level FFI bindings (100+ functions)
│   ├── rados_read/write/stat/remove       — Object I/O
│   ├── rados_pool_create/delete/lookup    — Pool management
│   ├── rados_ioctx_*                      — I/O context (pool handle)
│   ├── rados_snap_*                       — Snapshot management
│   ├── rados_lock_*                       — Distributed locking
│   ├── rados_aio_*                        — Async I/O
│   ├── rados_omap_*                       — Key-value store per object
│   └── rados_write_op_* / rados_read_op_* — Compound operations
├── completion::       — Async completion handling
├── read_stream::      — Async read stream
├── write_sink::       — Async write sink
└── list_stream::      — Async object listing

CephClient API

let client = CephClient::new("admin", "/etc/ceph/ceph.conf")?;

// OSD operations
client.osd_tree()?;                // Get OSD tree (CRUSH map)
client.osd_out(osd_id)?;           // Mark OSD out
client.osd_crush_remove(osd_id)?;  // Remove from CRUSH map

// Pool operations
client.osd_pool_get(pool, option)?;  // Get pool config
client.osd_pool_set(pool, key, val)?; // Set pool config
client.osd_pool_quota_get(pool)?;     // Get pool quota

// Cluster status
client.status()?;     // Cluster health
client.mon_dump()?;   // Monitor list
client.version()?;    // Ceph version

Implementation Phases

| Phase | Task | Code Lines | Priority | Risk | Dependencies |
|---|---|---|---|---|---|
| Phase 1 | CephVfs struct + basic I/O | ~400 | P0 | Medium ⚠️⚠️⚠️ | ceph-async crate |
| Phase 2 | Pool management CLI | ~150 | P1 | Low ⚠️ | Phase 1 |
| Phase 3 | Snapshot support | ~200 | P2 | Medium ⚠️⚠️⚠️ | librados snap API |
| Phase 4 | Distributed locking | ~100 | P2 | Medium ⚠️⚠️⚠️ | librados lock API |
| Phase 5 | OMAP key-value | ~150 | P3 | Low ⚠️ | librados omap API |
| Phase 6 | Async integration | ~300 | P1 | High ⚠️⚠️⚠️⚠️ | async-vfs feature |
| Phase 7 | Docker test environment | ~50 | P0 | Low ⚠️ | Docker compose |
| Phase 8 | Performance benchmark | ~100 | P2 | Low ⚠️ | Benchmark scripts |
| Total | | ~1450 | | | |

Phase 1: CephVfs Core Implementation

Key Design Decisions

1. Object vs File mapping:

  • RADOS is object storage (no directories)
  • Path /foo/bar.txt → Object foo/bar.txt in pool
  • Directories simulated via zero-byte objects with / suffix (like S3)

2. Pool-per-share vs single pool:

  • Option A: Single pool + path prefix (simpler, less isolation)
  • Option B: Pool-per-share (better isolation, quota per pool)
  • Recommend: Option B (pool-per-share) for enterprise use

3. I/O context caching:

  • Each pool requires separate rados_ioctx_t
  • Cache ioctx per share to avoid recreation overhead
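The path-to-object mapping in decision 1 can be sketched in plain Rust. This is a minimal illustration, not MarkBase code: the helper names `object_name`, `dir_marker`, and `is_dir_marker` are hypothetical, and rejecting `..` segments is an assumption added here so a path cannot escape the share prefix.

```rust
/// Maps a VFS path such as "/foo/bar.txt" to a flat object name such as
/// "foo/bar.txt", optionally under a root prefix. Hypothetical helper.
/// Returns None when the path contains a ".." segment (assumed policy).
pub fn object_name(root_prefix: &str, vfs_path: &str) -> Option<String> {
    let mut parts = Vec::new();
    for seg in root_prefix.split('/').chain(vfs_path.split('/')) {
        match seg {
            "" | "." => continue, // drop leading, trailing, and doubled slashes
            ".." => return None,  // refuse to climb out of the prefix
            s => parts.push(s),
        }
    }
    Some(parts.join("/"))
}

/// Name of the zero-byte object that stands in for a directory:
/// the object name with a trailing '/' (the S3-style convention).
pub fn dir_marker(root_prefix: &str, vfs_path: &str) -> Option<String> {
    let name = object_name(root_prefix, vfs_path)?;
    Some(if name.is_empty() { name } else { format!("{}/", name) })
}

/// True if an object name denotes a directory marker.
pub fn is_dir_marker(object: &str) -> bool {
    object.ends_with('/')
}
```

With pool-per-share (Option B) the prefix would typically be empty; with a single shared pool (Option A) the share name could serve as the prefix.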

CephVfs Struct (Draft)

pub struct CephVfs {
    cluster: rados_t,            // RADOS cluster handle
    pool_name: String,           // Pool name for this share
    ioctx: rados_ioctx_t,        // I/O context (cached)
    root_prefix: String,         // Path prefix within pool
}

pub struct CephVfsFile {
    ioctx: rados_ioctx_t,
    object_id: String,           // Object name in pool
    position: u64,
    write_buffer: Vec<u8>,       // Buffer for writes (flush on close)
    size: u64,
}
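The "flush on close" behaviour noted on `write_buffer` could look roughly like the sketch below. It assumes a hypothetical `ObjectStore` trait in place of the real librados calls, with an in-memory `MemStore` so the example runs without a cluster; `BufferedFile` mirrors the draft `CephVfsFile` fields but is not the actual implementation.

```rust
use std::collections::HashMap;
use std::io;

/// Hypothetical stand-in for the whole-object calls a RADOS-backed file needs.
pub trait ObjectStore {
    fn write_full(&mut self, oid: &str, data: &[u8]) -> io::Result<()>;
    fn read_full(&self, oid: &str) -> io::Result<Vec<u8>>;
}

/// In-memory store used only to exercise the sketch.
#[derive(Default)]
pub struct MemStore {
    objects: HashMap<String, Vec<u8>>,
}

impl ObjectStore for MemStore {
    fn write_full(&mut self, oid: &str, data: &[u8]) -> io::Result<()> {
        self.objects.insert(oid.to_string(), data.to_vec());
        Ok(())
    }
    fn read_full(&self, oid: &str) -> io::Result<Vec<u8>> {
        self.objects
            .get(oid)
            .cloned()
            .ok_or_else(|| io::Error::new(io::ErrorKind::NotFound, "no such object"))
    }
}

/// File handle that buffers all writes and stores the object once, on close.
pub struct BufferedFile {
    object_id: String,
    position: u64,
    write_buffer: Vec<u8>,
    dirty: bool,
}

impl BufferedFile {
    /// Opens an object; a missing object starts as an empty buffer.
    pub fn open(store: &impl ObjectStore, oid: &str) -> io::Result<Self> {
        let write_buffer = match store.read_full(oid) {
            Ok(data) => data,
            Err(e) if e.kind() == io::ErrorKind::NotFound => Vec::new(),
            Err(e) => return Err(e),
        };
        Ok(Self { object_id: oid.to_string(), position: 0, write_buffer, dirty: false })
    }

    pub fn seek(&mut self, pos: u64) {
        self.position = pos;
    }

    /// Writes at the current position, zero-filling any gap past the end.
    pub fn write(&mut self, data: &[u8]) {
        let start = self.position as usize;
        let end = start + data.len();
        if self.write_buffer.len() < end {
            self.write_buffer.resize(end, 0);
        }
        self.write_buffer[start..end].copy_from_slice(data);
        self.position = end as u64;
        self.dirty = true;
    }

    /// Flushes the buffer as one full-object write if anything changed.
    pub fn close(self, store: &mut impl ObjectStore) -> io::Result<()> {
        if self.dirty {
            store.write_full(&self.object_id, &self.write_buffer)?;
        }
        Ok(())
    }
}
```

The trade-off is that the whole object is held in memory and nothing is durable until `close`, which is simple for small files but would need chunked writes for large ones.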

VfsBackend Method Mapping

| Method | RADOS equivalent | Complexity |
|---|---|---|
| read_dir() | rados_nobjects_list_* | High (pagination) |
| open_file() | Custom (object ops) | Medium |
| stat() | rados_stat() | Low |
| create_dir() | rados_write_full(0-byte) | Low |
| remove_dir() | rados_remove() | Low |
| remove_file() | rados_remove() | Low |
| rename() | Custom (copy + delete) | Medium |
| exists() | rados_stat() | Low |
| copy() | rados_clone_range() | Low |
| hard_link() | rados_clone_range() | Low |
| read_link() | Unsupported | N/A |
| create_symlink() | Unsupported | N/A |
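Because RADOS has no directories, `read_dir()` has to derive one level of a tree from flat object names. The sketch below shows only that derivation, assuming the names have already been collected from the pool listing (pagination is left out); `Entry` and this `read_dir` function are illustrative names, not the VfsBackend signature.

```rust
use std::collections::BTreeSet;

/// One directory entry derived from flat object names.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
pub enum Entry {
    Dir(String),
    File(String),
}

/// Lists the immediate children of `dir` from a flat set of object names.
/// `dir` must be "" (pool root) or end with '/', e.g. "docs/".
/// Result is deduplicated and sorted: directories first, then files.
pub fn read_dir<'a>(objects: impl IntoIterator<Item = &'a str>, dir: &str) -> Vec<Entry> {
    let mut out = BTreeSet::new();
    for name in objects {
        let rest = match name.strip_prefix(dir) {
            Some(r) if !r.is_empty() => r,
            _ => continue, // outside `dir`, or the marker object of `dir` itself
        };
        match rest.find('/') {
            // Anything containing a further '/' lives in (or is) a subdirectory.
            Some(i) => out.insert(Entry::Dir(rest[..i].to_string())),
            None => out.insert(Entry::File(rest.to_string())),
        };
    }
    out.into_iter().collect()
}
```

A subdirectory shows up whether or not its zero-byte marker object exists, since any deeper object name implies it.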

Risk Assessment

| Risk | Level | Mitigation |
|---|---|---|
| Linux-only | ⚠️⚠️⚠️⚠️⚠️ Critical | Docker/VM for macOS; does not fit the cross-platform positioning |
| librados.so symlink | ⚠️⚠️⚠️ Medium | Document setup; CI check |
| Pool-level snapshots | ⚠️⚠️ Low | Document limitation; consider RGW |
| Async overhead | ⚠️⚠️⚠️ Medium | Benchmark; spawn_blocking wrapper |
| Cluster complexity | ⚠️⚠️⚠️⚠️⚠️ Critical | Beyond the lightweight positioning; Docker compose |
| SMB Oplocks integration | ⚠️⚠️⚠️ Medium | RADOS locking API; careful design |

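Alternatives (Recommended Options)

Comparison

| Option | Cross-platform | Deployment complexity | Positioning fit | Status |
|---|---|---|---|---|
| Ceph RADOS | Linux-only | ⚠️⚠️⚠️⚠️⚠️ Very high | No match | Shelved |
| Ceph RGW (S3) | HTTP API | ⚠️⚠️⚠️⚠️ | Medium | S3Vfs already exists |
| MinIO | All platforms | ⚠️⚠️ | Full match | S3Vfs already exists |
| GlusterFS | POSIX | ⚠️⚠️⚠️ | | Needs investigation |
| Built-in distributed storage | All platforms | ⚠️⚠️ | Full match | Foundation already exists |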

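Option 1: MinIO (Recommended)

Advantages:

  • S3-compatible API: S3Vfs already exists, so no new code is needed
  • Single-node deployment (lightweight)
  • Cross-platform: macOS/Linux/Windows
  • High performance (erasure coding)
  • Open source, with an enterprise edition

Deployment:

# macOS, single node
minio server /data --console-address ":9001"

# MarkBase configuration
MB_S3_ENDPOINT=http://localhost:9000
MB_S3_BUCKET=markbase

Integration: no code changes are needed; S3Vfs already supports this.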


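Option 2: Built-in Distributed Storage

Existing foundation:

| Feature | File | Distributed potential |
|---|---|---|
| DedupFs | dedup.rs | SHA-256 block store; blocks can be shared across nodes |
| RaidFs | raid.rs | ⚠️ Single-node RAID-Z |
| Send-Receive | send_receive.rs | ⚠️ Similar to ZFS send/receive |
| Checksum | checksum.rs | Data integrity verification |
| Compression | compression.rs | ZSTD compression |

Extension directions:

  1. DedupFs + S3Vfs: store dedup blocks in MinIO/S3 so they can be shared across nodes
  2. Checksum + Replication: add cross-node replication
  3. Send-Receive + Remote: add remote replication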
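The first direction (dedup blocks kept in a shared S3-style bucket) rests on content-addressed keys: identical blocks hash to the same key, so every node writing to the bucket stores each block once. The sketch below is illustrative only. The `Bucket` map stands in for MinIO/S3, all names are hypothetical, and std's `DefaultHasher` replaces SHA-256 purely so the example runs without dependencies; it is not collision-resistant and is not suitable for real deduplication.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Stand-in for a shared S3/MinIO bucket: object key -> block bytes.
pub type Bucket = HashMap<String, Vec<u8>>;

/// Content-derived key for a block. ASSUMPTION: DefaultHasher is a
/// placeholder for the SHA-256 digest that DedupFs is described as using.
fn block_key(block: &[u8]) -> String {
    let mut h = DefaultHasher::new();
    block.hash(&mut h);
    format!("blocks/{:016x}", h.finish())
}

/// Splits `data` into fixed-size blocks, uploads each unseen block once,
/// and returns the ordered list of keys (the file's manifest).
/// `block_size` must be greater than zero.
pub fn put_file(bucket: &mut Bucket, data: &[u8], block_size: usize) -> Vec<String> {
    data.chunks(block_size)
        .map(|block| {
            let key = block_key(block);
            bucket.entry(key.clone()).or_insert_with(|| block.to_vec());
            key
        })
        .collect()
}

/// Reassembles a file from its manifest; None if any block is missing.
pub fn get_file(bucket: &Bucket, manifest: &[String]) -> Option<Vec<u8>> {
    let mut out = Vec::new();
    for key in manifest {
        out.extend_from_slice(bucket.get(key)?);
    }
    Some(out)
}
```

Two nodes that upload files sharing a block end up referencing the same key, which is what makes the block store shareable across nodes.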

Technical Details

librados API Functions

Object I/O:

  • rados_read(ioctx, oid, buf, len, offset) — Read at offset
  • rados_write(ioctx, oid, buf, len, offset) — Write at offset
  • rados_write_full(ioctx, oid, buf, len) — Write entire object
  • rados_append(ioctx, oid, buf, len) — Append to object
  • rados_stat(ioctx, oid, psize, pmtime) — Get object size/mtime
  • rados_remove(ioctx, oid) — Delete object

Pool Operations:

  • rados_pool_create(cluster, pool_name) — Create pool
  • rados_pool_delete(cluster, pool_name) — Delete pool
  • rados_pool_lookup(cluster, pool_name) — Find pool ID
  • rados_ioctx_create(cluster, pool_name, ioctx) — Create I/O context

Snapshots:

  • rados_ioctx_snap_create(ioctx, snap_name): Create pool snapshot
  • rados_ioctx_snap_list(ioctx, snaps, maxlen): List snapshot IDs
  • rados_ioctx_snap_remove(ioctx, snap_name): Delete snapshot
  • rados_ioctx_snap_rollback(ioctx, oid, snap_name): Roll back an object to a snapshot

Locking:

  • rados_lock_exclusive(ioctx, oid, name, cookie, desc, duration, flags) — Exclusive lock
  • rados_lock_shared(ioctx, oid, name, cookie, tag, desc, duration, flags) — Shared lock
  • rados_unlock(ioctx, oid, name, cookie) — Release lock
  • rados_list_lockers(ioctx, oid, name, ...) — List lock holders

OMAP (Key-Value):

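  • rados_write_op_omap_set(write_op, keys, vals, lens, num): Set key-value pairs
  • rados_read_op_omap_get_vals_by_keys(read_op, keys, ...): Get values by keys
  • rados_read_op_omap_get_keys(read_op, ...): List keys
  • rados_write_op_omap_rm_keys(write_op, keys, keys_len): Delete keys

In the librados C API, OMAP calls are steps of a compound operation: they are queued on a write_op or read_op and applied to an object with rados_write_op_operate or rados_read_op_operate, rather than taking an ioctx and oid directly.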

Async I/O:

  • rados_aio_read(ioctx, oid, completion, buf, len, offset) — Async read
  • rados_aio_write(ioctx, oid, completion, buf, len, offset) — Async write
  • rados_aio_flush(ioctx) — Flush pending async ops
  • rados_aio_wait_for_complete(completion) — Wait for completion
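The synchronous librados calls listed above report failure as a negative errno value, and rados_read returns the number of bytes read on success. A wrapper layer would likely convert those codes into `io::Result` once, in one place. The sketch below shows that conversion plus a chunked read loop. The FFI call is replaced by a closure so the example runs without librados, and `check` and `read_all` are hypothetical helper names.

```rust
use std::io;

/// Converts a librados-style return code (negative errno on failure,
/// non-negative count on success) into an io::Result.
pub fn check(ret: i32) -> io::Result<usize> {
    if ret < 0 {
        Err(io::Error::from_raw_os_error(-ret))
    } else {
        Ok(ret as usize)
    }
}

/// Reads an object to the end in `chunk`-sized pieces.
/// `read_at(buf, offset)` stands in for a call shaped like
/// rados_read(ioctx, oid, buf, len, offset) and returns its raw code.
pub fn read_all<F>(mut read_at: F, chunk: usize) -> io::Result<Vec<u8>>
where
    F: FnMut(&mut [u8], u64) -> i32,
{
    let mut out = Vec::new();
    let mut buf = vec![0u8; chunk];
    loop {
        let n = check(read_at(&mut buf, out.len() as u64))?;
        if n == 0 {
            return Ok(out); // a zero-length read marks the end of the object
        }
        out.extend_from_slice(&buf[..n]);
    }
}
```

In a real backend the closure body would be the unsafe FFI call; keeping the error mapping outside it means the unsafe block stays a single line.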

Open Questions

  1. Deployment target: Linux-only production vs macOS development?
  2. Backend choice: RADOS (librados) vs RGW (S3 API)?
  3. Pool strategy: Pool-per-share vs single pool + path prefix?
  4. SMB Oplocks: Should CephVfs support SMB Oplocks via RADOS locking?
  5. Priority: Start with basic I/O or full async integration first?

Conclusion

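Ceph RADOS integration is shelved for now, for these reasons:

  1. The Linux-only constraint does not fit the macOS cross-platform positioning
  2. ⚠️ Deployment complexity exceeds the lightweight positioning
  3. ⚠️ A full Ceph cluster is required (Monitor + OSD + MGR)

Recommended Alternatives

  1. MinIO: S3-compatible, already covered by S3Vfs, lightweight
  2. Built-in distributed storage: a DedupFs + S3Vfs combination

Next Steps

  • MinIO integration documentation (0 lines of code)
  • DedupFs + S3Vfs combination study (~100 lines)
  • Built-in replication feature (~400 lines)

Document created: 2026-06-25
Last updated: 2026-06-25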