optimize the perf and support more features

2026-03-14 11:45:35 +08:00
parent 7e7ebacd9d
commit 00cfac3d24
56 changed files with 6340 additions and 1019 deletions
--- a/docs/PERFORMANCE_OPTIMIZATIONS.md
+++ b/docs/PERFORMANCE_OPTIMIZATIONS.md
@@ -0,0 +1,255 @@
+# Performance Optimizations for gotgt
+
+This document describes the performance optimizations implemented for gotgt, focusing on NUMA-aware memory allocation and io_uring backend storage support.
+
+## Overview
+
+Two major performance optimizations have been implemented:
+
+1. **NUMA-Aware Memory Allocation** - Optimizes memory access patterns on multi-socket systems
+2. **io_uring Backend Storage** - Provides high-performance asynchronous I/O on Linux 5.1+
+
+## 1. NUMA-Aware Memory Allocation
+
+### What is NUMA?
+
+Non-Uniform Memory Access (NUMA) is a memory design used in multi-processor systems where the memory access time depends on the memory location relative to the processor. Under NUMA, a processor can access its own local memory faster than non-local memory (memory local to another processor or memory shared between processors).
+
+### Implementation
+
+The NUMA support is implemented in `pkg/util/numa/`:
+
+- **Topology Detection** (`numa.go`, `numa_linux.go`): Automatically detects the NUMA topology by reading the `/sys/devices/system/node/` sysfs tree
+- **NUMA-Local Buffer Pool** (`pool.go`): Provides buffer pools that allocate memory from local NUMA nodes
+- **Thread Pinning** (`numa_linux.go`): Allows threads to be pinned to specific NUMA nodes
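+
+As an illustration of the topology detection step, each node directory under `/sys/devices/system/node/` exposes a `cpulist` file (for example `node0/cpulist`) holding a string such as `0-3,8-11`. The sketch below expands that format into CPU IDs. The helper name `parseCPUList` is hypothetical and is not taken from the gotgt sources; the actual detection code may differ.
+
+```go
+package main
+
+import (
+	"fmt"
+	"strconv"
+	"strings"
+)
+
+// parseCPUList expands a sysfs cpulist string such as "0-3,8-11" into
+// individual CPU IDs. Hypothetical helper, not part of gotgt.
+func parseCPUList(s string) ([]int, error) {
+	s = strings.TrimSpace(s)
+	if s == "" {
+		return nil, nil // memory-only nodes report an empty cpulist
+	}
+	var cpus []int
+	for _, part := range strings.Split(s, ",") {
+		lo, hi, isRange := strings.Cut(part, "-")
+		first, err := strconv.Atoi(lo)
+		if err != nil {
+			return nil, fmt.Errorf("bad cpulist entry %q: %w", part, err)
+		}
+		last := first
+		if isRange {
+			if last, err = strconv.Atoi(hi); err != nil {
+				return nil, fmt.Errorf("bad cpulist entry %q: %w", part, err)
+			}
+		}
+		if last < first {
+			return nil, fmt.Errorf("bad cpulist range %q", part)
+		}
+		for c := first; c <= last; c++ {
+			cpus = append(cpus, c)
+		}
+	}
+	return cpus, nil
+}
+
+func main() {
+	// In real use the input would come from
+	// os.ReadFile("/sys/devices/system/node/node0/cpulist").
+	cpus, err := parseCPUList("0-3,8-11\n")
+	fmt.Println(cpus, err)
+}
+```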
+
+### Key Components
+
+#### NUMABufferPool
+
+```go
+pool := numa.NewNUMABufferPool(&numa.BufferPoolConfig{
+    BufferSize:      256 * 1024,  // 256KB buffers
+    PerNodePoolSize: 1024,        // 1024 buffers per node
+    EnableNUMA:      true,
+})
+
+buf := pool.Get()  // Get buffer from local NUMA node
+// use buffer...
+pool.Put(buf)      // Return buffer to pool
+```
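+
+To make the per-node pooling idea concrete, here is a minimal sketch of one bounded free list per NUMA node. It is not gotgt's `NUMABufferPool`: the `nodePool` type and its methods are hypothetical, and the caller passes the node index explicitly instead of having it detected. Plain `make([]byte, ...)` does not bind memory to a node; under Linux's default first-touch policy the pages usually land on the node of the thread that first writes them, so a real implementation would need pinned threads or explicit memory policy calls.
+
+```go
+package main
+
+import "fmt"
+
+// nodePool keeps one bounded free list per NUMA node.
+// Hypothetical type for illustration only.
+type nodePool struct {
+	bufSize int
+	free    []chan []byte
+}
+
+func newNodePool(nodes, perNode, bufSize int) *nodePool {
+	p := &nodePool{bufSize: bufSize, free: make([]chan []byte, nodes)}
+	for i := range p.free {
+		p.free[i] = make(chan []byte, perNode)
+	}
+	return p
+}
+
+// Get returns a pooled buffer for node, or allocates a new one if the
+// node's free list is empty.
+func (p *nodePool) Get(node int) []byte {
+	select {
+	case b := <-p.free[node]:
+		return b
+	default:
+		return make([]byte, p.bufSize)
+	}
+}
+
+// Put hands buf back to node's free list. Buffers that are too small,
+// or that arrive when the list is full, are dropped for the GC.
+func (p *nodePool) Put(node int, buf []byte) {
+	if cap(buf) < p.bufSize {
+		return
+	}
+	select {
+	case p.free[node] <- buf[:p.bufSize]:
+	default:
+	}
+}
+
+func main() {
+	p := newNodePool(2, 1024, 256*1024)
+	buf := p.Get(0)
+	p.Put(0, buf)
+	fmt.Println("buffers cached on node 0:", len(p.free[0]))
+}
+```
+
+Keeping the lists separate means a buffer released on one node is only reused by callers asking for that node, which is the property the pool relies on to avoid cross-socket traffic.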
+
+#### Thread Pinning
+
+```go
+// Pin the OS thread running the current goroutine to NUMA node 0
+numa.PinThreadToNode(0)
+defer numa.UnpinThread()
+
+// Or use RunOnNode for a function
+numa.RunOnNode(0, func() {
+    // This function runs on NUMA node 0
+})
+```
+
+### Performance Benefits
+
+- Reduced memory latency by accessing local NUMA nodes
+- Better cache utilization
+- Reduced cross-socket traffic
+- Predictable performance on multi-socket systems
+
+### Configuration
+
+Enable NUMA support in the configuration file:
+
+```json
+{
+  "performance": {
+    "enableNUMA": true,
+    "numaBufferPoolSize": 1024,
+    "numaBufferSize": 262144
+  }
+}
+```
+
+## 2. io_uring Backend Storage
+
+### What is io_uring?
+
+io_uring is a Linux kernel interface for asynchronous I/O that was introduced in Linux 5.1. It provides a highly efficient interface for submitting and completing I/O operations with minimal system call overhead.
+
+### Benefits of io_uring
+
+- Reduced system call overhead (batching of operations)
+- Lower latency for I/O operations
+- Higher throughput, especially for high-queue-depth workloads
+- Better CPU efficiency
+
+### Implementation
+
+The io_uring backend is implemented in `pkg/scsi/backingstore/iouring/`:
+
+- **Async I/O Operations**: Read, Write, and Fsync using io_uring
+- **Queue Management**: Configurable queue depth
+- **Fallback Support**: Automatically falls back to regular I/O on older kernels
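+
+One way the fallback decision could be made is a kernel release check, sketched below. The function names `kernelAtLeast` and `chooseBackend` are hypothetical and not taken from gotgt; only the `iouring` and `file` backend names come from this document. A version check alone is not sufficient in practice, because io_uring can be disabled by sysctl or seccomp, so a robust probe would also try to create a ring and fall back if that fails.
+
+```go
+package main
+
+import "fmt"
+
+// kernelAtLeast reports whether a `uname -r` style release string
+// (e.g. "5.15.0-91-generic") is at least major.minor.
+func kernelAtLeast(release string, major, minor int) bool {
+	var gotMajor, gotMinor int
+	if _, err := fmt.Sscanf(release, "%d.%d", &gotMajor, &gotMinor); err != nil {
+		return false // unparsable release: assume io_uring is unavailable
+	}
+	return gotMajor > major || (gotMajor == major && gotMinor >= minor)
+}
+
+// chooseBackend picks the backendType value for a given kernel release.
+func chooseBackend(release string) string {
+	if kernelAtLeast(release, 5, 1) {
+		return "iouring"
+	}
+	return "file"
+}
+
+func main() {
+	fmt.Println(chooseBackend("5.15.0-91-generic"))
+	fmt.Println(chooseBackend("4.19.0"))
+}
+```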
+
+### Usage
+
+Enable io_uring in the storage configuration:
+
+```json
+{
+  "storages": [
+    {
+      "deviceID": 1000,
+      "path": "/var/tmp/disk.img",
+      "online": true,
+      "backendType": "iouring",
+      "ioUringQueueDepth": 4096
+    }
+  ],
+  "performance": {
+    "enableIoUring": true,
+    "ioUringQueueDepth": 4096
+  }
+}
+```
+
+### Backend Type Options
+
+- `file` - Standard synchronous file I/O (default)
+- `iouring` - io_uring-based asynchronous I/O (Linux 5.1+)
+
+### Requirements
+
+- Linux kernel 5.1 or later
+- x86_64, ARM64, or other supported architectures
+- O_DIRECT support recommended for best performance
+
+## 3. Combined Configuration Example
+
+For maximum performance, combine NUMA-aware allocation with the io_uring backend:
+
+```json
+{
+  "storages": [
+    {
+      "deviceID": 1000,
+      "path": "/var/tmp/disk.img",
+      "online": true,
+      "backendType": "iouring",
+      "enableNUMA": true,
+      "numaNode": 0,
+      "ioUringQueueDepth": 4096
+    }
+  ],
+  "iscsiportals": [
+    {
+      "id": 0,
+      "portal": "192.168.1.100:3260"
+    }
+  ],
+  "iscsitargets": {
+    "iqn.2024-01.com.gotgt:fast-storage": {
+      "tpgts": { "1": [0] },
+      "luns": { "1": 1000 }
+    }
+  },
+  "performance": {
+    "enableNUMA": true,
+    "enableIoUring": true,
+    "ioUringQueueDepth": 4096,
+    "numaBufferPoolSize": 1024,
+    "numaBufferSize": 262144
+  }
+}
+```
+
+## 4. Performance Tuning Guide
+
+### NUMA Tuning
+
+1. **Determine NUMA Topology**:
+   ```bash
+   numactl --hardware
+   lscpu | grep NUMA
+   ```
+
+2. **Align Network and Storage**:
+   - Ensure network interfaces are on the same NUMA node as the iSCSI process (a NIC's node is reported by `cat /sys/class/net/<iface>/device/numa_node`)
+   - Place storage devices on the same NUMA node if possible
+
+3. **Buffer Pool Sizing**:
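+   - `numaBufferPoolSize`: Number of buffers per node (default: 1024)
+   - `numaBufferSize`: Size of each buffer in bytes (default: 262144, i.e. 256 KB)
+   - Size the pool for the expected number of concurrent I/Os and the largest I/O size; the defaults allow up to 1024 × 256 KB = 256 MB of pooled buffers per node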
+
+### io_uring Tuning
+
+1. **Queue Depth**:
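+   - Higher queue depth: more requests in flight, which usually raises throughput but also raises per-request latency once the device is saturated
+   - Lower queue depth: lower per-request latency, but throughput may be limited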
+   - Typical values: 128-4096 depending on workload
+
+2. **I/O Size**:
+   - Match application I/O size for best efficiency
+   - Use direct I/O (O_DIRECT) to bypass page cache if appropriate
+
+3. **System Limits**:
+   ```bash
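+   # Check current limits
+   ulimit -a
+
+   # Increase if needed by adding these lines to /etc/security/limits.conf
+   # (they are configuration entries, not shell commands):
+   #   * soft nofile 1048576
+   #   * hard nofile 1048576
+
+   # On kernels before 5.12, io_uring ring memory also counts against the
+   # locked-memory limit, so check it as well:
+   ulimit -l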
+   ```
+
+## 5. Benchmarking
+
+Use the following tools to benchmark performance:
+
+1. **fio** (Flexible I/O Tester):
+   ```bash
+   fio --name=iscsi-test --ioengine=libaio --iodepth=32 \
+       --rw=randread --bs=4k --direct=1 --size=1G \
+       --filename=/dev/sdX
+   ```
+
+2. **iperf3** (for network bandwidth):
+   ```bash
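+   # On the target host: start an iperf3 server (port 3260 is already used by the iSCSI portal)
+   iperf3 -s
+   # On the initiator host: connect to the server on iperf3's default port 5201
+   iperf3 -c <target-ip>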
+   ```
+
+3. **iscsi-perf** (if available from libiscsi)
+
+## 6. Troubleshooting
+
+### NUMA Issues
+
+- Check if NUMA is available: `numa.Available()`
+- Verify topology detection: Check logs for NUMA node count
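+- Thread pinning failures: Ensure sufficient privileges (CAP_SYS_NICE) and that the process's allowed CPU set (cgroup cpuset, container limits, or `taskset`) includes the target node's CPUs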
+
+### io_uring Issues
+
+- Kernel version check: `uname -r` (must be 5.1+)
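+- io_uring availability: On Linux 6.6+, read `/proc/sys/kernel/io_uring_disabled` (0 = enabled, 1 = disabled for unprivileged processes, 2 = disabled for all processes); the file does not exist on older kernels, so its absence does not mean io_uring is unavailable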
+- Permission issues: Ensure user has appropriate file permissions
+
+## 7. Future Enhancements
+
+Potential future optimizations:
+
+1. **DPDK Support** - Kernel-bypass networking for iSCSI
+2. **SPDK Integration** - User-space NVMe driver support
+3. **CPU Affinity Configuration** - Fine-grained CPU pinning
+4. **Memory Interleaving** - Automatic memory interleaving policies
+5. **Adaptive Buffer Sizing** - Dynamic buffer pool sizing based on workload
+
+## References
+
+- [io_uring by Jens Axboe](https://kernel.dk/io_uring.pdf)
+- [NUMA FAQ](https://www.kernel.org/doc/html/latest/vm/numa.html)
+- [iSCSI RFC 7143](https://tools.ietf.org/html/rfc7143)