# Performance Optimizations for gotgt

This document describes the performance optimizations implemented for gotgt, focusing on NUMA-aware memory allocation and io_uring backend storage support.

## Overview

Two major performance optimizations have been implemented:

1. **NUMA-Aware Memory Allocation** - Optimizes memory access patterns on multi-socket systems
2. **io_uring Backend Storage** - Provides high-performance asynchronous I/O on Linux 5.1+

## 1. NUMA-Aware Memory Allocation

### What is NUMA?

Non-Uniform Memory Access (NUMA) is a memory design used in multi-processor systems where memory access time depends on the location of the memory relative to the processor. Under NUMA, a processor can access its own local memory faster than non-local memory (memory local to another processor or memory shared between processors).

### Implementation

NUMA support is implemented in `pkg/util/numa/`:

- **Topology Detection** (`numa.go`, `numa_linux.go`): Automatically detects the NUMA topology from the `/sys/devices/system/node/` filesystem
- **NUMA-Local Buffer Pool** (`pool.go`): Provides buffer pools that allocate memory from the local NUMA node
- **Thread Pinning** (`numa_linux.go`): Allows threads to be pinned to specific NUMA nodes

### Key Components

#### NUMABufferPool

```go
pool := numa.NewNUMABufferPool(&numa.BufferPoolConfig{
    BufferSize:      256 * 1024, // 256KB buffers
    PerNodePoolSize: 1024,       // 1024 buffers per node
    EnableNUMA:      true,
})

buf := pool.Get() // Get buffer from local NUMA node
// use buffer...
pool.Put(buf) // Return buffer to pool
```

#### Thread Pinning

```go
// Pin current goroutine to NUMA node 0
numa.PinThreadToNode(0)
defer numa.UnpinThread()

// Or use RunOnNode for a function
numa.RunOnNode(0, func() {
    // This function runs on NUMA node 0
})
```

### Performance Benefits

- Reduced memory latency by accessing local NUMA nodes
- Better cache utilization
- Reduced cross-socket traffic
- More predictable performance on multi-socket systems

### Configuration

Enable NUMA support in the configuration file:

```json
{
  "performance": {
    "enableNUMA": true,
    "numaBufferPoolSize": 1024,
    "numaBufferSize": 262144
  }
}
```

## 2. io_uring Backend Storage

### What is io_uring?

io_uring is a Linux kernel interface for asynchronous I/O that was introduced in Linux 5.1. It provides a highly efficient interface for submitting and completing I/O operations with minimal system call overhead.

### Benefits of io_uring

- Reduced system call overhead (batching of operations)
- Lower latency for I/O operations
- Higher throughput, especially for workloads with high queue depth
- Better CPU efficiency

### Implementation

The io_uring backend is implemented in `pkg/scsi/backingstore/iouring/`:

- **Async I/O Operations**: Read, Write, and Fsync using io_uring
- **Queue Management**: Configurable queue depth
- **Fallback Support**: Automatically falls back to regular I/O on older kernels

### Usage

Enable io_uring in the storage configuration:

```json
{
  "storages": [
    {
      "deviceID": 1000,
      "path": "/var/tmp/disk.img",
      "online": true,
      "backendType": "iouring",
      "ioUringQueueDepth": 4096
    }
  ],
  "performance": {
    "enableIoUring": true,
    "ioUringQueueDepth": 4096
  }
}
```

### Backend Type Options

- `file` - Standard synchronous file I/O (default)
- `iouring` - io_uring-based asynchronous I/O (Linux 5.1+)

### Requirements

- Linux kernel 5.1 or later
- x86_64, ARM64, or other supported architectures
- O_DIRECT support recommended for best performance

## 3. Combined Configuration Example

For maximum performance, combine NUMA and io_uring:

```json
{
  "storages": [
    {
      "deviceID": 1000,
      "path": "/var/tmp/disk.img",
      "online": true,
      "backendType": "iouring",
      "enableNUMA": true,
      "numaNode": 0,
      "ioUringQueueDepth": 4096
    }
  ],
  "iscsiportals": [
    {
      "id": 0,
      "portal": "192.168.1.100:3260"
    }
  ],
  "iscsitargets": {
    "iqn.2024-01.com.gotgt:fast-storage": {
      "tpgts": {
        "1": [0]
      },
      "luns": {
        "1": 1000
      }
    }
  },
  "performance": {
    "enableNUMA": true,
    "enableIoUring": true,
    "ioUringQueueDepth": 4096,
    "numaBufferPoolSize": 1024,
    "numaBufferSize": 262144
  }
}
```

## 4. Performance Tuning Guide

### NUMA Tuning

1. **Determine NUMA Topology**:

   ```bash
   numactl --hardware
   lscpu | grep NUMA
   ```

2. **Align Network and Storage**:
   - Ensure network interfaces are on the same NUMA node as the iSCSI process
   - Place storage devices on the same NUMA node if possible

3. **Buffer Pool Sizing**:
   - `numaBufferPoolSize`: Number of buffers per node (default: 1024)
   - `numaBufferSize`: Size of each buffer in bytes (default: 262144, i.e. 256KB)
   - Size the pool based on the expected number of concurrent I/Os and the I/O size

### io_uring Tuning

1. **Queue Depth**:
   - Higher queue depth = better throughput, higher latency
   - Lower queue depth = lower latency, lower throughput
   - Typical values: 128-4096 depending on workload

2. **I/O Size**:
   - Match the application I/O size for best efficiency
   - Use direct I/O (O_DIRECT) to bypass the page cache if appropriate

3. **System Limits**:

   ```bash
   # Check current limits
   ulimit -a

   # Increase if needed (in /etc/security/limits.conf)
   * soft nofile 1048576
   * hard nofile 1048576
   ```

## 5. Benchmarking

Use the following tools to benchmark performance:

1. **fio** (Flexible I/O Tester):

   ```bash
   fio --name=iscsi-test --ioengine=libaio --iodepth=32 \
       --rw=randread --bs=4k --direct=1 --size=1G \
       --filename=/dev/sdX
   ```

2. **iperf3** (for network bandwidth):

   ```bash
   iperf3 -c <target-ip> -p 3260
   ```

3. **iscsi-perf** (if available from libiscsi)

## 6. Troubleshooting

### NUMA Issues

- Check if NUMA is available: `numa.Available()`
- Verify topology detection: Check the logs for the detected NUMA node count
- Thread pinning failures: Ensure sufficient privileges (CAP_SYS_NICE)

### io_uring Issues

- Kernel version check: `uname -r` (must be 5.1+)
- io_uring availability: On kernels that provide `/proc/sys/kernel/io_uring_disabled` (6.6+), check that its value is `0`; a non-zero value restricts or disables io_uring
- Permission issues: Ensure the user has appropriate file permissions

## 7. Future Enhancements

Potential future optimizations:

1. **DPDK Support** - Kernel-bypass networking for iSCSI
2. **SPDK Integration** - User-space NVMe driver support
3. **CPU Affinity Configuration** - Fine-grained CPU pinning
4. **Memory Interleaving** - Automatic memory interleaving policies
5. **Adaptive Buffer Sizing** - Dynamic buffer pool sizing based on workload

## References

- [io_uring by Jens Axboe](https://kernel.dk/io_uring.pdf)
- [NUMA FAQ](https://www.kernel.org/doc/html/latest/vm/numa.html)
- [iSCSI RFC 7143](https://tools.ietf.org/html/rfc7143)
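
The kernel version check from the io_uring troubleshooting steps can also be done programmatically, for example before selecting the `iouring` backend type. The following is a minimal sketch under stated assumptions, not gotgt's actual fallback logic: `parseKernelRelease` and `kernelSupportsIoUring` are hypothetical helper names, and the input is assumed to be a release string in the form printed by `uname -r`.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseKernelRelease (hypothetical helper) extracts the major and minor
// version from a release string such as "5.15.0-91-generic".
func parseKernelRelease(release string) (major, minor int, err error) {
	parts := strings.SplitN(strings.TrimSpace(release), ".", 3)
	if len(parts) < 2 {
		return 0, 0, fmt.Errorf("unexpected kernel release %q", release)
	}
	major, err = strconv.Atoi(parts[0])
	if err != nil {
		return 0, 0, fmt.Errorf("bad major version in %q: %w", release, err)
	}
	// The minor field may carry a suffix (e.g. "1-rc3"); keep leading digits only.
	digits := parts[1]
	for i, r := range digits {
		if r < '0' || r > '9' {
			digits = digits[:i]
			break
		}
	}
	minor, err = strconv.Atoi(digits)
	if err != nil {
		return 0, 0, fmt.Errorf("bad minor version in %q: %w", release, err)
	}
	return major, minor, nil
}

// kernelSupportsIoUring (hypothetical helper) reports whether the release is
// 5.1 or later. Unparseable strings are treated as unsupported.
func kernelSupportsIoUring(release string) bool {
	major, minor, err := parseKernelRelease(release)
	if err != nil {
		return false
	}
	return major > 5 || (major == 5 && minor >= 1)
}

func main() {
	releases := []string{"4.19.0-25-amd64", "5.0.21", "5.1.0", "6.8.0-45-generic"}
	for _, rel := range releases {
		fmt.Printf("%-20s io_uring-capable kernel: %v\n", rel, kernelSupportsIoUring(rel))
	}
}
```

A sufficiently new kernel is necessary but not sufficient: io_uring can still be disabled by a sysctl or a sandbox policy, so a check like this is best treated as a quick pre-filter before attempting to set up a ring and falling back to `file` on error.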