Engineered by NXON for RoCEv2-optimized, multi-petabyte AI training environments.
Modern AI training requires storage that keeps pace with the world's fastest GPUs. NXON integrates WEKA's high-performance file system with our expertly engineered RoCEv2 RDMA fabric, enabling:
- Consistent sub-millisecond latency, even during metadata-heavy checkpoint bursts
- Aggregate throughput of ~3 TB/s reads and ~1.5 TB/s writes
- Up to 40 GB/s per GPU node, with no manual tuning
- Scaling to 140+ GPU clients with zero performance degradation
- Predictable checkpoint times → better GPU utilization
- Under 2 racks of footprint for 4.7 PB of usable Tier-1 NVMe
- Distributed, parallel metadata architecture
- Single NVMe tier (no tiering overhead)
- Kernel-bypass I/O client for maximum bandwidth
- Unified POSIX namespace for datasets and checkpoints
- 2 × 400 GbE uplinks per backend node
- Fully lossless, RoCEv2-enabled RDMA fabric
Designed, engineered, and validated by NXON
Our team delivered:
- RoCEv2 RDMA fabric tuning (PFC, ECN, buffer optimization)
- Lossless Ethernet configuration
- GPU node I/O optimization (CPU pinning + NIC affinity)
- Non-disruptive expansion capability
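As an illustrative sketch of what this class of tuning involves (not NXON's actual runbook), lossless RoCEv2 configuration on NVIDIA ConnectX-class NICs typically combines PFC on a dedicated priority, DSCP-based classification, and IRQ/CPU affinity. The interface name, core range, and worker binary below are assumptions.

```shell
#!/bin/sh
# Illustrative RoCEv2 fabric-tuning sketch. Interface names, priority and
# core choices, and the worker binary are assumptions for NVIDIA
# ConnectX-class NICs -- not NXON's actual configuration.

IFACE=eth0          # hypothetical storage-fabric interface
IRQ_CPUS=2-7        # hypothetical cores reserved for NIC interrupts

# 1. Enable Priority Flow Control on priority 3 (a common RoCE convention),
#    making the Ethernet fabric lossless for RDMA traffic.
mlnx_qos -i "$IFACE" --pfc 0,0,0,1,0,0,0,0

# 2. Trust DSCP markings so switches classify RoCE traffic into the
#    lossless queue.
mlnx_qos -i "$IFACE" --trust dscp

# 3. Pin the NIC's interrupts to cores local to the adapter's NUMA node,
#    so RDMA completions are handled close to the GPU I/O path.
for irq in $(grep "$IFACE" /proc/interrupts | cut -d: -f1); do
    echo "$IRQ_CPUS" > "/proc/irq/$irq/smp_affinity_list"
done

# 4. Launch the I/O-heavy process bound to the same NUMA node
#    (CPU pinning + NIC affinity).
numactl --cpunodebind=0 --membind=0 ./training_io_worker   # hypothetical binary
```

These steps require the NIC vendor tooling and root access on the host; ECN thresholds and switch-side buffer settings are tuned per fabric and are not shown here.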
Cluster Throughput
- ~3 TB/s reads
- ~1.5 TB/s writes
Per-Node GPU Throughput
- Up to 40 GB/s from dual-200 GbE GPU servers
- ~80% link efficiency
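The ~80% figure follows directly from the link math: dual 200 GbE gives 400 Gb/s of raw line rate, i.e. 50 GB/s per node, of which 40 GB/s is delivered. A quick sanity check:

```python
# Sanity check of the per-node link-efficiency figure from the numbers above.
links = 2
line_rate_gbps = 200                        # gigabits per second per link
raw_gb_per_s = links * line_rate_gbps / 8   # bytes: 400 Gb/s -> 50 GB/s

delivered_gb_per_s = 40                     # observed per-node throughput
efficiency = delivered_gb_per_s / raw_gb_per_s

print(f"raw capacity: {raw_gb_per_s:.0f} GB/s, efficiency: {efficiency:.0%}")
# → raw capacity: 50 GB/s, efficiency: 80%
```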
Latency
- Stable <0.5 ms latency, even under heavy metadata load
- Millions of IOPS with no tail-latency spikes
Scalability
- Sustained performance across 140+ GPU nodes
- Zero loss in per-node throughput
- Automatic backend load balancing
- High throughput with lower power consumption
- 4.7 PB usable achieved in under 2 racks
- Lower cooling and datacenter footprint
- Predictable checkpointing, enabling higher training cadence
- Reduced GPU idle time → higher GPU utilization
- Smoother operations with no manual tuning
- Non-disruptive scaling as datasets and models grow
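To see why checkpointing becomes predictable at this write bandwidth, a back-of-the-envelope estimate using the cluster's ~1.5 TB/s aggregate writes is enough; the 3 TB checkpoint size below is a hypothetical example, not a figure from this deployment.

```python
# Back-of-the-envelope checkpoint-time estimate at the cluster's ~1.5 TB/s
# aggregate write throughput. The checkpoint size is a hypothetical example.
write_tb_per_s = 1.5

def checkpoint_seconds(checkpoint_tb: float) -> float:
    """Time to flush one checkpoint at full aggregate write bandwidth."""
    return checkpoint_tb / write_tb_per_s

# e.g. a hypothetical 3 TB checkpoint:
print(f"{checkpoint_seconds(3.0):.1f} s")  # → 2.0 s
```

Seconds-scale, deterministic flush times are what let training jobs checkpoint frequently without stalling the GPUs.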
NXON provides end-to-end GPU infrastructure expertise:
- Storage architecture design
- RoCEv2 fabric engineering
- Deployment in under four days
- Data migration and parallel copy strategies
- Ongoing tuning for scale-out AI workloads
Ready to Accelerate Your AI Training Pipeline?
NXON builds storage architectures designed for the next generation of AI.
