White Paper
November 18, 2025

WEKA-Powered Tier-1 NVMe Storage for Extreme GPU Training

NXON integrates WEKA's high-performance file system with an expertly engineered RoCEv2 RDMA fabric, delivering consistent sub-millisecond latency, aggregate throughput of 3 TB/s reads and 1.5 TB/s writes, and scale-out to 140+ GPU clients with no loss in per-node performance.

NXON AI Factory Pty Ltd (NXON.AI)

Engineered by NXON for RoCEv2-optimized, multi-petabyte AI training environments.

Modern AI training requires storage that keeps pace with the world's fastest GPUs. NXON integrates WEKA's high-performance file system with our expertly engineered RoCEv2 RDMA fabric, enabling:

  • Consistent sub-millisecond latency, even during metadata-heavy checkpoint bursts
  • Aggregate throughput of 3 TB/s reads and 1.5 TB/s writes
  • Up to 40 GB/s per GPU node with no manual tuning
  • Scaling to 140+ GPU clients with no loss in per-node throughput
  • Predictable checkpoint times for higher GPU utilization
  • 4.7 PB of usable Tier-1 NVMe in under two racks
  • Distributed, parallel metadata architecture
  • Single NVMe tier (no tiering overhead)
  • Kernel-bypass I/O client for maximum bandwidth
  • Unified POSIX namespace for datasets & checkpoints
  • 2 × 400 GbE uplinks per backend node (see the sizing sketch after this list)
  • Fully lossless, RoCEv2-enabled RDMA fabric
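
As a sanity check on the uplink figures above, the back-of-envelope sketch below estimates how many backend nodes must run at line rate to carry the quoted aggregate reads. The node count it derives is our own illustrative arithmetic, not a published NXON configuration (Python):

    # Back-of-envelope: backend nodes needed for 3 TB/s of aggregate
    # reads when each node contributes 2 x 400 GbE uplinks.
    READ_TARGET_TBPS = 3.0                              # quoted aggregate reads
    UPLINKS_PER_NODE = 2                                # 2 x 400 GbE per node
    UPLINK_GBPS = 400

    per_node_gbs = UPLINKS_PER_NODE * UPLINK_GBPS / 8   # 100 GB/s per node
    min_nodes = READ_TARGET_TBPS * 1000 / per_node_gbs  # 30 nodes at line rate
    print(f"{per_node_gbs:.0f} GB/s per node -> at least {min_nodes:.0f} backend nodes")

Any real deployment needs headroom above this line-rate floor, but the figure shows the quoted throughput is consistent with a sub-two-rack backend.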

Designed, engineered, and validated by NXON

NXON's team delivered:

  • RoCEv2 RDMA fabric tuning (PFC, ECN, buffer optimization; see the headroom sketch after this list)
  • Lossless Ethernet configuration
  • GPU node I/O optimization (CPU pinning + NIC affinity)
  • Non-disruptive expansion capability
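
To make the PFC portion of that tuning concrete, the sketch below estimates the lossless-port buffer headroom a switch needs: enough to absorb the bytes still in flight after it emits a PAUSE frame. The formula is a common rule of thumb, and the cable length and MTU are our illustrative assumptions, not NXON's production values:

    # Rule-of-thumb PFC headroom: buffer the data still arriving after
    # a PAUSE frame is sent (round-trip wire delay plus one max-size
    # frame in flight at each end). All inputs are assumptions.
    PROP_DELAY_NS_PER_M = 5      # ~5 ns/m signal propagation
    LINK_RATE_GBPS = 400         # 400 GbE backend uplink
    CABLE_LEN_M = 30             # assumed in-row cable run
    MTU_BYTES = 4200             # assumed RoCEv2 jumbo MTU

    def pfc_headroom_bytes(rate_gbps: float, cable_m: float, mtu: int) -> int:
        rtt_s = 2 * cable_m * PROP_DELAY_NS_PER_M * 1e-9   # round-trip delay
        in_flight = rate_gbps * 1e9 / 8 * rtt_s            # bytes on the wire
        return int(in_flight + 2 * mtu)

    print(pfc_headroom_bytes(LINK_RATE_GBPS, CABLE_LEN_M, MTU_BYTES), "bytes")

Production headroom budgets also cover switch and NIC response latency, so vendor guidance takes precedence over this sketch.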

Cluster Throughput

  • ~3 TB/s Reads
  • ~1.5 TB/s Writes

Per-Node GPU Throughput

  • Up to 40 GB/s per node from dual 200 GbE GPU servers
  • ~80% link efficiency: 2 × 200 Gb/s is roughly 50 GB/s theoretical, of which 40 GB/s is sustained (see the affinity sketch below)
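
Per-node numbers like these depend on the CPU pinning and NIC affinity work described earlier: keeping I/O threads on the NIC's NUMA node avoids cross-socket memory traffic that erodes bandwidth. A minimal Linux sketch, with a placeholder interface name:

    import os

    IFACE = "eth0"  # placeholder; substitute the RDMA-capable NIC

    def nic_numa_node(iface: str) -> int:
        # PCI devices expose their NUMA node via sysfs.
        with open(f"/sys/class/net/{iface}/device/numa_node") as f:
            return int(f.read().strip())

    def numa_cpus(node: int) -> set[int]:
        # Parse the kernel's cpulist format, e.g. "0-15,32-47".
        cpus: set[int] = set()
        with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
            for part in f.read().strip().split(","):
                if "-" in part:
                    lo, hi = part.split("-")
                    cpus.update(range(int(lo), int(hi) + 1))
                else:
                    cpus.add(int(part))
        return cpus

    # Pin this process to cores local to the NIC so I/O buffers
    # stay on the adapter's memory controller.
    os.sched_setaffinity(0, numa_cpus(nic_numa_node(IFACE)))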

Latency

  • Stable latency below 0.5 ms, even under heavy metadata load
  • Millions of IOPS with no tail-latency spikes (see the percentile sketch below)
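
As one way to substantiate a no-tail-spikes claim, the sketch below compares median latency against high percentiles; a flat profile (p99.9 close to p50, both under 0.5 ms) indicates stable latency under load. The sample data is synthetic:

    import random

    # Synthetic stand-in for measured per-request completion latencies.
    latencies_ms = sorted(random.gauss(0.35, 0.05) for _ in range(100_000))

    def pct(p: float) -> float:
        idx = min(int(p / 100 * len(latencies_ms)), len(latencies_ms) - 1)
        return latencies_ms[idx]

    for p in (50, 99, 99.9):
        print(f"p{p}: {pct(p):.3f} ms")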

Scalability

  • Sustained performance across 140+ GPU nodes
  • No loss in per-node throughput as clients are added
  • Automatic backend load balancing
  • Non-disruptive scaling as datasets and models grow

Efficiency

  • 4.7 PB usable in under two racks
  • High throughput at lower power consumption
  • Lower cooling and datacenter footprint

Operational Impact

  • Predictable checkpointing, enabling higher training cadence (see the sketch after this list)
  • Reduced GPU idle time and higher GPU utilization
  • Smoother operations with no manual tuning
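
The checkpointing benefit follows from simple arithmetic: checkpoint time is checkpoint size divided by sustained write throughput. The checkpoint size below is an assumed figure for illustration:

    # Assumed 15 TB full-cluster checkpoint against the quoted
    # 1.5 TB/s aggregate write throughput.
    CKPT_SIZE_TB = 15
    WRITE_TBPS = 1.5

    print(f"{CKPT_SIZE_TB / WRITE_TBPS:.0f} s to flush a {CKPT_SIZE_TB} TB checkpoint")

At that rate a 15 TB checkpoint drains in about 10 seconds, so checkpoint frequency can rise without long GPU stalls.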

NXON provides end-to-end GPU infrastructure expertise:

  • Storage architecture design
  • RoCEv2 fabric engineering
  • Deployment in under four days
  • Data migration and parallel copy strategies (see the sketch after this list)
  • Ongoing tuning for scale-out AI workloads
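
As an illustration of a parallel copy strategy, the sketch below fans file copies across worker threads so many concurrent streams keep the fabric busy during migration. The paths and worker count are placeholders, not NXON's tooling:

    import shutil
    from concurrent.futures import ThreadPoolExecutor
    from pathlib import Path

    SRC = Path("/mnt/legacy_fs/datasets")  # placeholder source mount
    DST = Path("/mnt/weka/datasets")       # placeholder WEKA mount
    WORKERS = 64                           # tune to link and CPU budget

    def copy_one(src: Path) -> None:
        dst = DST / src.relative_to(SRC)
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)             # preserves metadata

    with ThreadPoolExecutor(max_workers=WORKERS) as pool:
        for f in (p for p in SRC.rglob("*") if p.is_file()):
            pool.submit(copy_one, f)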

Ready to Accelerate Your AI Training Pipeline? NXON builds storage architectures designed for the next generation of AI.


Download Full White Paper

Get the complete technical document with detailed implementation insights, architecture diagrams, and best practices for deploying high-performance AI infrastructure.

PDF format, 4 pages

Ready to Implement?

Contact our team to discuss implementation strategies and architecture.