CoreWeave Vera Rubin NVL72 vs H100 vs AWS Trainium2: Best GPU Cloud for LLM Inference APAC 2026
On 18 June 2025, CoreWeave announced the industry's first production deployment of the NVIDIA Vera Rubin NVL72 cluster, a 72-GPU NVLink-connected rack system built on NVIDIA's Rubin architecture. The announcement drew little attention, but it has real implications for APAC enterprises running LLM inference. If you signed an H100 reserved contract in 2024, you now have a strategic decision to make: extend it, or plan a move to Rubin-class capacity.
This article gives you a data-grounded comparison of Vera Rubin NVL72, H100 SXM5, and AWS Trainium2 across the three dimensions that actually matter for inference workloads: raw throughput, cost-per-token, and APAC availability.
Why Vera Rubin NVL72 Changes the APAC GPU Calculus
NVIDIA's Vera Rubin architecture (successor to Blackwell) introduces a tightly coupled CPU+GPU design that pairs the Vera CPU with Rubin GPUs. The NVL72 rack integrates 72 Rubin GPUs over fifth-generation NVLink, delivering significantly higher inter-GPU bandwidth than the 900 GB/s per-GPU NVLink 4.0 found in 8-GPU H100 (HGX) configurations. NVIDIA's published spec sheet for Vera Rubin claims ~3.3× the FP8 inference throughput of the H100 SXM5 at rack level, though production-validated, customer-facing benchmarks remain limited to CoreWeave's initial deployment data.
CoreWeave is currently the only cloud provider with live NVL72 capacity. AWS, GCP, and Azure have not yet announced GA availability for Rubin-class silicon. That exclusivity window matters for APAC buyers deciding whether to join a waitlist now or extend H100 contracts while supply normalises.
Specification Snapshot: Vera Rubin NVL72 vs H100 SXM5 vs AWS Trainium2
- NVIDIA H100 SXM5 (per GPU): 80 GB HBM3, 3.35 TB/s memory bandwidth, 3,958 TFLOPS FP8 (NVIDIA's with-sparsity figure; dense FP8 is about half that). Spot pricing on CoreWeave/Lambda: $2.49–$3.20/GPU-hour (June 2025 market rate). Reserved 1-year: ~$1.80–$2.10/GPU-hour.
- CoreWeave Vera Rubin NVL72 (per rack, 72 GPUs): HBM4 per die, NVLink 5.0. NVIDIA projects >10,000 TFLOPS FP8 per GPU in the rack-scale configuration. CoreWeave has not yet published per-GPU-hour list pricing; early enterprise quotes are reportedly circulating at a ~15–25% premium over H100 reserved rates. If the ~3× throughput claim holds, a premium of that size still yields a lower effective cost per token.
- AWS Trainium2 (trn2.48xlarge, 16 chips): 16 × 96 GB HBM3 (1.5 TB total), 46 TB/s aggregate memory bandwidth. AWS list price: $21.50/hour on-demand (us-east-1); ~$12.80/hour 1-year reserved. No APAC (ap-southeast-1 / ap-northeast-1) GA availability as of June 2025; only us-east-1/us-west-2.
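The prices above are quoted in different units (per GPU-hour for H100, a percentage premium for Rubin, per instance-hour for Trainium2), so it can help to put them on a common per-accelerator-hour basis. The following is a minimal sketch that uses only the figures quoted in the list; the function and variable names are illustrative, and the Rubin range is derived from the reported premium rather than from any published price.

```python
# Inputs are the figures quoted in the list above (USD, June 2025).
H100_RESERVED = (1.80, 2.10)   # per GPU-hour, 1-year reserved (low, high)
RUBIN_PREMIUM = (0.15, 0.25)   # reported premium over H100 reserved (low, high)
TRN2_CHIPS = 16                # chips per trn2.48xlarge instance
TRN2_ON_DEMAND = 21.50         # per instance-hour
TRN2_RESERVED = 12.80          # per instance-hour, 1-year reserved


def rubin_rate_range(h100_range, premium_range):
    """Implied Rubin per-GPU-hour range: lowest base with lowest premium,
    highest base with highest premium. This is an estimate, not a list price."""
    low = h100_range[0] * (1 + premium_range[0])
    high = h100_range[1] * (1 + premium_range[1])
    return low, high


def per_chip_rate(instance_rate, chips):
    """Spread an instance-hour price evenly across its accelerator chips."""
    return instance_rate / chips


rubin_low, rubin_high = rubin_rate_range(H100_RESERVED, RUBIN_PREMIUM)
print(f"Rubin (implied):      ${rubin_low:.2f}-${rubin_high:.2f} per GPU-hour")
print(f"Trainium2 on-demand:  ${per_chip_rate(TRN2_ON_DEMAND, TRN2_CHIPS):.2f} per chip-hour")
print(f"Trainium2 reserved:   ${per_chip_rate(TRN2_RESERVED, TRN2_CHIPS):.2f} per chip-hour")
```

A per-chip price is only a starting point: a Trainium2 chip, an H100, and a Rubin GPU do very different amounts of work per hour, which is why the throughput comparison that follows matters more than the hourly rate.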
Inference Throughput: What the Numbers Actually Mean
For a dense 70B-parameter model (e.g., a Llama 3 70B-class model) running with FP8 quantisation at batch size 64:
- H100 SXM5 × 8 (DGX-equivalent): ~12,000–14,000 tokens/second sustained throughput. Real-world data from vLLM benchmarks on CoreWeave clusters, published May 2025.
- Vera Rubin NVL72: NVIDIA's architecture projections suggest the same 70B model could achieve 35,000–42,000 tokens/second at rack level — roughly 3× H100. No third-party replication is available yet; treat this as vendor-projected until independent benchmarks emerge.
- AWS Trainium2 trn2.48xlarge: AWS's own benchmarks show ~9,800 tokens/second for a Llama-2 70B equivalent, with the caveat that Neuron SDK compilation adds 15–20 minutes per model version change. That is a non-trivial cost for teams that ship fine-tuned or updated model versions frequently.
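The practical implication: if your APAC inference cluster currently runs 8× H100s at $2.00/GPU-hour reserved ($16.00/hour in total) and produces 13,000 tokens/second (46.8 million tokens/hour), your cost per million tokens is about $0.34, or $0.32–$0.37 across the 12,000–14,000 tokens/second range. If Vera Rubin NVL72 delivers even 2.5× the throughput per GPU at a 20% higher per-GPU-hour rate, cost per token scales by 1.2 / 2.5 = 0.48, which brings the figure to roughly $0.15–$0.18 per million tokens, a reduction of about 52%.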
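The arithmetic in the paragraph above can be reproduced with a few lines of Python. This is a sketch under the article's stated inputs; the function name is illustrative, and the Rubin scenario assumes the same GPU count as the H100 baseline, with the 2.5× throughput and 20% premium both applied per GPU.

```python
def cost_per_million_tokens(hourly_cost_usd: float, tokens_per_second: float) -> float:
    """Cost in USD to generate one million tokens at a sustained throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return hourly_cost_usd / tokens_per_hour * 1_000_000


# Baseline from the text: 8x H100 at $2.00/GPU-hour reserved, 13,000 tokens/second.
h100_hourly = 8 * 2.00
h100_cost = cost_per_million_tokens(h100_hourly, 13_000)

# Hypothetical Rubin scenario: same GPU count, 20% higher rate, 2.5x throughput.
rubin_hourly = h100_hourly * 1.20
rubin_cost = cost_per_million_tokens(rubin_hourly, 13_000 * 2.5)

reduction = 1 - rubin_cost / h100_cost

print(f"H100:  ${h100_cost:.3f} per million tokens")
print(f"Rubin: ${rubin_cost:.3f} per million tokens")
print(f"Reduction: {reduction:.0%}")
```

Because the result depends only on the ratio of the price premium to the throughput gain, it is worth rerunning with your own quoted rate and measured throughput before treating the saving as real.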
APAC Availability: The Critical Constraint
Raw specs mean nothing if the hardware isn't in-region. Latency between a US-West CoreWeave PoP and an end-user in Singapore or Tokyo adds 150–200 ms round-trip — unacceptable for real-time inference in iGaming, trading, or conversational AI.
- H100 SXM5: Available today across CoreWeave (Chicago, Frankfurt), Lambda Labs (US), GCP (us-central1, europe-west4), and limited availability via Alibaba Cloud ACK + GPU nodes in ap-southeast-1 (Singapore). APAC H100 spot remains constrained with 1–3 week lead times on reserved pools.
- Vera Rubin NVL72: CoreWeave's initial deployment is US-based, with no confirmed APAC PoP timeline as of June 2025. CoreWeave has a Singapore facility on its roadmap but no public GA date. APAC buyers should plan for an estimated 6–12 month wait for in-region Rubin capacity.
- AWS Trainium2: US-only GA. APAC availability has been "coming soon" since re:Invent 2023. For APAC-native inference, Trainium2 is currently a non-option unless your architecture tolerates cross-Pacific latency.
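The latency constraint at the start of this section can be framed as a simple budget check: time to first token is roughly the network round trip plus server-side prefill time. The sketch below uses the 150–200 ms cross-Pacific round trip from the text; the prefill time, in-region round trip, and latency budget are placeholder assumptions, not measurements, and should be replaced with your own numbers.

```python
def time_to_first_token_ms(network_rtt_ms: float, prefill_ms: float) -> float:
    """Rough time to first token: one network round trip plus prefill."""
    return network_rtt_ms + prefill_ms


def meets_budget(network_rtt_ms: float, prefill_ms: float, budget_ms: float) -> bool:
    """True if the estimated time to first token fits the latency budget."""
    return time_to_first_token_ms(network_rtt_ms, prefill_ms) <= budget_ms


# Placeholder assumptions (not from the article): tune to your workload.
ASSUMED_PREFILL_MS = 120
ASSUMED_BUDGET_MS = 250
ASSUMED_IN_REGION_RTT_MS = 10

scenarios = [
    ("in-region", ASSUMED_IN_REGION_RTT_MS),
    ("cross-Pacific (low)", 150),    # lower bound quoted in the text
    ("cross-Pacific (high)", 200),   # upper bound quoted in the text
]

for label, rtt in scenarios:
    ttft = time_to_first_token_ms(rtt, ASSUMED_PREFILL_MS)
    ok = meets_budget(rtt, ASSUMED_PREFILL_MS, ASSUMED_BUDGET_MS)
    print(f"{label:22s} TTFT ~{ttft:.0f} ms, within budget: {ok}")
```

Under these placeholder values only the in-region case fits the budget, which is the point the availability list makes: out-of-region hardware has to be very much faster to compensate for the round trip.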
Vendor Lock-In and Portability Risk
This is where the decision becomes strategic rather than purely technical. Vera Rubin NVL72 is CoreWeave-exclusive today. That creates a concentration risk: if CoreWeave experiences capacity constraints, pricing shifts, or service disruptions, you have no immediate fallback with equivalent hardware.
AWS Trainium2 carries a different lock-in vector: the Neuron SDK. Models must be compiled for Trainium — PyTorch and TensorRT models don't run natively. Migration off Trainium back to CUDA-native environments requires recompilation and re-validation, adding 2–4 weeks of engineering time per major model update.
H100-based CUDA workloads remain the most portable: they run with little or no modification on CoreWeave, Lambda, GCP, AWS (p5 instances), Azure (ND H100 v5), and Alibaba Cloud GPU nodes. For APAC enterprises that require multi-cloud failover (a hard requirement for iGaming licensees under PAGCOR, MGA, or Curaçao frameworks), H100 CUDA portability is a genuine operational advantage that Rubin and Trainium cannot yet match.
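To illustrate why portability matters for failover, here is a minimal sketch of provider selection when the same CUDA workload can run on several clouds. The provider names come from the list above; the preference order, the health map, and the function name are hypothetical, and a real deployment would obtain health status from its own monitoring rather than a hard-coded dictionary.

```python
from typing import Mapping, Optional, Sequence


def pick_provider(preference: Sequence[str], healthy: Mapping[str, bool]) -> Optional[str]:
    """Return the first provider in preference order that is marked healthy.

    Providers missing from the health map are treated as unavailable.
    Returns None when no provider is healthy.
    """
    for name in preference:
        if healthy.get(name, False):
            return name
    return None


# Hypothetical preference order for an H100 CUDA workload.
H100_PREFERENCE = ["CoreWeave", "GCP", "AWS p5", "Azure ND H100 v5"]

# Hypothetical health snapshot: the primary provider is out of capacity.
health_snapshot = {
    "CoreWeave": False,
    "GCP": True,
    "AWS p5": True,
    "Azure ND H100 v5": True,
}

print(pick_provider(H100_PREFERENCE, health_snapshot))

# With a single-provider platform the fallback list has one entry,
# so an outage leaves nothing to fail over to.
print(pick_provider(["CoreWeave"], {"CoreWeave": False}))
```

The contrast between the two calls is the concentration risk described earlier in this section: a one-entry preference list has no fallback.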
LLM API Overhead: Llama 4 Scout and the Open-Source Shift
Meta's Llama 4 Scout has become the reference open-source long-context multimodal model for APAC enterprise deployments in 2025. At 17B active parameters (mixture-of-experts architecture), it delivers strong throughput-per-dollar on H100 clusters