📂 GPU 📅 June 14, 2026 📝 1300 words

CoreWeave Vera Rubin NVL72 vs H100 vs AWS Trainium2: Best GPU Cloud for LLM Inference APAC 2026

On 18 June 2025, CoreWeave announced the industry's first production deployment of the NVIDIA Vera Rubin NVL72 cluster — a 72-GPU NVLink-fused rack system built on NVIDIA's Rubin architecture. The announcement landed quietly, but its implications for APAC enterprises running LLM inference are anything but quiet. If you signed an H100 reserved contract in 2024, you now have a strategic decision to make.

This article gives you a data-grounded comparison of Vera Rubin NVL72, H100 SXM5, and AWS Trainium2 across the three dimensions that actually matter for inference workloads: raw throughput, cost-per-token, and APAC availability.


Why Vera Rubin NVL72 Changes the APAC GPU Calculus

NVIDIA's Vera Rubin platform (successor to Blackwell) pairs Vera CPUs with Rubin GPUs in a tightly coupled design. The NVL72 rack links 72 Rubin GPUs over sixth-generation NVLink, delivering significantly higher inter-GPU bandwidth than the 900 GB/s per GPU of fourth-generation NVLink in 8-GPU HGX H100 configurations. NVIDIA's published spec sheet for Vera Rubin claims ~3.3× the FP8 inference throughput of the H100 SXM5 at rack level, though production-validated, customer-facing benchmarks remain limited to CoreWeave's initial deployment data.

CoreWeave is currently the only cloud provider with live NVL72 capacity. AWS, GCP, and Azure have not yet announced GA availability for Rubin-class silicon. That exclusivity window matters for APAC buyers deciding whether to join a waitlist now or extend H100 contracts while supply normalises.

Specification Snapshot: Vera Rubin NVL72 vs H100 SXM5 vs AWS Trainium2

Inference Throughput: What the Numbers Actually Mean

For a 70B-parameter model (e.g., Llama 4 Scout equivalent) running FP8 quantisation at batch size 64:

The practical implication: if your APAC inference cluster currently runs 8× H100s at $2.00/GPU-hour reserved and produces 13,000 tokens/second, your cost-per-million-tokens sits around $0.34–$0.40. If Vera Rubin NVL72 delivers even 2.5× throughput at a 20% rate premium, that figure drops to roughly $0.15–$0.18/million tokens — a 50%+ reduction.
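The arithmetic behind those figures can be checked with a minimal sketch. It uses only the numbers quoted above; holding the GPU count at 8 in the Rubin scenario is an assumption made for a like-for-like comparison, and the function name is illustrative.

```python
def cost_per_million_tokens(gpu_count, rate_per_gpu_hour, tokens_per_second):
    """Return USD per one million generated tokens for a cluster."""
    hourly_cost = gpu_count * rate_per_gpu_hour
    tokens_per_hour = tokens_per_second * 3600
    return hourly_cost / tokens_per_hour * 1_000_000


# Baseline from the text: 8x H100 at $2.00/GPU-hour, 13,000 tokens/s.
h100 = cost_per_million_tokens(8, 2.00, 13_000)

# Hypothetical Rubin scenario from the text: 2.5x throughput, 20% rate premium.
# Keeping 8 GPUs is an assumption so the two cases stay comparable.
rubin = cost_per_million_tokens(8, 2.00 * 1.20, 13_000 * 2.5)

reduction = 1 - rubin / h100

print(f"H100:  ${h100:.3f} per million tokens")
print(f"Rubin: ${rubin:.3f} per million tokens")
print(f"Reduction: {reduction:.0%}")
```

Because throughput and rate enter as simple ratios, the reduction depends only on the premium divided by the speed-up (1.2 / 2.5 = 0.48 of the baseline cost), not on the cluster size. Swap in your own contract rate and measured tokens per second to rerun the comparison.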

APAC Availability: The Critical Constraint

Raw specs mean nothing if the hardware isn't in-region. Latency between a US-West CoreWeave PoP and an end-user in Singapore or Tokyo adds 150–200 ms round-trip — unacceptable for real-time inference in iGaming, trading, or conversational AI.
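To see why that round trip matters, a rough latency-budget sketch helps. The 300 ms time-to-first-token target below is a hypothetical figure chosen for illustration, not a number from any SLA or from this article; only the 150–200 ms range comes from the text.

```python
def inference_budget_ms(target_ms, network_rtt_ms):
    """Time left for queueing and model compute after the network round trip."""
    return target_ms - network_rtt_ms


# Hypothetical target: 300 ms to first token (an assumed figure).
TARGET_MS = 300

for rtt in (150, 200):  # cross-Pacific round-trip range quoted above
    left = inference_budget_ms(TARGET_MS, rtt)
    share = rtt / TARGET_MS
    print(f"RTT {rtt} ms uses {share:.0%} of the budget, leaving {left} ms for inference")
```

Under that assumed target, the network alone consumes half or more of the budget before the model has produced a single token, which is the case for keeping inference in-region.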

Vendor Lock-In and Portability Risk

This is where the decision becomes strategic rather than purely technical. Vera Rubin NVL72 is CoreWeave-exclusive today. That creates a concentration risk: if CoreWeave experiences capacity constraints, pricing shifts, or service disruptions, you have no immediate fallback with equivalent hardware.

AWS Trainium2 carries a different lock-in vector: the Neuron SDK. Models must be compiled for Trainium with the Neuron compiler; PyTorch models built for CUDA and TensorRT engines don't run natively. Migrating from Trainium back to CUDA-native environments requires recompilation and re-validation, adding 2–4 weeks of engineering time per major model update.

H100-based CUDA workloads remain the most portable: they run identically on CoreWeave, Lambda, GCP, AWS (p5 instances), Azure (ND H100 v5), and Alibaba Cloud GPU nodes. For APAC enterprises requiring multi-cloud failover — a hard requirement for iGaming licensees under PAGCOR, MGA, or Curaçao frameworks — H100 CUDA portability is a genuine operational advantage that Rubin and Trainium cannot yet match.
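The multi-cloud failover this portability enables can be sketched as a simple priority-ordered selection. This is a minimal illustration, not a production router: the provider labels, their ordering, and the health-check callable are all hypothetical, and a real deployment would replace the lambda with an actual probe of each inference endpoint.

```python
from typing import Callable, Optional, Sequence


def pick_provider(
    providers: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first provider, in priority order, that passes the health check."""
    for name in providers:
        if is_healthy(name):
            return name
    return None


# Priority order is an illustrative assumption; names are labels, not endpoints.
H100_PROVIDERS = ["coreweave", "aws-p5", "azure-nd-h100-v5", "gcp"]

# Simulated outage of the primary provider (hypothetical health data).
down = {"coreweave"}
chosen = pick_provider(H100_PROVIDERS, lambda name: name not in down)
print(chosen)
```

The same list has only one entry for Rubin today, which is the concentration risk described above: with a single provider, the function can only return that provider or nothing.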

LLM API Overhead: Llama 4 Scout and the Open-Source Shift

Meta's Llama 4 Scout has become the reference open-source long-context multimodal model for APAC enterprise deployments in 2025. At 17B active parameters (mixture-of-experts architecture), it delivers strong throughput-per-dollar on H100 clusters.
