June 26, 2026
When training Large Language Models (LLMs) with tens or hundreds of billions of parameters, the core pain point of AI compute centers has gradually shifted from raw compute capacity to high-density thermal management. When multiple GPUs run at full load for extended periods within a confined $4U$ chassis, the Thermal Design Power (TDP) of a single chip can exceed $700W$. Because air has a low specific heat capacity and low density, traditional air cooling tends to let heat accumulate inside the server, triggering the chip's thermal throttling mechanism and directly compromising the continuity and consistency of computing output.
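The heat-capacity limitation can be made concrete with the steady-state relation $Q = \dot{m} \, c_p \, \Delta T$. The sketch below is a rough illustration, not a vendor calculation: the fluid properties are approximate room-temperature textbook values, the $10$ K coolant temperature rise is an assumed figure, and water stands in for whatever coolant mixture is actually used. Only the $700W$ heat load comes from the text.

```python
# Approximate room-temperature properties (textbook values, assumed here).
AIR = {"density_kg_m3": 1.2, "cp_j_per_kg_k": 1005.0}
WATER = {"density_kg_m3": 998.0, "cp_j_per_kg_k": 4186.0}


def volumetric_flow_m3_s(heat_w, delta_t_k, fluid):
    """Volumetric flow needed to carry heat_w with a fluid temperature rise
    of delta_t_k, from Q = m_dot * cp * dT."""
    if heat_w < 0 or delta_t_k <= 0:
        raise ValueError("heat_w must be >= 0 and delta_t_k must be > 0")
    mass_flow_kg_s = heat_w / (fluid["cp_j_per_kg_k"] * delta_t_k)
    return mass_flow_kg_s / fluid["density_kg_m3"]


heat_w = 700.0     # per-chip heat load quoted in the text
delta_t_k = 10.0   # assumed coolant temperature rise (hypothetical)

air_flow = volumetric_flow_m3_s(heat_w, delta_t_k, AIR)
water_flow = volumetric_flow_m3_s(heat_w, delta_t_k, WATER)

print(f"air:   {air_flow * 1000:.1f} L/s")
print(f"water: {water_flow * 60000:.2f} L/min")
print(f"air needs about {air_flow / water_flow:.0f}x the volume flow of water")
```

Under these assumptions a single chip needs on the order of tens of litres of air per second but only about a litre of water per minute, which is why the gap widens quickly as several such chips share one chassis.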
To address this common industry challenge, Supermicro has introduced a deeply customized Cold Plate Liquid Cooling solution into its next-generation high-density GPU servers. This solution focuses on precise temperature control for high-heat-generation components.
Material & Craftsmanship: High-thermal-conductivity pure copper cold plates are applied directly to the surfaces of GPU chips and High Bandwidth Memory (HBM). The micro-channel structural design increases the thermal dissipation contact area by $35\%$.
Working Conditions: Under harsh industrial conditions with an inlet coolant temperature of $32^\circ\text{C}$, the system stably maintains the core GPU operating temperature below $75^\circ\text{C}$.
Reliability Standards: The pipeline interfaces use aviation-grade non-drip quick connectors that have passed rigorous pressure-pulse and anti-corrosion testing, guarding against coolant leakage over the IT equipment's operational lifespan.
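The working-condition figures imply a thermal budget that can be estimated with a one-line calculation. The sketch below is a simplification rather than a published specification: it assumes the full $700W$ per-chip load quoted earlier flows into the coolant and treats the path from GPU core to inlet coolant as a single lumped thermal resistance. The function name is hypothetical.

```python
def max_thermal_resistance_k_per_w(t_chip_max_c, t_inlet_c, power_w):
    """Largest lumped chip-to-coolant thermal resistance (K/W) that keeps the
    chip at or below t_chip_max_c for a given inlet temperature and heat load."""
    if power_w <= 0:
        raise ValueError("power_w must be positive")
    return (t_chip_max_c - t_inlet_c) / power_w


# 75 C ceiling and 32 C inlet are from the text; 700 W is the per-chip figure
# quoted earlier and is assumed here to be dissipated entirely into the coolant.
budget = max_thermal_resistance_k_per_w(75.0, 32.0, 700.0)
print(f"thermal resistance budget: {budget:.4f} K/W")
```

With these inputs the budget works out to roughly $0.06$ K/W, which gives a sense of how little thermal resistance the cold plate, interface material, and coolant loop can add in total.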
By deploying Supermicro liquid-cooled servers, data centers can drastically cut the power drawn by high-wattage cooling fans. Experimental data indicates that, under identical AI computing workloads, liquid cooling helps lower the overall data center Power Usage Effectiveness (PUE) from $1.6$ under traditional air cooling to approximately $1.15$. This keeps large-scale distributed training clusters running stably while significantly reducing long-term operational electricity costs.
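PUE is defined as total facility power divided by IT equipment power, so the quoted figures translate directly into facility-level savings. The sketch below uses a hypothetical $1000$ kW IT load and assumes it runs continuously all year at the same draw under both cooling schemes. Only the PUE values of $1.6$ and $1.15$ come from the text.

```python
HOURS_PER_YEAR = 8760


def facility_power_kw(it_power_kw, pue):
    """Total facility power implied by PUE = facility power / IT power."""
    if it_power_kw < 0 or pue < 1.0:
        raise ValueError("it_power_kw must be >= 0 and pue must be >= 1.0")
    return it_power_kw * pue


it_load_kw = 1000.0  # hypothetical IT load, for illustration only

air_kw = facility_power_kw(it_load_kw, 1.6)      # PUE quoted for air cooling
liquid_kw = facility_power_kw(it_load_kw, 1.15)  # PUE quoted for liquid cooling

saved_kw = air_kw - liquid_kw
saved_fraction = saved_kw / air_kw
overhead_cut = 1 - (liquid_kw - it_load_kw) / (air_kw - it_load_kw)
saved_kwh_per_year = saved_kw * HOURS_PER_YEAR

print(f"facility power: {air_kw:.0f} kW -> {liquid_kw:.0f} kW")
print(f"total power saved: {saved_fraction:.1%}")
print(f"non-IT overhead reduced by: {overhead_cut:.0%}")
print(f"energy saved per year: {saved_kwh_per_year:,.0f} kWh")
```

For a fixed IT load, moving from a PUE of $1.6$ to $1.15$ cuts total facility power by about $28\%$ and non-IT overhead by about $75\%$. The absolute savings scale linearly with the assumed load.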