Data Center Cooling Basics

The rapid growth of AI has fundamentally shifted the power landscape of the data center. Traditional servers that ran on a few hundred watts are giving way to AI “beasts” in which a single chip can consume more power than an entire legacy server.

Server-Level Power Consumption

Standard servers are built for sequential tasks (web hosting, databases), while AI servers are built for massive parallel processing (training LLMs, image generation).

Standard CPU Server: Typically consumes 600–750 Watts. Historically, individual CPUs ran at roughly 150–200 Watts.
New AI GPU Server: A single high-performance AI server node (e.g., an 8-GPU chassis) can consume 10–15 Kilowatts (kW).
- GPU Power Surge: Modern AI chips now draw 700–1,200 Watts each, compared to just 400 Watts in 2022.
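
As a rough check on the figures above, the sketch below multiplies the quoted per-chip draw by the eight GPUs in the example chassis. It only covers the GPUs; the assumption here is that the rest of the 10–15 kW node budget goes to CPUs, memory, networking, fans, and power conversion losses.

```python
# Illustrative arithmetic only; the wattages are the ranges quoted in the text.
GPUS_PER_NODE = 8  # the 8-GPU chassis used as the example above


def gpu_power_kw(watts_per_gpu: float, gpus: int = GPUS_PER_NODE) -> float:
    """Total GPU draw for one server node, in kilowatts."""
    return watts_per_gpu * gpus / 1000.0


low_kw = gpu_power_kw(700)     # lower end of the per-chip range
high_kw = gpu_power_kw(1200)   # upper end of the per-chip range
print(f"GPUs alone: {low_kw:.1f} to {high_kw:.1f} kW per {GPUS_PER_NODE}-GPU node")
```

On these numbers the GPUs account for roughly 5.6 to 9.6 kW of each node, which is consistent with a 10–15 kW total.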

The “New” AI Chips (GPUs & Accelerators)

NVIDIA, AMD, and Intel are locked in a race where performance gains are often accompanied by significant jumps in Thermal Design Power (TDP).

Chip Model	Architecture	Max Power (TDP)	Key Highlight
NVIDIA H100 / H200	Hopper	700W	The current industry standard for LLM training.
NVIDIA Blackwell B200	Blackwell	1,000W – 1,200W	Delivers up to 15x inference performance of H100.
NVIDIA GB200 (Superchip)	Grace-Blackwell	~2,700W	Pairs 1 Grace CPU with 2 Blackwell GPUs on one board.
AMD Instinct MI300X	CDNA 3	750W	Features massive 192GB HBM3 memory.
AMD Instinct MI355X	CDNA 4	1,000W – 1,400W	Future flagship; expected to require liquid cooling.
Intel Gaudi 3	Gaudi 3	600W – 900W	Focuses on inference efficiency and cost-effectiveness.

Rack-Level Density: From Kilowatts to Megawatts

The “rack” is where the power density becomes a massive engineering challenge.

Traditional Racks: Average 5–15 kW per rack. These are easily cooled by standard air conditioning (CRAC) units.
Current AI Racks (2024-2025): Average 40–80 kW, with high-end clusters reaching 100–140 kW.
- NVIDIA’s GB200 NVL72 rack-scale system is rated for 132 kW peak power.
Next-Gen AI Racks (2026+): Designs are pushing toward 200–350 kW per rack.
- Longer term, future configurations (like Blackwell Ultra or Rubin) are reportedly targeting densities as high as 900 kW to 1 Megawatt (MW) per rack.
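
To make the density figures concrete, the following sketch counts how many AI server nodes fit inside a given rack power budget. The 12.5 kW node figure is an assumed midpoint of the 10–15 kW range quoted earlier, and the calculation ignores physical space, weight, and cooling limits.

```python
import math


def nodes_per_rack(rack_budget_kw: float, node_kw: float) -> int:
    """Whole server nodes that fit within a rack's power budget (power only)."""
    if node_kw <= 0:
        raise ValueError("node_kw must be positive")
    return math.floor(rack_budget_kw / node_kw)


ASSUMED_NODE_KW = 12.5  # assumption: midpoint of the 10-15 kW range

for label, budget_kw in [("traditional", 15), ("current AI", 80), ("high-end AI", 140)]:
    count = nodes_per_rack(budget_kw, ASSUMED_NODE_KW)
    print(f"{label:>12} rack ({budget_kw} kW): {count} node(s)")
```

Under these assumptions a traditional 15 kW rack powers a single AI node, which is why AI deployments need rack designs with far larger power budgets.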

Summary of Infrastructure Changes

As racks cross roughly 30 kW, traditional air cooling alone becomes impractical. Most modern AI data centers therefore integrate:

Direct-to-Chip (Liquid) Cooling: Bringing coolant directly to the 1,000W+ chips.
Rear-Door Heat Exchangers (RDHx): Capturing heat at the back of the rack.
Reinforced Floors: AI racks are much heavier (up to 5,000 lbs) due to high-density hardware and liquid cooling components.
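
One way to see why air cooling runs out of headroom is to estimate the airflow needed to carry a rack's heat away. The sketch below uses the standard sensible-heat relation (airflow = heat / (density × specific heat × temperature rise)). The air properties and the 12 K inlet-to-outlet temperature rise are assumptions chosen for illustration, not values from the text.

```python
# Assumed constants: dry air at roughly 20 °C and sea level.
AIR_DENSITY_KG_M3 = 1.2
AIR_SPECIFIC_HEAT_J_KG_K = 1005.0
M3_PER_S_TO_CFM = 2118.88  # 1 m^3/s expressed in cubic feet per minute


def required_airflow_m3s(heat_load_w: float, delta_t_k: float) -> float:
    """Volumetric airflow needed to absorb heat_load_w with a delta_t_k rise."""
    if delta_t_k <= 0:
        raise ValueError("delta_t_k must be positive")
    return heat_load_w / (AIR_DENSITY_KG_M3 * AIR_SPECIFIC_HEAT_J_KG_K * delta_t_k)


ASSUMED_DELTA_T_K = 12.0  # assumption: air temperature rise across the rack

for rack_kw in (10, 30, 132):
    flow = required_airflow_m3s(rack_kw * 1000, ASSUMED_DELTA_T_K)
    print(f"{rack_kw:>4} kW rack: {flow:5.2f} m^3/s ({flow * M3_PER_S_TO_CFM:,.0f} CFM)")
```

With these assumptions a 30 kW rack already needs about 2 m^3/s (over 4,000 CFM) of air, and the requirement scales linearly with rack power.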


Data Center Design Fundamentals

A well-designed data center balances performance, reliability, and efficiency. Design is typically categorized into two main physical areas: White Space (where IT equipment like servers and racks reside) and Gray Space (where back-end infrastructure like UPS systems and chillers are located).

Redundancy and Reliability

Data centers are often rated by their redundancy levels to ensure continuous operation even during component failures. The industry standard follows the “N” notation:

Level	Description	Reliability Impact
N	Base requirement	No redundancy; any failure causes downtime.
N+1	One extra component	Allows for one component failure or maintenance without downtime.
2N	Fully redundant	Two independent systems; one can fail entirely without impact.
2N+1	Fault tolerant + extra	The highest level of reliability for mission-critical facilities.
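
The notation can be turned into a unit count for a given load. This is a minimal sketch: the function name and the example load are hypothetical, and it reads 2N+1 as two full sets of N plus one spare (some designs instead mean two N+1 systems).

```python
import math


def units_required(load_kw: float, unit_capacity_kw: float, scheme: str) -> int:
    """Number of units (e.g. UPS modules) needed under a redundancy scheme."""
    if load_kw <= 0 or unit_capacity_kw <= 0:
        raise ValueError("load and unit capacity must be positive")
    n = math.ceil(load_kw / unit_capacity_kw)  # N: just enough to carry the load
    schemes = {
        "N": n,
        "N+1": n + 1,
        "2N": 2 * n,
        "2N+1": 2 * n + 1,  # assumption: 2 x N plus one spare
    }
    if scheme not in schemes:
        raise ValueError(f"unknown scheme: {scheme}")
    return schemes[scheme]


# Hypothetical example: 1,000 kW critical load served by 250 kW modules.
for scheme in ("N", "N+1", "2N", "2N+1"):
    print(f"{scheme:>5}: {units_required(1000, 250, scheme)} modules")
```

For this example the counts are 4, 5, 8, and 9 modules, which shows how quickly equipment cost grows with each redundancy level.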

Industry Standards

Professional data center design is guided by several global standards:

TIA-942: Focuses on telecommunications infrastructure and site space.
BICSI-002: Provides best practices for all aspects of data center design.
Uptime Institute Tiers: A 4-tier system (Tier I to Tier IV) measuring site availability and fault tolerance.
ASHRAE TC 9.9: The ASHRAE Technical Committee that provides thermal guidelines for data processing environments.

Cooling Technologies: Air vs. Liquid

Air Cooling

Liquid Cooling

Comparison of Cooling Methods

Feature	Air Cooling	Liquid Cooling
Mechanism	Uses ambient air and fans to dissipate heat.	Uses water or specialized coolants for heat absorption.
Efficiency	Lower; limited by air’s thermal capacity.	Higher; water can carry roughly 3,000x more heat per unit volume than air.
Complexity	Simpler and cost-effective for low density.	More complex; requires specialized plumbing and CDU.
Best For	Standard enterprise workloads.	AI, Machine Learning, and HPC workloads.
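
The efficiency gap in the table can be sanity-checked by comparing volumetric heat capacity, that is, how much heat one cubic meter of each fluid absorbs per degree of temperature rise. The property values below are approximate textbook figures for water and air near room temperature, used here as assumptions.

```python
# Approximate fluid properties near room temperature (assumed values).
WATER_DENSITY_KG_M3 = 997.0
WATER_SPECIFIC_HEAT_J_KG_K = 4186.0
AIR_DENSITY_KG_M3 = 1.2
AIR_SPECIFIC_HEAT_J_KG_K = 1005.0


def volumetric_heat_capacity(density_kg_m3: float, specific_heat_j_kg_k: float) -> float:
    """Heat absorbed per cubic meter per kelvin, in J/(m^3*K)."""
    return density_kg_m3 * specific_heat_j_kg_k


water = volumetric_heat_capacity(WATER_DENSITY_KG_M3, WATER_SPECIFIC_HEAT_J_KG_K)
air = volumetric_heat_capacity(AIR_DENSITY_KG_M3, AIR_SPECIFIC_HEAT_J_KG_K)
ratio = water / air
print(f"Water holds about {ratio:,.0f}x more heat per unit volume than air")
```

The ratio comes out in the low thousands, in line with the order of magnitude quoted in the table.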

Maintaining the right environmental conditions is essential. ASHRAE Technical Committee 9.9 (TC 9.9) recommends keeping the air entering IT equipment between 18°C and 27°C (64.4°F to 80.6°F). As power densities increase, the industry is shifting from traditional air cooling to advanced liquid solutions.
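
A monitoring script might flag readings that fall outside this range. The sketch below is a minimal illustration, assuming the readings are server inlet air temperatures in Celsius; the function names and sample readings are hypothetical.

```python
RECOMMENDED_MIN_C = 18.0  # lower bound quoted in the text
RECOMMENDED_MAX_C = 27.0  # upper bound quoted in the text


def celsius_to_fahrenheit(temp_c: float) -> float:
    """Convert a temperature from Celsius to Fahrenheit."""
    return temp_c * 9.0 / 5.0 + 32.0


def in_recommended_range(inlet_c: float) -> bool:
    """True if an inlet temperature lies within the recommended range."""
    return RECOMMENDED_MIN_C <= inlet_c <= RECOMMENDED_MAX_C


# Hypothetical sensor readings in Celsius.
for reading_c in (16.5, 22.0, 29.0):
    status = "ok" if in_recommended_range(reading_c) else "out of range"
    print(f"{reading_c:.1f} C ({celsius_to_fahrenheit(reading_c):.1f} F): {status}")
```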

Liquid Cooling: CDU, TCS, and FWS

For high-density AI workloads, a single cooling loop is often insufficient. Modern liquid cooling architectures utilize a Coolant Distribution Unit (CDU) to manage two distinct loops, ensuring the highest water quality and precision control.

The CDU acts as the “middle-man” between the facility’s main water supply and the sensitive IT equipment. It manages the heat exchange between two loops:

Facility Water System (FWS): The primary loop that runs throughout the building, carrying heat away to external chillers or cooling towers.
Technology Cooling System (TCS): The secondary, high-purity loop that circulates directly through cold plates or immersion tanks. This loop uses treated water or a water-glycol mix to prevent corrosion and clogging in micro-channels.

Why Two Loops? Most facility water (FWS) is not clean enough to circulate through the fine micro-channels of cold plates. The CDU isolates the loops, allowing the TCS to stay filtered and chemically treated while the CDU provides precision control over pressure, flow, and temperature. That control is critical for AI workloads, whose power consumption can spike within milliseconds.
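
The flow a CDU must deliver on the TCS side can be estimated from the same heat balance used for air (flow = heat / (specific heat × temperature rise)). This sketch assumes plain water properties and a 10 K coolant temperature rise; a water-glycol mix has a lower specific heat and would need somewhat more flow.

```python
# Assumed coolant properties: plain water near room temperature.
WATER_DENSITY_KG_M3 = 997.0
WATER_SPECIFIC_HEAT_J_KG_K = 4186.0


def coolant_flow_l_per_min(heat_load_w: float, delta_t_k: float) -> float:
    """Coolant flow (liters/minute) needed to absorb heat_load_w with a delta_t_k rise."""
    if delta_t_k <= 0:
        raise ValueError("delta_t_k must be positive")
    mass_flow_kg_s = heat_load_w / (WATER_SPECIFIC_HEAT_J_KG_K * delta_t_k)
    volume_flow_m3_s = mass_flow_kg_s / WATER_DENSITY_KG_M3
    return volume_flow_m3_s * 1000.0 * 60.0  # m^3/s -> L/min


ASSUMED_DELTA_T_K = 10.0  # assumption: supply-to-return temperature rise

flow = coolant_flow_l_per_min(132_000, ASSUMED_DELTA_T_K)  # 132 kW rack
print(f"132 kW rack: about {flow:.0f} L/min of coolant")
```

Under these assumptions a 132 kW rack needs roughly 190 L/min, a small fraction of the air volume that the same heat load would require.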

Power Usage Effectiveness (PUE)

Power Usage Effectiveness (PUE) is the industry-standard metric for measuring the energy efficiency of a data center. It was developed by The Green Grid and is defined as the ratio of the total amount of energy used by a data center facility to the energy delivered to computing equipment.

PUE Formula:

PUE = Total Facility Energy / IT Equipment Energy

What Does PUE Mean?

1.0 (Perfect Efficiency): A theoretical ideal where all power goes directly to IT equipment (servers, storage, networking) with no energy wasted on cooling, lighting, or power distribution.
Average PUE (~1.58): According to Uptime Institute, the average PUE for data centers in 2020 was 1.58, meaning that for every 1 kWh delivered to IT equipment, roughly 0.58 kWh more is consumed by supporting infrastructure.
Inefficient (≥2.0): Older facilities often operate at 2.0 or higher, meaning at least as much energy is spent on cooling and power losses as is used for computing.
Best-in-Class (<1.2): Hyperscale providers like Google or Meta report PUEs of around 1.1 or lower, achieved through advanced cooling (such as liquid cooling) and efficient facility design.
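
The bands above can be expressed as a small helper. This is a sketch only: the "typical" label for values between the two quoted thresholds is an assumption, as are the function names.

```python
def pue(total_facility_kwh: float, it_kwh: float) -> float:
    """PUE = total facility energy / IT equipment energy."""
    if it_kwh <= 0:
        raise ValueError("it_kwh must be positive")
    return total_facility_kwh / it_kwh


def overhead_per_it_kwh(pue_value: float) -> float:
    """Infrastructure energy consumed for every 1 kWh delivered to IT equipment."""
    return pue_value - 1.0


def classify(pue_value: float) -> str:
    """Map a PUE value onto the bands described in the text."""
    if pue_value < 1.0:
        raise ValueError("PUE cannot be below 1.0")
    if pue_value < 1.2:
        return "best-in-class"
    if pue_value >= 2.0:
        return "inefficient"
    return "typical"  # assumed label for the range in between


value = pue(total_facility_kwh=1580.0, it_kwh=1000.0)
print(f"PUE {value:.2f}: {classify(value)}, "
      f"{overhead_per_it_kwh(value):.2f} kWh overhead per IT kWh")
```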

Key Components of PUE

IT Load: The energy consumed by servers, networking equipment, and storage devices.
Cooling Systems: Chillers, cooling towers, fans, and CRAC (Computer Room Air Conditioning) units, which are often the largest source of non-IT energy consumption.
Power Distribution Loss: Energy lost during conversion in Uninterruptible Power Supply (UPS) units, transformers, and Power Distribution Units (PDUs).
Lighting and Security: Ancillary systems, including lighting and monitoring equipment.
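
When the components are metered separately, PUE and each component's share of the overhead follow directly. The load figures below are made-up illustrative values, not measurements from any facility.

```python
# Hypothetical average loads in kW, chosen purely for illustration.
loads_kw = {
    "it": 1000.0,
    "cooling": 400.0,
    "power_distribution": 120.0,
    "lighting_security": 30.0,
}


def pue_from_breakdown(loads: dict) -> float:
    """PUE computed from per-component loads; the 'it' key is the IT load."""
    return sum(loads.values()) / loads["it"]


def overhead_shares(loads: dict) -> dict:
    """Fraction of non-IT energy attributable to each non-IT component."""
    overhead = sum(v for k, v in loads.items() if k != "it")
    return {k: v / overhead for k, v in loads.items() if k != "it"}


print(f"PUE: {pue_from_breakdown(loads_kw):.2f}")
for name, share in overhead_shares(loads_kw).items():
    print(f"  {name}: {share:.0%} of overhead")
```

In this example cooling dominates the overhead, which matches the observation that cooling is often the largest non-IT consumer.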

Strategies to Improve PUE

Cold/Hot Aisle Containment: Physically separating cold intake air from hot exhaust air reduces air mixing, significantly improving cooling efficiency.
Liquid Cooling: Direct-to-chip or immersion cooling technologies can substantially cut cooling-related electricity usage compared to traditional air cooling; reductions of up to 95% are cited for the most favorable deployments.
Free Cooling: Leveraging ambient outside air instead of mechanical chillers in cooler climates.
Virtualization: Consolidating multiple virtual servers on fewer physical machines reduces both IT energy consumption and cooling loads.
Upgrading UPS Systems: Replacing legacy UPS systems with modern, high-efficiency models minimizes power distribution losses.
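
The payoff from these strategies can be estimated by comparing annual facility energy before and after a PUE improvement. The sketch assumes a constant IT load all year, which real facilities rarely have, and the 1 MW load is a hypothetical example.

```python
HOURS_PER_YEAR = 8760


def annual_facility_kwh(it_load_kw: float, pue_value: float) -> float:
    """Total facility energy per year for a constant IT load."""
    return it_load_kw * pue_value * HOURS_PER_YEAR


def annual_savings_kwh(it_load_kw: float, pue_before: float, pue_after: float) -> float:
    """Energy saved per year by moving from pue_before to pue_after."""
    return (annual_facility_kwh(it_load_kw, pue_before)
            - annual_facility_kwh(it_load_kw, pue_after))


# Hypothetical 1 MW IT load improving from the 1.58 average to 1.2.
saved = annual_savings_kwh(it_load_kw=1000.0, pue_before=1.58, pue_after=1.2)
print(f"Estimated savings: {saved / 1e6:.2f} GWh per year")
```

For this example the improvement saves about 3.3 GWh per year without any change to the IT workload itself.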

Limitations of PUE

While PUE is essential, it is not a complete measure of sustainability.

Doesn’t Measure IT Workload Efficiency: A data center can have a low PUE but still be inefficient if the servers are idling or running underutilized apps.
Doesn’t Consider Energy Source: A data center powered by coal with a 1.2 PUE can have a far larger carbon footprint than one powered by renewables with a 1.4 PUE.
Regional Variations: Climate impacts cooling needs, making direct PUE comparisons between data centers in different regions unfair.

Related Metrics

DCiE (Data Center Infrastructure Efficiency): The reciprocal of PUE, expressed as a percentage (DCiE = 1 / PUE × 100%).
WUE (Water Usage Effectiveness): Measures water consumption relative to IT equipment energy, especially critical in water-scarce regions.
CUE (Carbon Usage Effectiveness): Measures total carbon emissions relative to IT energy consumption.
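
The three metrics can be sketched as simple ratios. When all facility energy comes from one source, CUE reduces to that source's carbon emission factor multiplied by PUE. The emission factors below are rough placeholder assumptions used only to illustrate the energy-source point made earlier, not authoritative values.

```python
def dcie_percent(pue_value: float) -> float:
    """DCiE: reciprocal of PUE, as a percentage."""
    return 100.0 / pue_value


def wue(water_liters: float, it_kwh: float) -> float:
    """WUE: liters of water consumed per kWh of IT energy."""
    return water_liters / it_kwh


def cue(emission_factor_kg_per_kwh: float, pue_value: float) -> float:
    """CUE: kg CO2e per kWh of IT energy, for a single energy source."""
    return emission_factor_kg_per_kwh * pue_value


# Assumed placeholder emission factors in kg CO2e per kWh.
ASSUMED_COAL_FACTOR = 0.9
ASSUMED_RENEWABLE_FACTOR = 0.05

coal_site = cue(ASSUMED_COAL_FACTOR, pue_value=1.2)
renewable_site = cue(ASSUMED_RENEWABLE_FACTOR, pue_value=1.4)
print(f"DCiE at PUE 1.58: {dcie_percent(1.58):.1f}%")
print(f"CUE, coal at PUE 1.2: {coal_site:.2f} kg CO2e per IT kWh")
print(f"CUE, renewables at PUE 1.4: {renewable_site:.2f} kg CO2e per IT kWh")
```

Even with the better PUE, the coal-powered site has a much higher CUE under these assumed factors, which is why PUE alone does not capture sustainability.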

Data Center Abbreviations Glossary

The data center industry uses a unique vocabulary. Below is a comprehensive list of the most common abbreviations used in design and operations.

Abbreviation	Full Term	Definition
AHU	Air Handling Unit	A device used to regulate and circulate air as part of an HVAC system.
ASHRAE	American Society of Heating, Refrigerating and Air-Conditioning Engineers	The global technical society for HVAC&R.
ATS	Automatic Transfer Switch	Automatically switches power to a backup source during an outage.
BMS	Building Management System	Controls mechanical and electrical equipment like HVAC and power.
CDU	Coolant Distribution Unit	Manages the flow and temperature of coolant in liquid cooling systems.
CFD	Computational Fluid Dynamics	Software used to model airflow and heat within the data center.
CRAC	Computer Room Air Conditioning	Traditional units that use refrigerant to cool the room.
CRAH	Computer Room Air Handling	Units that use chilled water to cool large-scale deployments.
D2C	Direct-to-Chip	Cooling liquid delivered directly to the processor via cold plates.
DCIM	Data Center Infrastructure Management	Tools for monitoring and managing facility infrastructure.
EPO	Emergency Power Off	A safety system to rapidly shut down power in an emergency.
FWS	Facility Water System	The primary cooling loop that connects the data center to the building’s chillers.
FWU	Fan Wall Unit	A large-scale air circulation system that provides uniform airflow.
HPC	High-Performance Computing	Using supercomputers and parallel processing for complex tasks.
HVAC	Heating, Ventilation, and Air Conditioning	The systems used to control temperature and humidity.
MDF	Main Distribution Frame	The central point for connecting external and internal network lines.
MMR	Meet Me Room	A secure space where different providers connect their networks.
PDU	Power Distribution Unit	A device with multiple outlets to distribute power to server racks.
PUE	Power Usage Effectiveness	The ratio of total facility power to IT equipment power (Goal: closer to 1.0).
RDHx	Rear Door Heat Exchanger	A cooling coil mounted on the back of a server rack.
SAN	Storage Area Network	A specialized, high-speed network for block-level data storage.
STS	Static Transfer Switch	Uses power electronics to switch between two power sources instantly.
TCS	Technology Cooling System	The secondary, high-purity cooling loop that directly cools IT equipment.
UPS	Uninterruptible Power Supply	Provides battery backup when the primary power source fails.
VFD	Variable Frequency Drive	Controls motor speed to save energy in fans and pumps.
VLAN	Virtual Local Area Network	A logical subnetwork that groups together a collection of devices.
WUE	Water Usage Effectiveness	Measures the efficiency of water used for cooling.