AI giants call for energy grid kumbaya
(2025/08/22)
- Reference: 1755893291
- News link: https://www.theregister.co.uk/2025/08/22/microsoft_nvidia_openai_power_grid/
Researchers at Microsoft, Nvidia, and OpenAI have issued a call to designers of software, hardware, infrastructure, and utilities for help finding ways to normalize power demand during AI training.
Nearly 60 scientists at the three firms have co-authored a paper on the power management challenges of AI training workloads. Their concern is that the sharply fluctuating power demand of AI training threatens the stability of the electrical grid.
The paper, " [1]Power Stabilization for AI Training Datacenters ," argues that oscillating energy demand between the power-intensive GPU compute phase and the less-taxing communication phase, where parallelized GPU calculations get synchronized, represents a barrier to the development of AI models.
The authors note that the difference in power consumption between the compute and communication phases is extreme: the former approaches the GPU's thermal limits, while the latter draws little more than idle power.
Because AI training is synchronous, this variation in power demand occurs at the node (server) level and simultaneously across the other nodes in the datacenter. These oscillations therefore become visible at the rack, datacenter, and power grid levels – imagine 50,000 hairdryers (~2,000 watts each) being switched on at once.
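For scale, the arithmetic behind that analogy is worth making explicit. A quick sanity check in Python (the 2 kW per hairdryer figure is the article's own):

    # Back-of-the-envelope check on the hairdryer analogy
    hairdryers = 50_000
    watts_each = 2_000                        # ~2 kW per hairdryer
    swing_mw = hairdryers * watts_each / 1e6  # watts -> megawatts
    print(f"{swing_mw:.0f} MW")               # 100 MW of synchronized swing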
"At scale, these swings can amount to tens or hundreds of megawatts, occurring at frequencies that, if poorly aligned with the resonant characteristics of power grid components (e.g., turbine generators or long transmission lines), can risk grid instability and mechanical failure," the authors observe.
"These issues are not theoretical – multiple utility providers have now documented the impact of harmonics induced by synchronized computing loads."
Looking beyond just AI training, Schneider Electric expects the US grid [6]will become less stable by the end of the decade due to data center energy demand. A US Department of Energy [7]report published last December said, "data centers consumed about 4.4 percent of total US electricity in 2023 and are expected to consume approximately 6.7 to 12 percent of total US electricity by 2028."
The boffins from Microsoft, Nvidia, and OpenAI have kicked off the power stabilization party with an evaluation of three different strategies, each of which has pros and cons.
There are software-based approaches, which even out power usage by injecting secondary workloads whenever GPU activity falls below a certain threshold. The downsides are performance overhead, the need for collaboration between customers and cloud providers, and unreliability.
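A minimal sketch of that threshold-triggered filler idea, assuming invented names and thresholds (the paper describes the approach, not a specific implementation):

    import random
    import time

    POWER_FLOOR_W = 500     # illustrative per-GPU floor, not from the paper
    POLL_INTERVAL_S = 0.01

    def read_gpu_power() -> float:
        # Stand-in for real telemetry (e.g. NVML power readings); simulate
        # compute-phase peaks and near-idle communication-phase troughs.
        return random.choice([300.0, 900.0])

    def run_filler_kernel() -> None:
        # Stand-in for launching a low-priority kernel that burns power
        # without touching the training job's state.
        pass

    for _ in range(1000):   # bounded loop, for the sketch's sake
        if read_gpu_power() < POWER_FLOOR_W:
            run_filler_kernel()   # lift the trough toward the floor
        time.sleep(POLL_INTERVAL_S)

The unreliability the authors mention follows naturally: the filler work competes for the same hardware as the training job, and a polling loop can miss fast transients.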
GPU-level firmware features like power smoothing, supported on the Nvidia GB200, give developers and cloud providers a way to set a power utilization floor and to control ramp-up and ramp-down rates. But power smoothing imposes an extra energy cost, since the GPU is held above the power it actually needs.
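Conceptually, power smoothing behaves like a slew-rate limiter with a floor on the GPU's power trajectory. A toy model, with all parameters invented for illustration:

    # Toy model of ramp-rate limiting: the effective power level may only
    # move toward the requested draw by a bounded step per tick.
    def smooth(requested: list[float], floor: float, max_step: float) -> list[float]:
        out, level = [], floor
        for want in requested:
            target = max(want, floor)     # enforce the utilization floor
            step = max(-max_step, min(max_step, target - level))
            level += step                 # bounded ramp-up / ramp-down
            out.append(level)
        return out

    # Square-wave demand alternating compute and communication phases (watts):
    demand = [900, 900, 100, 100, 900, 900, 100, 100]
    print(smooth(demand, floor=400, max_step=150))
    # [550, 700, 550, 400, 550, 700, 550, 400] - swing shrinks from 800 W to 300 W

The extra energy cost shows up directly in the model: during communication phases the level is held at 400 W or more despite only 100 W of real demand.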
And datacenter-level capabilities like battery energy storage systems (BESS) offer a mechanism for handling power demand spikes locally, without burdening the utility grid. But energy storage hardware can be expensive.
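To make the battery's peak-shaving role concrete, here's a toy model (capacities, caps, and units chosen purely for illustration) in which local storage absorbs the compute-phase spikes so the grid sees a near-flat draw:

    # Toy peak-shaving model: the battery serves demand above `grid_cap`
    # and recharges from spare headroom, flattening what the grid sees.
    def grid_draw(demand: list[float], grid_cap: float, capacity_j: float,
                  dt: float = 1.0) -> list[float]:
        stored = capacity_j                   # joules, battery starts full
        seen = []
        for d in demand:
            if d > grid_cap:                  # discharge to shave the peak
                from_batt = min(d - grid_cap, stored / dt)
                stored -= from_batt * dt
                seen.append(d - from_batt)
            else:                             # recharge with spare headroom
                recharge = min(grid_cap - d, (capacity_j - stored) / dt)
                stored += recharge * dt
                seen.append(d + recharge)
        return seen

    demand = [900, 900, 100, 100, 900, 900]   # watts, per the phase swing
    print(grid_draw(demand, grid_cap=600, capacity_j=2000))
    # [600, 600, 600, 200, 600, 600] - the grid mostly sees a flat 600 W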
While these options are available today, the researchers argue that an optimal solution involves a combination of all three techniques. And, to make that a viable option, the folks at Microsoft, Nvidia, and OpenAI are asking for more coordination among vendors, so that rack-level energy storage and GPUs can communicate about workload state changes.
Specifically, the researchers want AI framework and system designers to focus on training algorithms that are asynchronous and power-aware; utility and grid operators to share resonance and ramp specifications and to standardize communication channels with datacenter operators; and the tech industry to establish interoperable standards for telemetry, load signaling, and sub-synchronous oscillation mitigation.
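No such standard exists yet, so purely as a thought experiment, a load-signaling record of the kind a datacenter might publish to its utility could look something like this (every field name here is hypothetical):

    import json

    # Hypothetical telemetry payload; the paper calls for interoperable
    # standards but does not define a schema, so these fields are invented.
    load_signal = {
        "site_id": "dc-example-01",       # invented identifier
        "phase": "communication",         # current training phase
        "expected_ramp_mw_per_s": -0.8,   # imminent downward ramp
        "oscillation_hz": 0.25,           # dominant swing frequency
        "bess_headroom_mwh": 1.2,         # local storage available
    }
    print(json.dumps(load_signal, indent=2))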
"Together, we can design for a future where AI training is not only powerful, but also power-aware," the authors conclude. ®
[1] https://arxiv.org/abs/2508.14318
[6] https://www.theregister.com/2025/06/03/schneider_electric_says_us_grid/
[7] https://www.energy.gov/articles/doe-releases-new-report-evaluating-increase-electricity-demand-data-centers
We could call it
Load balancing.
Then boast we've invented a never-before-seen concept, where we even manage to lower total costs by ensuring that all the GPUs (and everything else) run at a continuous load without any downtime.
Or, and hear me out on this, we could even break the work up into chunks that can be worked on independently within one GPU and then start them off with a slight delay between each one, and, now this is really wild, give each GPU the next work packet (neat name, huh?) as well, so if it ends early it can carry on doing something until the time comes for its communication phase.
I'd like to call this idea "scheduling", if everyone is in agreement? And because each GPU is doing the same job but at slightly different times, maybe they are "asynchronous" but still sort of "processing in parallel"?
Great, great, we'll issue the press release tomorrow. Just want to let you guys know, I love working with you, inventing new solutions that nobody has ever thought about before.