NREL’s New High-Performance Computer Has a Tremendous Supporting Cast
The arrival of Kestrel, the U.S. Department of Energy (DOE) Office of Energy Efficiency and Renewable Energy's (EERE's) newest—and fastest—supercomputer, is a major event, and bringing it to the National Renewable Energy Laboratory (NREL) was an equally major undertaking. With more than five times the computational capability of NREL's Eagle supercomputer, Kestrel required a coordinated team to make sure its build and arrival went smoothly. And it is here.
Envision the Possibilities
When Advanced Computing Operations Group Manager Aaron Andersen arrived at NREL, planning for NREL's next supercomputer was already well underway, and he was tasked with evaluating the vendor proposals along with a technical evaluation team.
To build such a system takes imagination, as the hardware—while planned—did not yet exist. Andersen worked with the technical evaluation team to review projected performance from multiple vendors.
“It’s challenging because often they are benchmarking on hardware that doesn't exist yet, so they will provide current generation processors and memory configurations, and then they'll provide some kind of estimate as to how much faster the new system will be,” he said.
They ran benchmarks, determined what was reasonable and what was not, and moved forward to award a contract to Hewlett Packard Enterprise (HPE) to build a system that met NREL’s needs. Andersen and a team later visited the HPE factory in Chippewa Falls, Wisconsin, where they ran the benchmarks to ensure performance requirements were met.
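Those projections are, at heart, simple arithmetic: take a measured runtime on current-generation hardware and scale it by the vendor's estimated speedup, with some margin for uncertainty. The sketch below is purely illustrative; the runtimes, speedup factor, and uncertainty band are hypothetical placeholders, not figures from the Kestrel procurement.

```python
# A minimal sketch (not NREL's actual evaluation tooling) of projecting a
# measured benchmark on current-generation hardware onto proposed hardware
# using a vendor-supplied speedup estimate. All numbers are placeholders.

measured_runtime_s = 1200.0      # hypothetical benchmark runtime on current-gen nodes
vendor_speedup_estimate = 2.3    # hypothetical vendor claim for next-gen hardware
uncertainty = 0.15               # hedging band applied to the vendor estimate

projected = measured_runtime_s / vendor_speedup_estimate
low = measured_runtime_s / (vendor_speedup_estimate * (1 + uncertainty))
high = measured_runtime_s / (vendor_speedup_estimate * (1 - uncertainty))

print(f"Projected runtime: {projected:.0f} s (range {low:.0f}-{high:.0f} s)")
```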
Changes to the data center were essential to get Kestrel running. “We needed 2.5-megawatt upgrades to support Kestrel,” Andersen said.
After supply chain delays, the NREL team convened on a snowy March morning to greet five semitrailer trucks carrying Kestrel’s components. The operations team began moving equipment from the trucks into NREL's Energy Systems Integration Facility (ESIF) HPC data center. They moved Kestrel’s central processing unit (CPU) nodes into the facility and conducted early power and cooling tests. They cabled the system and connected compute nodes to the high-speed network and storage.
It Takes a Village
A system like Kestrel does not just show up and get plugged in. It takes many people doing different things—from running power lines to plumbing for cooling—to get the supercomputer up and running. Andersen said that the team worked with HPE to iron out the details of the number of cabinets and circuits per cabinet, how water cooling integrates with the facility, and various other technical details.
Surendra Sunkari, an NREL HPC engineer who was involved in the procurement effort, led the technical installation and will be lead operator for the system's life cycle.
“He’s done an outstanding job and has shown great leadership,” Andersen said. “The Advanced Computing Operations team assigns a team lead as the technical point of contact for the duration of our major HPC systems. For Kestrel, Surendra Sunkari has taken on that role. From the initial phases of requirements development, technical proposal evaluation, and the integration of the system into ESIF, Surendra brings an attention to detail, organizational skill, and passion for the work that I would compare to the very best I have worked with in my career.”
Engineer and system integration coordinator Roy Fraley led integration efforts to tie Kestrel into ESIF, mainly focusing on power and cooling infrastructure with a lesser focus on network integration.
“I worked with ESIF operations and David Sickinger on piping," Fraley said. "ESIF operations ran with the design and construction of the hydronic lines or cooling loops.” He highlighted the concerted efforts of many NREL staff, including “Trevor Dziedzic, [who] was instrumental in hydronics, cooling, and plumbing and piping design. Steve Mager got power work completed. Glenn Powell, Tyson Frank, and Jon Veenstra were all very helpful with whatever we needed. Rob Russell led site operations, and chief engineer Mike Feller coordinated the system fill to make sure we did not trip any alarms. Then, there’s Encore Electric and a team of electricians, Braconier providing piping and cooling capacity, Ben Krech providing safety training to HPE folks, and Janet Quinn who was instrumental in staging the five trucks so they were not in the way of operations.”
Computational science researcher Tim Kaiser was one of several people who helped evaluate vendor submissions and developed benchmarks. Kaiser recognized the project and lab management efforts of Kris Munch, the laboratory program manager for advanced computing, and Jennifer Southerland, a computational science project manager, along with “Kinshuk Panda, who organized efforts for working with users. Clara Larson did a lot with benchmarking.”
You Break It, Then You Build It
“One of my jobs was to try to break the preproduction hardware that we were given," Kaiser continued. "The goal is to find issues and get them fixed before Kestrel comes online.”
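In spirit, that kind of shakeout can be as simple as hammering nodes with heavy, repeatable compute and checking for inconsistent results. The following is a minimal, generic sketch of such a stress loop, not NREL's actual test suite; the problem size and iteration count are arbitrary.

```python
# A minimal sketch of a node stress test: repeat large matrix multiplications
# and flag any run whose result drifts from the baseline, which can indicate
# a hardware fault. Size and iteration count are arbitrary choices.
import numpy as np

def stress_node(size: int = 4096, iterations: int = 20) -> None:
    rng = np.random.default_rng(0)
    a = rng.random((size, size))
    b = rng.random((size, size))
    reference = a @ b                      # baseline result
    for i in range(iterations):
        result = a @ b                     # repeated heavy compute
        if not np.allclose(result, reference):
            raise RuntimeError(f"Mismatch on iteration {i}: possible hardware fault")
    print(f"{iterations} iterations completed with consistent results")

if __name__ == "__main__":
    stress_node()
```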
The team also rehearsed the delivery and setup to make sure the real thing went smoothly.
“We talked through how all the equipment would be laid out on the trucks, who was responsible for what, who played offense and defense, so to speak," Andersen said. "That had to be set up concurrently with Eagle—you need Eagle because people have to do research still while we get Kestrel up and running."
Now that the CPU nodes are in place and everything is connected, the team is reverifying benchmarks and deliberately introducing failures to confirm that the system behaves as intended and that its power and cooling requirements are met. The build continues through this summer, including software testing, incorporating new equipment to power the system, and preparation for the imminent arrival of the graphics processing units (GPUs).
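Reverifying benchmarks typically means rerunning the same cases used during evaluation and flagging anything that falls outside an agreed tolerance of the baseline. The sketch below illustrates that bookkeeping with entirely hypothetical benchmark names, baselines, and measurements.

```python
# A minimal sketch (hypothetical names and numbers, not NREL's acceptance
# criteria) of re-verifying benchmarks after installation: rerun each case
# and flag anything slower than its baseline by more than a tolerance.
BASELINES_S = {"hpl": 3600.0, "stream": 45.0, "app_proxy": 900.0}  # hypothetical targets
TOLERANCE = 0.05  # allow a 5% slowdown before flagging

def check_acceptance(measured_s: dict[str, float]) -> bool:
    ok = True
    for name, baseline in BASELINES_S.items():
        runtime = measured_s[name]
        if runtime > baseline * (1 + TOLERANCE):
            print(f"FAIL {name}: {runtime:.1f} s vs baseline {baseline:.1f} s")
            ok = False
        else:
            print(f"PASS {name}: {runtime:.1f} s")
    return ok

# Example with made-up measurements:
check_acceptance({"hpl": 3550.0, "stream": 44.0, "app_proxy": 960.0})
```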
The Best Is Yet to Come
GPU nodes are a key feature of the new system.
“Roughly half of [Kestrel’s] capability will come from GPU nodes,” Andersen said. “Machine learning, deep learning, and artificial intelligence (AI) take advantage of those nodes. DOE invested a great deal in the exascale projects, which have software simulation models that are coming online that are GPU capable, so there’s a new capability. GPUs allow researchers to offload math-intensive components of an algorithm to the GPUs and sometimes get 20 times better performance.”
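That "20 times better performance" applies to the offloaded kernels; the overall gain follows the familiar Amdahl's-law pattern and depends on how much of an algorithm can actually run on the GPUs. The worked example below uses assumed numbers, not measurements from Kestrel.

```python
# Amdahl's law: overall speedup when a fraction f of the work is offloaded to
# a GPU that runs that portion s times faster. Values are illustrative.
def amdahl_speedup(f: float, s: float) -> float:
    return 1.0 / ((1.0 - f) + f / s)

for f in (0.5, 0.9, 0.99):
    print(f"offloadable fraction {f:.0%}: overall speedup {amdahl_speedup(f, 20.0):.1f}x")
# Even with a 20x-faster GPU kernel, the overall gain depends heavily on how
# much of the algorithm is math-intensive enough to offload.
```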
The GPUs are supposed to arrive in late August, when snow is unlikely to be in the forecast—but it is Colorado.
“Everyone's highly focused on the effort and doing a great job,” Andersen said. “There’s an amazing support cast and great leadership, and that has made this a successful endeavor.”
Read more about Kestrel and follow its story.