The US National Center for Atmospheric Research (NCAR) in Boulder, Colorado, has selected its next supercomputer. The new machine will, it says, help its scientists conduct research needed to better understand a range of phenomena that affect society, from the behavior of major wildfires to eruptions of solar storms that can threaten GPS and other sensitive technologies.
The system will be built by Hewlett Packard Enterprise (HPE) and installed this year at the NCAR-Wyoming Supercomputing Center (NWSC) in Cheyenne, Wyoming. It will become operational in early 2022 and will replace the existing system, known as Cheyenne.
The HPE Cray EX supercomputer will be a 19.87-petaflops system, meaning it will have the theoretical ability to perform 19.87 quadrillion calculations per second – almost 3.5 times the scientific computing speed of the Cheyenne supercomputer. Once operational, the HPE-powered system is expected to rank among the 25 or so fastest supercomputers in the world.
“This new system is a major step forward in supercomputing power, providing the scientific community with the most cutting-edge technology to better understand the Earth system,” said Anke Kamrath, director of NCAR’s Computational and Information Systems Laboratory.
“The resulting research will lead to new insights into potential threats ranging from severe weather and solar storms to climate change, helping to advance the knowledge needed for improved predictions that will strengthen society’s resilience to potential disasters.”
Since the NWSC opened its doors in 2012, more than 4,000 users from more than 575 universities and other institutions across the nation and overseas have used its resources. Last year, the NWSC joined the COVID-19 High Performance Computing Consortium to accelerate understanding of the novel coronavirus.
“The timing and nature of the NCAR upgrade could not be better for Wyoming. Researchers at the University of Wyoming will make great use of the new system to better understand areas of fundamental and economic interest impacted by flows in the atmosphere and underground,” remarked Ed Synakowski, vice president of research and economic development at the University of Wyoming. “The advances in computing that are captured in this upgrade, and the potential for impactful application of its results, are tremendous. We look forward to working with NCAR and the National Science Foundation in using this increased capacity to advance the fundamental science that determines so many issues of potentially high economic and social importance.”
One of the most innovative features of the new system is its use of accelerated computing with Nvidia A100 Tensor Core graphics processing units (GPUs). The supercomputer will get 20% of its sustained computing capability from GPUs, with the remainder coming from traditional central processing units (CPUs).
GPUs offer significant advantages over CPUs for Earth system research. They are far more powerful and energy efficient, delivering up to six times the performance per watt (as measured by floating-point operations). Adoption of GPU computing will also position the NWSC for the eventual use of exascale computing, which is many times faster than the most advanced systems today.
GPU computing is also more effective for newly developed artificial intelligence and machine learning techniques, because a single accelerator can perform large numbers of computations simultaneously, resulting in lower power usage and less hardware for the same number of parallel operations.
As a result of the GPUs and other energy-efficient features, the new NWSC system will use just 40% more electricity than Cheyenne – which is itself highly energy efficient – despite being almost 3.5 times faster.
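Those two figures imply a sizeable gain in energy efficiency per computation. A back-of-the-envelope check, using only the speedup and power numbers quoted above:

```python
# Back-of-the-envelope check of the energy-efficiency claim,
# using only the figures quoted in the article.
speedup = 3.5          # new system vs. Cheyenne (scientific computing speed)
power_ratio = 1.40     # new system draws 40% more electricity

# Work done per unit of energy, relative to Cheyenne:
efficiency_gain = speedup / power_ratio
print(f"~{efficiency_gain:.1f}x more computation per watt than Cheyenne")
# → ~2.5x more computation per watt than Cheyenne
```

In other words, each unit of electricity buys roughly two and a half times as much science on the new machine.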
The system will have 60 petabytes of high-performance storage, almost double the capacity of Cheyenne. It will feature HPE Slingshot, a purpose-built networking solution developed for high-performance systems to address demands for higher speed and congestion control for data-intensive workloads.
Key features of NCAR’s new supercomputer
- 19.87 peak petaflops, delivered by the HPE Cray EX architecture, which is engineered to support next-generation supercomputing, including exascale systems.
- 2,570 compute nodes in total: 2,488 homogeneous (CPU-only) nodes and 82 heterogeneous (GPU) nodes.
- Homogeneous nodes have 2x 3rd Gen AMD EPYC CPUs.
- Heterogeneous (GPU) nodes have 1x 3rd Gen AMD EPYC CPU and 4x Nvidia 1.41GHz A100 Tensor Core GPUs, each with 40GiB of HBM2 memory, connected by a 600GB/s Nvidia NVLink GPU interconnect.
- 692 terabytes (TB) of total memory.
- HPE Slingshot (v11) high-speed interconnect in a Dragonfly topology.
- Homogeneous compute nodes have one Slingshot injection port; GPU nodes have four Slingshot injection ports per node.
- HPE Slingshot bandwidth is 200Gb/sec per port per direction.
- HPE Slingshot MPI latency is 1.7-2.6 usec.
- Eight login nodes, each with 512GB of DDR4-3200 memory:
  - six with 2x AMD EPYC 7742 CPUs;
  - two with 2x AMD EPYC 7742 CPUs and 2x Nvidia V100 GPUs.
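The node counts above also fix the aggregate size of the GPU partition. A quick sketch, with every input taken directly from the list:

```python
# Aggregate GPU-partition figures derived from the listed node specs.
gpu_nodes = 82          # heterogeneous (GPU) nodes
gpus_per_node = 4       # Nvidia A100 Tensor Core GPUs per node
hbm2_per_gpu_gib = 40   # GiB of HBM2 memory per A100

total_gpus = gpu_nodes * gpus_per_node
total_hbm2_gib = total_gpus * hbm2_per_gpu_gib
print(f"{total_gpus} GPUs, {total_hbm2_gib} GiB of HBM2 in total")
# → 328 GPUs, 13120 GiB of HBM2 in total
```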
Software Environment
- HPE Cray Operating System (OS), a tuned version of SUSE Linux.
- Altair Accelerator Plus scheduler with PBS Professional Workload Manager.
- Support for Docker containers, Singularity containers and containers that support the Open Container Initiative standard.
- HPE Cray Programming Environment, support for OpenMP 4.5 and 5.0, and MPI v3.1.
- Performance analysis and optimization tools in the HPE Cray Programming Environment to improve performance of applications.
- Nvidia HPC SDK, a comprehensive set of compilers, libraries and tools for the accelerated platform.
- Intel Parallel Studio XE compiler suite.
- Cray ClusterStor E1000 storage system from HPE (based on Lustre 2.12 LTS).
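For users planning ahead, a batch job under the PBS Professional workload manager named above would typically look something like the sketch below. The job name, account code, queue name and resource selections are placeholders for illustration, not NCAR's actual configuration:

```shell
#!/bin/bash
# Hypothetical PBS Professional job script; the account code, queue
# name and resource requests are placeholders, not NCAR's settings.
#PBS -N demo_job
#PBS -A PROJECT0001                      # placeholder project/account code
#PBS -q main                             # placeholder queue name
#PBS -l select=2:ncpus=128:mpiprocs=128  # 2 nodes, 128 MPI ranks each
#PBS -l walltime=01:00:00
#PBS -j oe                               # merge stdout and stderr

cd "$PBS_O_WORKDIR"
mpiexec ./my_model                       # placeholder executable
```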
The new NWSC-3 supercomputer and the existing NWSC GLADE file systems are complemented by a new parallel file system and data storage components.
Key features of the new data storage system:
- Six Cray ClusterStor E1000 storage systems from HPE.
- 60 petabytes of usable file system space (can be expanded to 120 petabytes by exercising options).
- 300GB per second aggregate I/O bandwidth to/from the NWSC-3 system.
- 5,088 × 16TB drives.
- 40TB SSD for Lustre file system metadata.
- Two metadata management units (MDUs) exporting four MDTs (one MDT per MDS), configured in highly available storage pairs.