Inside Advanced Scale Challenges|Monday, December 10, 2018

IBM Accelerates Power8 Clusters With GPUs, FPGAs, And Flash 

It is perhaps a lucky stroke of timing, or perhaps by design, that only days after Big Blue sold off its System x X86 server business to Lenovo Group for $2.1 billion, the company is coming out swinging with Power8 servers that augment their performance with a variety of adjunct coprocessors and flash storage. Ahead of next week's Enterprise2014 event in Las Vegas, where it will be talking about its increasing focus on Power Systems and System z mainframes, the company is launching a number of systems that are designed to take workloads away from X86 clusters.

As EnterpriseTech has previously reported, IBM has been telling customers to expect larger Power8-based machines with more than two sockets as well as systems that would use field programmable gate arrays (FPGAs). IBM has also been hinting that it and OpenPower partner and GPU coprocessor maker Nvidia would be working together to get a Power8-Tesla hybrid system into the field before the end of the year.

It turns out that IBM is launching three different systems tuned up for three different kinds of workloads that are based on its "scale-out" Power8 systems. By scale-out, IBM means a system is designed with one or two sockets and is intended to be used in clusters that have distributed applications that scale their capacity by adding multiple nodes in a loosely coupled fashion. This is distinct from "scale-up" machines, which more tightly couple server nodes and their main memory together, usually using non-uniform memory access (NUMA) technology, to create what is in essence a single large processor to run fat applications or their databases.

Big Blue is also rolling out scale-up versions of its Power8 systems, which it has also promised would come this year, ahead of the Enterprise2014 event. So don't think the Power8 rollout is only about creating a Power8 alternative to the workhorse, two-socket server based on Intel's Xeon E5-2600 processors. (We will report on these NUMA machines, which are called the Power Enterprise Systems, in a separate story.)

The new Power S824L is a Linux-only version of the existing Power S824 machine that IBM announced back in April. It is a two-socket machine that comes in a 4U chassis that has room for a dozen 2.5-inch disk drives and eleven PCI-Express 3.0 peripheral slots. This is not the skinniest of GPU-accelerated servers out there in the market by far, but IBM is betting that the memory and I/O bandwidth of the Power8 machines will give it a performance advantage compared to Xeon E5 servers using the same Tesla accelerators.

The Power S824L can be equipped with two different processor options. The first is a pair of ten-core Power8 dual-chip modules (for a total of 20 cores) that run at 3.42 GHz, and the second is a pair of twelve-core dual-chip modules (for a total of 24 cores) that run at 3.02 GHz.

With the scale-out variants of the Power8 processors, IBM has created a dual-chip module that supports 48 PCI-Express lanes per socket, with up to 32 of these lanes configurable as Coherent Accelerator Processor Interface (CAPI) ports on the Power8. With CAPI, an accelerator based on GPUs, DSPs, or FPGAs that resides on a PCI card can link into the Power8 processor and memory complex and look like what is in effect a "hollow core" that has the same access to the memory hierarchy as the actual Power8 cores. What this means is that these accelerators do not have to move data back and forth between the CPU and the accelerator; both devices address the same memory space. Anyway, IBM created a six-core chip module with lots of PCI-Express lanes and CAPI ports and then put two of them in a single socket for the scale-out Power8 machines precisely because it wanted to be able to put lots of accelerators on them. The Power8 chip that is used in the scale-up NUMA machines in the enterprise class puts a dozen cores on a single die, and these chips have a larger memory capacity per socket and fewer PCI-Express 3.0 lanes and therefore fewer CAPI ports. (32 PCI-Express lanes, 16 of which can be used by CAPI, to be precise.) The Power S824L itself delivers 96 GB/sec of I/O bandwidth per socket, and has six x8 slots available for other peripheral attachment.
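The 96 GB/sec per-socket figure is consistent with counting 48 PCI-Express 3.0 lanes at roughly 1 GB/sec each, in both directions. The sketch below works that out under two assumptions that are not stated in IBM's numbers: that each lane runs at the standard PCI-Express 3.0 rate of 8 GT/sec with 128b/130b encoding, and that the quoted figure is bidirectional. The function and constant names are illustrative only.

```python
# Assumption: PCI-Express 3.0 signals at 8 GT/sec per lane with 128b/130b
# encoding, and the quoted I/O figure counts both directions.
PCIE3_GIGATRANSFERS_PER_SEC = 8.0
PCIE3_ENCODING_EFFICIENCY = 128 / 130


def lane_gb_per_sec() -> float:
    """Usable bandwidth of one PCIe 3.0 lane in one direction, in GB/sec."""
    # Divide by 8 to convert gigabits to gigabytes.
    return PCIE3_GIGATRANSFERS_PER_SEC * PCIE3_ENCODING_EFFICIENCY / 8


def socket_io_gb_per_sec(lanes: int, both_directions: bool = True) -> float:
    """Aggregate PCIe bandwidth for a socket with the given lane count."""
    one_way = lanes * lane_gb_per_sec()
    return one_way * 2 if both_directions else one_way


scale_out_io = socket_io_gb_per_sec(48)  # scale-out dual-chip module
scale_up_io = socket_io_gb_per_sec(32)   # scale-up single-die chip

print(f"scale-out socket: {scale_out_io:.1f} GB/sec")
print(f"scale-up socket:  {scale_up_io:.1f} GB/sec")
```

After encoding overhead the scale-out result lands a little under 96 GB/sec; rounding each lane to a nominal 1 GB/sec per direction gives the quoted number exactly.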

The Power S824L system uses IBM's high-end and custom memory cards, which have its "Centaur" memory buffer chip on them and which have 16 MB of L4 cache memory sitting between the processor and the main memory. The Power8 chips have 512 KB of L2 cache per core and 96 MB of shared eDRAM L3 cache across the cores on the die. (On the scale-out machines, it is 48 MB per die and two dies per socket.) The system supports up to 16 DDR3 memory slots running at 1.6 GHz and delivering 384 GB/sec of aggregate memory bandwidth across the two sockets. The system supports 16 GB, 32 GB, and 64 GB memory sticks and tops out at 1 TB of capacity. (A whopping 128 GB memory card that IBM is making available for the generic Power S824 server that can run AIX, IBM i, or Linux is not available on the Power S824L Linux-only system.)

The GPU-enabled Power8 system comes with one Tesla K40 adapter installed in a PCI-Express 3.0 x16 slot with a second as an option in another x16 slot. The machine has a total of four x16 slots and the remaining two can be used to host CAPI-enabled coprocessors of some kind. At the moment, the only option is a CAPI-tweaked version of an FPGA card from Nallatech, which is partnering with IBM to create a CAPI software development kit.

According to Stefanie Chiras, director of the scale-out Power Systems at IBM's Systems and Technology Group, the Tesla GPUs connect over normal PCI-Express x16 links and do not use the CAPI interface. However, through the OpenPower Foundation, IBM and Nvidia are working together to bring Nvidia's very clever NVLink interconnect, which the GPU maker previewed back in March for the future "Pascal" family of GPUs, to future Power chips. It is not yet clear how this integration will work, but should it come to pass, IBM will be able to have very tight links between its Power chips and multiple Nvidia coprocessors as well as between those coprocessors themselves. The NVLink technology is expected to come to market in 2016, and will very likely see integration with the future Power8+ processor from IBM and the OpenPower Foundation.

The Tesla K40 coprocessor is the fastest one that Nvidia makes, and it can significantly goose the number-crunching performance of a Power8 chip, which is no slouch by itself. The Tesla K40 has 2,880 CUDA cores and is based on the "Kepler" GK110B GPU. It is rated at 1.4 teraflops at double precision and 4.29 teraflops at single precision. With GPU Boost overclocking, the performance can be boosted another 20 to 30 percent, depending on the thermal headroom in the system. This Tesla card has 12 GB of GDDR5 buffer memory and delivers 288 GB/sec of memory bandwidth on the card with error correction turned off. The card takes up two PCI slots of space in the server (but only one PCI-Express x16 slot) and has a maximum power draw of 235 watts. By comparison, a Power8 chip is widely believed to have a thermal design point of around 250 watts (IBM does not provide this number) and each Power8 core can do four double precision floating math operations per clock. So a twelve-core Power8 socket with cores running at 3.02 GHz should deliver around 145 gigaflops at double precision. The GPU provides roughly ten times the math oomph.
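The peak-flops comparison is simple arithmetic: cores times clock speed times floating point operations per clock. Here is a minimal sketch using the numbers quoted above. The operations-per-clock value is left as a parameter because the result is sensitive to how operations are counted: treating a fused multiply-add as two operations would double the per-core figure. The function name is hypothetical.

```python
def peak_dp_gflops(cores: int, clock_ghz: float, dp_flops_per_clock: int) -> float:
    """Theoretical peak double precision gigaflops for one socket."""
    return cores * clock_ghz * dp_flops_per_clock


TESLA_K40_DP_GFLOPS = 1400.0  # 1.4 teraflops, as quoted in the article

# Article's counting: four double precision operations per core per clock.
power8_socket = peak_dp_gflops(cores=12, clock_ghz=3.02, dp_flops_per_clock=4)
gpu_advantage = TESLA_K40_DP_GFLOPS / power8_socket

# Alternative counting, if each fused multiply-add is scored as two operations.
power8_socket_fma = peak_dp_gflops(cores=12, clock_ghz=3.02, dp_flops_per_clock=8)
gpu_advantage_fma = TESLA_K40_DP_GFLOPS / power8_socket_fma

print(f"Power8 socket at 4 flops/clock: {power8_socket:.1f} gigaflops, "
      f"K40 advantage {gpu_advantage:.1f}x")
print(f"Power8 socket at 8 flops/clock: {power8_socket_fma:.1f} gigaflops, "
      f"K40 advantage {gpu_advantage_fma:.1f}x")
```

Under the four-per-clock counting the socket comes out just under 145 gigaflops and the K40 a bit under ten times that, which matches the rounded figures in the text.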

It is not clear what level of the CUDA parallel application development environment this Power S824L machine supports, but it does only run the forthcoming Ubuntu Server 14.10 from Canonical. This Linux operating system is running in bare-metal mode, without the PowerVM or PowerKVM hypervisors underneath it, to give the top performance possible.

Pricing was not available at press time, but the Power S824L system with Tesla coprocessors is available on October 31. Generally speaking, Chiras tells EnterpriseTech that at the system level, a Power-Tesla combination should offer about a 20 percent price/performance advantage compared to a similar Xeon server equipped with the same Tesla K40 coprocessors. IBM is in the process of helping key software vendors in the technical and enterprise computing markets to port their codes to Power-Linux and enable them for GPU offload for heavy calculations. IBM is working to get its Java stack and its DB2 relational database with BLU Acceleration to also offload routines to the GPUs. No word on when any of this might be available.

Using Flash As Slow Memory

GPU coprocessors are not the only way to accelerate workloads, and IBM is using a number of different technologies in unison to create bundles of hardware and software based on Power8 systems it is calling Data Engines.

The Data Engine for NoSQL marries a two-socket Power S822L server with a FlashSystem 840 all-flash array to run the Redis NoSQL data store. This machine has the same processor options as the Tesla accelerated Power S824L above, only it comes in a 2U enclosure with fewer drive bays and PCI-Express 3.0 expansion slots.

Redis is an in-memory NoSQL data store, and you scale out the performance of Redis by beefing up the main memory in server nodes and by spreading data across an ever-increasing number of nodes. DRAM main memory may have come down a lot in price over the past several years, but flash memory is still a lot less expensive and is considerably faster than disk storage. So IBM is hooking a flash array into the Power8 server and using a CAPI interface between the processor and the flash array, implemented in an FPGA, to link the flash memory directly into the Power8 processor complex. We were talking to Steve Sibley, director of worldwide product management for IBM’s Power Systems division, and suggested that it looked like IBM was using the FPGA linked through CAPI as a kind of extended memory controller that allows the flash memory to be addressed as a kind of slow main memory, and he concurred that this was a good way to look at it.

The main thing is that the flash memory in the external FlashSystem 840 array is directly addressable by the Power8 processor and that, depending on the Redis workload, only somewhere between 10 and 15 percent of the data in a Redis data store is actually hot enough to require placement in main memory. Say you have a 12 TB data store, and the typical X86 server tops out at 384 GB or 512 GB of memory using reasonably dense but not too expensive memory sticks. In that case, you need 24 servers plus a few backups to store that 12 TB data store in Redis. With the Data Engine for NoSQL setup IBM has cooked up, the I/O operations per second between the flash and the server are boosted by a factor of 7X or so because the entire PCI-Express, Fibre Channel, and driver stack is eliminated and the flash runs fast enough to be addressed as a kind of slow memory over CAPI. So a FlashSystem 840 with a dozen 1 TB flash modules can host that 12 TB of data and the single server node only needs 256 GB of main memory.
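The 24-server figure follows from dividing the data store size by the memory per node. A minimal sketch of that sizing, assuming binary units (1 TB = 1,024 GB) and ignoring Redis overhead, replicas, and operating system memory, none of which the article quantifies. The function name is hypothetical.

```python
import math

GB_PER_TB = 1024  # assumption: binary units


def nodes_for_in_memory_store(store_tb: float, node_memory_gb: int) -> int:
    """Minimum server count needed to hold the whole data store in DRAM."""
    return math.ceil(store_tb * GB_PER_TB / node_memory_gb)


nodes_at_512gb = nodes_for_in_memory_store(12, 512)
nodes_at_384gb = nodes_for_in_memory_store(12, 384)

print(f"12 TB across 512 GB nodes: {nodes_at_512gb} servers")
print(f"12 TB across 384 GB nodes: {nodes_at_384gb} servers")
```

The article's count of 24 servers matches the 512 GB configuration; with 384 GB nodes the same store would need 32.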

In this example, 24 server nodes are collapsed down to one server and one flash array with equivalent performance (call it a 24 to 1 reduction because a Redis cluster has backup nodes and load balancers) and the overall cost of the infrastructure is reduced by a factor of 2.4, according to IBM. The resulting setup burns one-twelfth the amount of energy (about 1,500 watts versus 18 kilowatts) and takes up one-twelfth the space. For a larger Redis data store, say at the maximum 40 TB capacity of the FlashSystem 840, an X86 cluster would be 3.3 times as expensive as the Power8-FlashSystem combination. Generally speaking, IBM is telling customers to expect the Data Engine for NoSQL to cost about a third of what a standard X86 cluster would cost to support the same Redis data store with about the same performance.
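The consolidation ratios quoted here can be cross-checked with a few lines of arithmetic. The sketch below uses only the figures IBM supplied; the per-node wattage is an inference from dividing the cluster total by 24 nodes, not a number IBM stated, and the names are illustrative.

```python
X86_NODES = 24
X86_CLUSTER_WATTS = 18_000
DATA_ENGINE_WATTS = 1_500

energy_ratio = X86_CLUSTER_WATTS / DATA_ENGINE_WATTS
# Inferred, not quoted: average draw per X86 node implied by the cluster total.
watts_per_x86_node = X86_CLUSTER_WATTS / X86_NODES


def relative_cost(x86_cost_multiple: float) -> float:
    """Data Engine cost as a fraction of the equivalent X86 cluster cost."""
    return 1 / x86_cost_multiple


cost_at_12tb = relative_cost(2.4)  # 12 TB example
cost_at_40tb = relative_cost(3.3)  # 40 TB maximum FlashSystem 840 capacity

print(f"energy ratio: {energy_ratio:.0f} to 1")
print(f"implied draw per X86 node: {watts_per_x86_node:.0f} watts")
print(f"relative cost at 12 TB: {cost_at_12tb:.0%}")
print(f"relative cost at 40 TB: {cost_at_40tb:.0%}")
```

On these numbers, the "about a third" guidance corresponds to the 40 TB case; at 12 TB the Data Engine works out to a bit over 40 percent of the X86 cluster cost.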

The Data Engine for NoSQL will be available on November 21. It runs its software stack on Ubuntu Server 14.10 as well. The Redis software comes bundled on the box but you have to get licenses from Redis Labs, the commercial entity behind the NoSQL data store.

A few weeks later, on December 5, IBM will ship the third gussied-up version of its Power8 scale-out systems, which is called the Data Engine for Analytics. This cluster is also based on the Power S822L server and it also has an FPGA embedded in it. But in this case, that FPGA is used to provide Gzip data compression and decompression for a Hadoop analytics stack in the compute nodes. Other Power S822L nodes run the controllers for IBM's Elastic Storage software, which is a variant of the GPFS parallel file system with software RAID and other enhancements, and the Platform Symphony Java messaging middleware. The compute nodes run Red Hat Enterprise Linux 6.5 and load up IBM's own BigInsights variant of Hadoop. The Elastic Storage superset of GPFS replaces the Hadoop Distributed File System (HDFS) and Symphony replaces the MapReduce layer in Hadoop. The result is a Hadoop stack that has significantly better performance on batch jobs than the standard Hadoop stack running on X86 clusters. How much better, Sibley is not at liberty to say because the final benchmark tests on this setup have not been completed.

What Sibley can say is that the Data Engine for Analytics cluster will require one third less raw storage for a given amount of data than a plain vanilla Hadoop cluster with its triplicate copying of data because Elastic Storage has other data protection built in that does not require such capacity overhead. Pricing for the Data Engine for Analytics has not been set yet.
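To put the one-third saving in concrete terms, here is a small sketch that compares raw capacity under three-way replication with a one-third reduction from that baseline. The 100 TB data set is a made-up example, and the resulting 2x overhead for Elastic Storage is an inference from Sibley's figure, not a number IBM provided.

```python
from fractions import Fraction

HDFS_REPLICATION_FACTOR = 3           # triplicate copying in plain vanilla Hadoop
ELASTIC_STORAGE_SAVING = Fraction(1, 3)  # "one third less raw storage"


def hdfs_raw_tb(data_tb: int) -> Fraction:
    """Raw capacity needed when every block is stored three times."""
    return Fraction(data_tb * HDFS_REPLICATION_FACTOR)


def elastic_storage_raw_tb(data_tb: int) -> Fraction:
    """Raw capacity after applying the quoted one-third reduction."""
    return hdfs_raw_tb(data_tb) * (1 - ELASTIC_STORAGE_SAVING)


data_tb = 100  # hypothetical data set size
hdfs_raw = hdfs_raw_tb(data_tb)
elastic_raw = elastic_storage_raw_tb(data_tb)
implied_overhead = elastic_raw / data_tb

print(f"HDFS raw capacity:            {float(hdfs_raw):.0f} TB")
print(f"Elastic Storage raw capacity: {float(elastic_raw):.0f} TB")
print(f"implied overhead factor:      {float(implied_overhead):.1f}x")
```

Exact fractions are used so the one-third reduction does not pick up floating point rounding.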

4 Responses to IBM Accelerates Power8 Clusters With GPUs, FPGAs, And Flash

  1. Xinghong He

    A little correction: each Power8 core can do eight double precision floating math operations per clock, not four.

     
    • Timothy Prickett Morgan

      From the IBM Power System S822 Technical Overview and Introduction
      at http://www.redbooks.ibm.com/redpapers/pdfs/redp5102.pdf

      “An integrated, multi-pipeline vector-scalar floating point unit for running both scalar and SIMD-type instructions, including the Vector Multimedia eXtension (VMX) instruction set and the improved Vector Scalar eXtension (VSX) instruction set, and capable of up to eight floating point operations per cycle (four double precision or eight single precision)”

      I read that to mean four DP per core.

       
      • Xinghong He

        Yes, I’d read the same too.

        This integrated unit has two pipelines, each of which can handle two-way SIMD floating-point DP instructions (including FMA). In the case of FMA, it translates to 2x2x2=8 DP FLOPs. This feature was introduced with POWER7.

         
  2. Ed B.

    Typo:

    “The Power8 chips have 512 KB of L2 cache per core and 96 GB of shared L3 cache across the cores on the die.”

    That should read “96 MB of shared L3 cache”

     
