Advanced Computing in the Age of AI | Thursday, March 28, 2024

High Frequency Traders Hedge Bets With IBM Power 

Banks, hedge funds, high frequency traders, and other players in the capital markets have a need for speed. IBM thinks that its Power-based systems are better suited for running their applications than the current crop of X86 systems. That is why some equities trading desks in New York and London are putting Power7+ machines through the paces to see if Big Blue is right – and they are looking ahead to Power8 iron coming around the middle of this year, too.

When IBM decided to sell off its X86 server business to Lenovo a month ago, the company made it clear that it remained committed to its homegrown Power processors and the systems that are built using them. IBM wants to do a better job selling Power-based machines against X86 iron, and one lucrative place where it wants to start is in the capital markets. Specifically, say IBM and key partners, the Power platform can carve out a niche in high frequency and other trading systems.

In 2013, $26.8 billion of IBM's total $99.8 billion in revenues came from the financial services sector, which is the largest revenue stream that IBM has from any single vertical. IBM obviously has a presence in the datacenters at the largest financial institutions on the planet, so if it can demonstrate a performance advantage running trading platforms compared to X86 systems, then it should have a relatively easy time breaking into this market. To that end, for the past year, Ravi Arimilli, a former Power chip designer and Systems and Technology Group CTO who is also an IBM Fellow, has been working with Terry Keene, CEO at financial services consultancy Integration Systems, to figure out how to retool high frequency trading programs to run on Power-based systems instead of overclocked systems based on Intel Xeons or AMD Opterons – and do so with fewer nodes, less jitter, and higher performance.

"In capital markets, there is lots of scale, meaning they have lots of servers," Arimilli explains to EnterpriseTech. "These companies typically have homegrown code, and they don't have a bunch of middleware in there – there is no SPSS, no HANA, no Oracle, there's just code running on the hardware directly. They have code that they need to grind against these engines, and they will push it as hard as they can. They really don't care about high availability because if they break something they have lots of servers, so it is no big deal. If you can improve their performance by 10 to 15 percent, they will pay you almost twice as much for the server. To them, that performance improvement is a big ticket because that is money on the table. Hardware, in this space, matters so much."

Perhaps equally significant is the fact that Arimilli believes that what is happening in capital markets is a precursor to the kind of hardware platforms that retail, ecommerce, and other industry sectors will want to deploy as analytics before, during, and after transactions start to matter as much as the transactions themselves. "It is not just capital markets," he says. "There is a push for a new class of servers for jobs that feel like capital markets, and companies that would not buy expensive hardware will now move in that direction for their new plays as a company."

Way back in the day when high frequency trading was young, says Keene, Digital Equipment was the vendor of choice for capital markets trading, and then Sun Microsystems knocked Digital out in the early 1990s. In the early 2000s, Hewlett-Packard took over that spot, mainly because it could get sixteen blade servers into an enclosure, which was more than IBM or Dell could do at the time. "As funny as that sounds, density is incredibly important to capital markets," says Keene.

Over the past several years, Supermicro, Penguin Computing, SGI, Appro (now part of Cray), Dell, and Ciara Technologies, among others, have all created overclocked servers based on either single-socket Core i5 or i7 processors or on two-socket Xeon E5 processors. (Late last year at The Trading Show in New York, Supermicro was showing off its low latency HFT Hyper-Speed servers, which are based on eight-core Xeon E5-2687W processors running at 3.4 GHz with FPGA-accelerated Ethernet cards from Solarflare installed.)

While low latency is important with high frequency trading, so is getting deterministic – meaning predictable and consistent – performance out of a machine. The incremental cost of that next microsecond shaved off latency is huge, and that is why Keene says capital market players are starting to think less about high frequency trading and more about what he is dubbing high intelligence trading – where analytics is built into the same box that is doing the trades, not only to cut down on latency but also to simplify infrastructure. (Precisely what IBM and Integration Systems are working on in this regard, neither Keene nor Arimilli will say, except that it will become clearer when the Power8 systems are launched around the middle of this year.)

High frequency trading is not as simple as overclocking processors that do trades anymore. For one thing, AMD has lost the clock speed war with Intel on the X86 front, and with the "Haswell" generation of processors, only the K series of Core i5 and i7 processors from Intel are officially supported as overclocked devices. There are no current Xeons that support overclocking, although there are some BIOS tricks that allow traders to jack up the cores to their Turbo Boost speeds and hold them there if they turn off all of the energy-saving features on the chips. (That said, if you are a big enough customer, as eBay is, you can get special high-frequency SKUs from Intel, which eBay has been able to do for the machines in its new datacenter in Utah.) With the Core i7 K series chip, says Keene, you can have four cores running at 4.5 GHz, or two running at 4.8 GHz, and even one running at 5 GHz, but he laughs and says you need liquid nitrogen to cool it. Moreover, these chips burn out in 30 to 60 days and they are not under warranty, either.

"We started talking to these guys, and what they tell us is that their biggest challenge is jitter," Keene explains. They don't want to have to do overclocking with expensive cooling, although some are doing modest overclocking in their trading systems. (The Lucera financial services cloud that launched last week does modest overclocking on machines from Scalable Informatics, for instance.)

"Let's just talk about going into one of the exchanges because that is an easy example," says Keene. "Right now, the London Stock Exchange touts itself as having the fastest exchange on the planet, at under 125 microseconds. So if I have written an algorithm to do trading at under 125 microseconds and that is how I expect to take data in and make decisions, and that's how I do trades, if all of a sudden a trade comes in at 3 or 4 milliseconds, you have just blown my algorithm and I could have just lost a million dollars. And jitter happens inside the HFT box as well. If the algorithm is supposed to execute in 20 microseconds, and if all of a sudden something inside the box causes some kind of queuing delay and I can't execute my trade for 100 or 115 microseconds, then I have lost my trade and I have lost all of the money I was going to make. So jitter is turning out to be almost as important as HFT, and in fact, I have had a number customers tell me that they will trade off latency to reduce jitter and increase density."

Trading applications are generally written in C with a dash of assembler here and there, but Keene says some applications use Java instead of C, sacrificing performance. "This was a bit surprising to us, but the beauty of Java is that you can change the algos really quickly," says Keene. Some trading shops have gone so far as to create their own garbage collection routines for the Java virtual machines – although some, like Priceline.com for its homegrown caching system for hotel reservations, are discovering the Zing JVM from Azul Systems, which has a clever pauseless garbage collection routine that removes some of the jitter in the system.

The thing to remember is that the coders who write trading applications are some of the smartest programmers on the planet, and they have intimate knowledge of the X86 platform, and of Xeon chips in particular. They optimize the Linux kernel themselves to reduce contention and jitter on the cores that run the operating system, and they create other routines on the remaining cores to keep them running at 90 percent or higher CPU utilization at all times, again to get that deterministic performance. Some of this is dummy work, and some of it is real work. It is easier to give the machine dummy work to make it perform in a steady fashion than to let the chip's own power-saving and load balancing features take over and accidentally introduce latencies and jitter in the overall system. In a lot of cases, the applications are written in C or C++ and trading desks typically use the open source GNU compilers, says Keene.
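The pin-and-spin idea described above can be sketched in a few lines. This is only an illustration, not how any trading shop actually does it: real systems would do this in C against a tuned kernel, and the helper names `pin_to_one_core` and `spin_until` are hypothetical. The sketch pins the process to one core using the Linux-only `os.sched_setaffinity` call, then busy-waits on a deadline rather than sleeping, so the thread stays runnable instead of handing the core back to the scheduler.

```python
import os
import time

def pin_to_one_core():
    """Pin this process to a single allowed core; return the core id or None."""
    # sched_setaffinity is Linux-specific, so fall back gracefully elsewhere.
    if not hasattr(os, "sched_setaffinity"):
        return None
    core = min(os.sched_getaffinity(0))
    os.sched_setaffinity(0, {core})
    return core

def spin_until(deadline_ns):
    """Busy-wait until the deadline instead of sleeping."""
    spins = 0
    while time.perf_counter_ns() < deadline_ns:
        spins += 1  # dummy work that keeps the core occupied
    return spins

core = pin_to_one_core()
start = time.perf_counter_ns()
spins = spin_until(start + 1_000_000)  # hold the core for one millisecond
elapsed_ns = time.perf_counter_ns() - start
print(core, spins, elapsed_ns)
```

The trade-off is the one the article describes: the core burns cycles on work that accomplishes nothing, in exchange for more predictable timing when real work arrives.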

This makes porting the code from a Xeon to the Power7+ chip fairly easy, since the dual-socket PowerLinux 7R2 servers support Linux and the GNU compilers. But just doing a simple port and recompile doesn't get the optimum performance out of the Power7+ chip, which in this case is running at 4.6 GHz across its eight cores, and doing so with air cooling rather than liquid or nitrogen cooling.

Here is an example of the kind of chip architecture and tuning difference that is significant for HFT shops that are looking to use Power instead of X86. Intel's Xeons have L3 caches that are shared across the cores, with 2.5 MB or 3 MB per core on a round-robin double ring loop. The access times to those caches are on the order of 25 nanoseconds, according to Keene, depending on how far a core is from a cache segment. A lot of HFT shops do not take advantage of that cache because it does not offer a huge advantage. But on a Power7+ chip, you get 80 MB of L3 cache, with 10 MB per core of near cache, attached right to each core with an access time of maybe 5 nanoseconds, and remote cache that can be accessed at maybe 25 nanoseconds. The trick for tuning these HFT applications, then, is to get data down into that 10 MB cache segment so it can be chewed on locally by each core. Used that way, the L3 cache can hold more data close to each core and provide faster access to it.
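A rough back-of-the-envelope sketch shows why getting data into the near cache segment matters. It uses only the approximate figures Keene cites above (10 MB of near L3 per core at about 5 nanoseconds, remote L3 at about 25 nanoseconds). The 64-byte record size, the function names, and the treatment of 10 MB as 10 × 1024 × 1024 bytes are assumptions made for illustration, not details from IBM or Integration Systems.

```python
NEAR_L3_BYTES = 10 * 1024 * 1024  # 10 MB near L3 per core (assumed binary MB)
NEAR_NS = 5.0                     # approximate near-cache access time
REMOTE_NS = 25.0                  # approximate remote-cache access time

def records_per_core(record_bytes):
    """How many fixed-size records fit in one core's near L3 segment."""
    return NEAR_L3_BYTES // record_bytes

def fits_in_near_cache(record_bytes, record_count):
    """True if the whole working set fits in one core's near L3 segment."""
    return record_bytes * record_count <= NEAR_L3_BYTES

def average_access_ns(near_fraction):
    """Weighted average L3 access time for a given share of near-cache hits."""
    if not 0.0 <= near_fraction <= 1.0:
        raise ValueError("near_fraction must be between 0 and 1")
    return near_fraction * NEAR_NS + (1.0 - near_fraction) * REMOTE_NS

# Hypothetical 64-byte records, such as one order book entry per cache line.
print(records_per_core(64))
print(fits_in_near_cache(64, 100_000))
for share in (0.0, 0.5, 1.0):
    print(share, average_access_ns(share))
```

Under these assumptions, moving from half of the accesses hitting near cache to all of them cuts the average L3 access time from 15 nanoseconds to 5 nanoseconds, which is the payoff the tuning work is chasing.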

The Power7 and Power7+ chips can be overclocked, by the way, according to Arimilli and Keene, but neither is suggesting that HFT customers do that because they would have to do something other than air cool the systems. (Assuming that vendors push the chip at about 80 percent of its capacity, a Power7+ processor could hit perhaps as high as 5.5 GHz to 5.7 GHz, but doing so would introduce cooling issues, invalidate warranties, and perhaps introduce jitter.)

Without resorting to overclocking, IBM and Integration Systems have been able to show in the initial proofs of concept in New York and London that the work being done by 200 two-socket "Ivy Bridge-EP" Xeon E5-2600 v2 servers can be done by around 100 two-socket Power7+ nodes. This is just with a raw recompile of the trading applications. And with code optimization to take advantage of cache, threads, registers, and other features of the Power architecture, the consolidation could be pushed down to around 80 nodes, according to Arimilli. With the Power8 systems coming later this year, IBM will be able to have around 40 nodes do the same work as those 200 X86 systems cited above.
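The consolidation figures above reduce to simple arithmetic, sketched here for clarity. The speedup factors are not stated by IBM in this form; they are back-calculated from the node counts in the article (200 to 100 is 2x, 200 to 80 is 2.5x, and 80 to 40 is a further 2x), and the function name `nodes_needed` is hypothetical.

```python
import math

def nodes_needed(baseline_nodes, speedup):
    """Nodes required to match a baseline cluster at a given per-node speedup."""
    return math.ceil(baseline_nodes / speedup)

X86_NODES = 200  # two-socket Xeon E5-2600 v2 servers in the proofs of concept

recompiled = nodes_needed(X86_NODES, 2.0)  # Power7+ with a raw recompile
tuned = nodes_needed(X86_NODES, 2.5)       # Power7+ with code optimization
power8 = nodes_needed(tuned, 2.0)          # a further 2x implied for Power8

print(recompiled, tuned, power8)
```

Read this way, the Power8 estimate of around 40 nodes is the optimized Power7+ figure halved, or a 5x per-node advantage over the X86 baseline for these particular workloads.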

"It is important to remember that this is for applications in the capital markets," says Arimilli. "You may point out that if you look at SPEC integer processor benchmarks, you would not expect this kind of reduction, but those tests are not capital market applications. Those applications stress the memory, the queues, the fabric, and other features, and on these specific applications we can do 2:1 compared to X86 now with Power7+ and another 2:1 with Power8."
