Fat Xeons Best Accelerated Systems For Option Pricing
Sometimes you can throw hardware at a problem and speed up an application, and other times, if you delve into the software stack and make changes, that works. Sometimes, you can do both and come out even further ahead. Such is the case with a new set of benchmarks run on the latest "Ivy Bridge" Xeon processors from Intel, which were able to significantly speed up options pricing on a widely used set of benchmarks developed by the financial industry.
The important thing about the Securities Technology Analysis Center (STAC) benchmarks is that the tests are created by the big financial firms, not by the vendors. The benchmarks that STAC has created run the gamut of the financial services industry, including simulations of feed handlers, data distribution, tick analytics, event processing, risk computation, backtesting, trade execution, tick-to-trade, and other functions commonly performed by banks, brokers, exchanges, hedge funds, and others involved in securities trading. The risk computation test, which does a Monte Carlo simulation, is called the STAC-A2 test, and it has been used on a variety of different systems to show how quickly they can compute the risk on American-style options.
An option is an example of a derivative, and in this case, it is the right to buy or sell a particular financial asset before the expiration date. Knowing when to buy or sell such options is obviously key to trading them, and the most popular way to price them is to use a Monte Carlo method, which provides a probabilistic distribution of possible prices for an asset based on a variety of inputs. The presumption is that with more inputs you get better risk analysis, and therefore know better when to buy or sell the options on the asset. Adding more inputs and calculating the sensitivity of an asset to the interest rate or the underlying price of the asset or its inherent volatility, to name some of the properties in the STAC-A2 test, takes more compute power and more time to perform the calculations.
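To give a flavor of what this kind of workload is exercising – and this is just an illustrative sketch, not the STAC-A2 code – here is a bare-bones Monte Carlo pricer for a single European call option under a simple geometric Brownian motion model, using the benchmark's baseline of 25,000 paths:

```python
import math
import random

def mc_european_call(spot, strike, rate, vol, maturity, paths, seed=42):
    """Price a European call by Monte Carlo under geometric Brownian motion.

    Each path draws one terminal asset price; the option value is the
    discounted average payoff. (Illustrative only -- STAC-A2 uses a far
    richer Heston model with path dependence and early exercise.)
    """
    rng = random.Random(seed)
    drift = (rate - 0.5 * vol * vol) * maturity
    diffusion = vol * math.sqrt(maturity)
    total_payoff = 0.0
    for _ in range(paths):
        z = rng.gauss(0.0, 1.0)                  # standard normal draw
        terminal = spot * math.exp(drift + diffusion * z)
        total_payoff += max(terminal - strike, 0.0)
    return math.exp(-rate * maturity) * (total_payoff / paths)

price = mc_european_call(spot=100.0, strike=100.0, rate=0.05,
                         vol=0.2, maturity=1.0, paths=25_000)
```

Every extra input or sensitivity you want out of a model like this means repricing across all of those paths again, which is where the compute demand comes from.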
What financial services firms would like to be able to do is calculate risks in real-time, and therefore, be able to do their trading in real-time. While precise numbers are not known, the risk analysis engines of major financial firms typically have thousands of server nodes. It is one of the biggest jobs that such firms have, in fact.
The STAC-A2 test measures the speed, scaling, power and space efficiency, and the quality of the analytics involved in risk analysis for Monte Carlo simulations for options pricing. Technically speaking, the STAC-A2 test does a Monte Carlo estimation of Heston-based Greeks for a path-dependent, multi-asset option with early exercise. (You can get all of the gory details about how the test works here.) The benchmark is fairly new, having only been proposed at the SC12 supercomputing conference in November 2012. The benchmark code is written in C++, which gets it very close to the iron and therefore helps ensure high performance.
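For the curious, here is a rough sketch of what simulating a single path under a Heston-style stochastic volatility model can look like in code – a full-truncation Euler discretization with hypothetical parameter names, simplified well below what the actual benchmark implements:

```python
import math
import random

def heston_path(s0, v0, rate, kappa, theta, xi, rho, steps, dt, rng):
    """Simulate one asset-price path under the Heston stochastic
    volatility model with a full-truncation Euler scheme.
    (A sketch of the model class, not the STAC-A2 implementation.)"""
    s, v = s0, v0
    path = [s]
    for _ in range(steps):
        z1 = rng.gauss(0.0, 1.0)
        # Correlate the variance shock with the price shock via rho.
        z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        vp = max(v, 0.0)                    # truncate negative variance
        s *= math.exp((rate - 0.5 * vp) * dt + math.sqrt(vp * dt) * z1)
        v += kappa * (theta - vp) * dt + xi * math.sqrt(vp * dt) * z2
        path.append(s)
    return path

rng = random.Random(7)
# One year of daily steps, echoing the benchmark's 252-timestep baseline.
path = heston_path(s0=100.0, v0=0.04, rate=0.03, kappa=2.0, theta=0.04,
                   xi=0.3, rho=-0.7, steps=252, dt=1.0 / 252, rng=rng)
```

Multiply this by tens of thousands to millions of paths, a basket of correlated assets, and an early-exercise decision at each timestep, and the scale of the benchmark becomes clear.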
The STAC-A2 test measures a lot of aspects of the performance of the Monte Carlo simulation, but the main one is the time it takes to do a set of Greeks – measures of the option price sensitivities to outside factors. The test can be scaled up by increasing the number of assets in the option portfolio or increasing the number of paths of the asset price calculated by the Monte Carlo simulation. (A path is literally a line chart of the asset price over a set time based on an initial set of inputs; for each change in inputs, you calculate a new path for the asset price.) The baseline number of paths used to calculate the Greeks in the STAC-A2 test is 25,000 and the baseline number of assets is five. You can scale the paths a lot higher than you can scale the asset portfolio. Recently tested machines have handled millions of paths, but it is hard to double the asset basket size. The complexity of the calculations also increases with the number of timesteps – each typically corresponding to a day of trading – that are simulated. The baseline STAC-A2 test has a one-year option on the asset (252 timesteps, one for each trading day), and if you push beyond that, it takes more compute, more time, or both. The scaling of assets and paths is measured over a 10 minute period, and when you scale up one factor, the others are held constant at their baselines.
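One common way to get a Greek out of a Monte Carlo engine – again, a simplified sketch rather than the STAC-A2 method – is to "bump and revalue": reprice the option with an input nudged up and down and take the finite difference. Here is what that looks like for delta under a simple geometric Brownian motion model, with made-up function names:

```python
import math
import random

def mc_call_price(spot, strike, rate, vol, maturity, paths, seed):
    """Discounted average payoff of a European call under GBM."""
    rng = random.Random(seed)
    drift = (rate - 0.5 * vol * vol) * maturity
    diffusion = vol * math.sqrt(maturity)
    total = 0.0
    for _ in range(paths):
        z = rng.gauss(0.0, 1.0)
        total += max(spot * math.exp(drift + diffusion * z) - strike, 0.0)
    return math.exp(-rate * maturity) * total / paths

def mc_delta(spot, bump=0.01, **kw):
    """Delta by central finite difference ("bump and revalue").
    Reusing the same seed for both bumped runs keeps the random draws
    identical, so most of the noise cancels in the difference."""
    up = mc_call_price(spot * (1 + bump), **kw)
    down = mc_call_price(spot * (1 - bump), **kw)
    return (up - down) / (2 * spot * bump)

delta = mc_delta(spot=100.0, strike=100.0, rate=0.05, vol=0.2,
                 maturity=1.0, paths=25_000, seed=1)
```

Each Greek means at least one more full repricing over all of the paths, which is why a set of Greeks is so much more expensive than a single valuation.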
Here is the interesting lesson to derive from the STAC-A2 tests thus far, as David O'Shea, strategic relationship manager for the financial services industry at Intel, explains it to EnterpriseTech. Sometimes changing the way a benchmark (and therefore an application) is implemented deep in the guts of its software is as important as throwing more hardware at the problem. And the combination of better software and faster hardware is even better still.
Thus far, Intel has gone through three iterations of its implementation of the STAC-A2 test. Rev A used OpenMP for its application threading, and as O'Shea put it, what Intel was initially focused on was getting the implementation of the STAC-A2 test right. There were some optimizations of the code done with Rev A, but with Rev B of the code base, Intel's engineers worked to tune and optimize around OpenMP on various Xeon E5 systems in the "Sandy Bridge" generation. When "Ivy Bridge" Xeon E5 machines were available, this Rev B code was also run on a two-socket Ivy Bridge server with a Xeon Phi 7120P coprocessor.
"That threading model works really, really well when you have a single type of calculation – you are doing a binomial, you are doing a trinomial, you are doing Monte Carlo, you are doing Black-Scholes," explains O'Shea. "Where you are doing one kind of calculation and you are doing ten million of them. The Greeks is a different problem. There is not one type of calculation, but many calculations. What we discovered is that if we changed the threading model, we were able to improve the utilization of the hardware."
With the Rev C implementation of the STAC-A2 test, Intel shifted to the Threading Building Blocks (TBB) threading code and was able to significantly increase the performance of CPU-only systems running the STAC-A2 benchmark. TBB is a threading model that Intel has donated to the open source community. With the new TBB threading model, in fact, Intel was able to scale STAC-A2 across four-socket Xeon E7 machines and exercise all of the cores while at the same time getting better performance on two-socket machines as well. The OpenMP implementation, says O'Shea, could not really scale to four sockets, and indeed, earlier tests on a four-socket Xeon E5-4600 server showed only around a 25 percent improvement in the time spent calculating the Greeks, the asset base could only expand by 20 percent, and the maximum paths only increased by 6 percent.
"Internally, we did some tests, and we could have gotten more from OpenMP, but it was never going to be as fast as TBB," he adds. Using OpenMP, Intel saw something on the order of a 4X improvement moving from two-socket to four-socket machines, instead of the 8X improvement that it is seeing with the TBB threading.
The TBB threading model only works with C++, and it has dynamic load balancing. Because of this, the nested parallelism that is explicit in the Greeks code that is part of the STAC-A2 test "just works," as Intel put it. Because multiple Greeks are running in parallel with each other during the test, using OpenMP would have required Intel to explicitly design, "with a paper and pencil," how the code operates across each and every core – a trial and error process until you get it right. Intel also improved the use of the vector units in the Xeon processors, which involved changing some of the benchmark code as well as moving to the Intel C++ Compiler XE 14.0 Update 2 that is part of the Intel Parallel Studio XE suite. Also, HyperThreading, Intel's implementation of simultaneous multithreading, helps boost performance when used with TBB but does not with OpenMP.
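TBB itself is a C++ library, but the nested task pattern O'Shea describes can be sketched in any language with a task pool. Here is a toy Python illustration – with made-up function names and a dummy workload standing in for real path pricing – of the two levels of parallelism: several Greeks in flight at once, each fanning its paths out across worker threads:

```python
from concurrent.futures import ThreadPoolExecutor

def simulate_chunk(greek, chunk):
    # Stand-in for pricing a chunk of Monte Carlo paths for one Greek.
    return sum((greek + 1) * p for p in chunk)

def compute_greek(greek, n_paths, chunk_size=1000):
    # Inner level of parallelism: fan the paths out in chunks.
    chunks = [range(i, min(i + chunk_size, n_paths))
              for i in range(0, n_paths, chunk_size)]
    with ThreadPoolExecutor() as pool:
        return sum(pool.map(lambda c: simulate_chunk(greek, c), chunks))

# Outer level: several Greeks computed concurrently, each internally parallel.
with ThreadPoolExecutor() as pool:
    results = list(pool.map(lambda g: compute_greek(g, 4000), range(3)))
```

The appeal of a work-stealing scheduler like TBB's is that idle workers at either level can pick up tasks from busy ones automatically, rather than the programmer statically mapping each calculation to cores by hand.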
O'Shea says that most financial institutions writing such C++ programs are familiar with OpenMP, but they are not familiar with Intel's compilers and use other tools to create their applications. With such a performance boost, Intel is obviously hoping to sell a lot more Ivy Bridge iron and its software tools.
You can see the complete list of results of STAC-A2 benchmark tests at this link. The one Intel was particularly excited about was for a four-socket whitebox server made by Intel using four Ivy Bridge-EX processors. This machine was equipped with 15-core E7-4890 v2 processors running at 2.8 GHz, which is the top bin part. The machine had 1 TB of main memory and ran Red Hat Enterprise Linux 6.5, TBB 4.2 Update 3, and Intel's Math Kernel Library 11.1 and Compiler XE 14 Update 2. The mean end-to-end Greeks calculation time was 0.575 seconds, and having sub-second response time for that baseline calculation with five assets and 25,000 paths is important. With the 60 cores and 120 threads in the box, Intel was able to scale up the asset portfolio to 67 and the maximum paths to 13.5 million.
Up until now, the fastest and most scalable machine to run the STAC-A2 test was a workstation with two Xeon E5-2660 v1 processors (that's in the "Sandy Bridge" family) running at 2.2 GHz equipped with two Nvidia Tesla K20Xm GPU accelerators. That setup was running Nvidia's own Rev A code, and it was able to calculate the Greeks in 0.77 seconds; the asset basket could be expanded to 41 and the maximum paths to 8.5 million before the configuration started to choke.
The lesson here is that with the right software, a four-socket Xeon E7 server can best a two-socket Xeon E5 with two GPUs. There will no doubt be further optimizations that push up the performance of CPU-GPU and CPU-Phi hybrids.
Intel also just tested a two-socket machine based on its Xeon E5-2697 v2 processors (which have 12 cores each running at 2.7 GHz) using the Rev C code base of the STAC-A2 test, and then just yesterday did another test on this same machine configured with a Xeon Phi 7120P parallel X86 coprocessor.
On the plain vanilla whitebox two-socket Ivy Bridge server, the mean end-to-end calculation time on the Greeks was 1.109 seconds, nearly twice as long as on the four-socket system above. (This stands to reason with the TBB threading model spanning the system: the two-socket box has 24 cores against the four-socket machine's 60, and runs at slightly slower clocks, so taking roughly twice as long on parallel jobs is in line with – indeed, a bit better than – straight core-count scaling.) Holding the maximum paths constant at 25,000 paths, Intel could scale up to 47 assets from five on the STAC-A2 test over the ten-minute test period, and holding the assets constant at five, it could scale the paths up to 5.5 million. The GPU-accelerated machine was faster at calculating the Greeks and could do a lot more paths. It would be interesting to see the performance per watt and performance per dollar calculations on these systems to get a fairer comparison. STAC has just started collecting physical size and power draw stats on tested systems.
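A quick back-of-the-envelope check of that scaling, using the published core counts and clock speeds rather than any measured data:

```python
# Aggregate core-GHz of the two tested CPU-only configurations.
two_socket_core_ghz = 2 * 12 * 2.7    # pair of Xeon E5-2697 v2 chips
four_socket_core_ghz = 4 * 15 * 2.8   # quad of Xeon E7-4890 v2 chips

raw_gap = four_socket_core_ghz / two_socket_core_ghz   # raw compute ratio
observed_gap = 1.109 / 0.575                           # measured Greeks time ratio
```

The observed gap works out to about 1.93X, somewhat smaller than the roughly 2.59X difference in raw core-GHz, so the two-socket box actually holds up a little better than straight linear scaling would predict.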
With a Xeon Phi 7120P coprocessor plugged into this same two-socket whitebox, the mean Greeks calculation time was 0.869 seconds, maximum assets hit 52, and maximum paths were stretched to 6 million. The Greeks calculation time went down by 21.6 percent with the addition of the Xeon Phi, the asset basket could be 10.6 percent larger, and the path count grew by 9.1 percent.
The most interesting comparison is perhaps to take these new setups and compare them against an Ivy Bridge Xeon E5 system running the Rev B code using the OpenMP threading. That machine took 4.817 seconds to calculate the Greeks and maxed out at 29 assets and 2.7 million paths. The hardware configurations were not precisely identical between the Rev B and Rev C code tests, but that is about a factor of 4X improvement in reducing the Greeks calculation time, a factor of 2X more paths, and about a 60 percent increase in the asset portfolio by shifting from OpenMP to TBB threading and making all of the other changes in the STAC-A2 code.
It will be interesting to see if Wall Street starts thinking about four-socket Xeon E7 servers instead of Tesla or Xeon Phi acceleration for their two-socket Xeon E5 servers to boost Monte Carlo simulation and risk analysis. And think about what a four-socket box with accelerators, or even a machine with eight or sixteen sockets, might do.