Advanced Computing in the Age of AI | Friday, March 29, 2024

New NVIDIA GPU Drives Launch of Facebook’s ‘Big Sur’ Deep Learning Platform 

Facebook's Big Sur

Facebook continues to pour internet-scale money into deep learning and AI, announcing its new “Big Sur” computing platform, which is designed to double the speed of neural network training and to handle networks twice as large.

NVIDIA’s new Tesla M40 GPU, introduced last month, is the chip of choice to power the Big Sur system, which was designed by Facebook AI Research (FAIR) beginning in mid-2014 to support development of more sophisticated models and new classes of advanced applications. The system includes eight Tesla M40s, packing total throughput of nearly 60 TF and 96GB of memory.
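The aggregate figures follow from simple multiplication across the eight cards. The short Python sketch below reproduces that arithmetic; the per-GPU numbers (roughly 7 TFLOPS peak single precision and 12 GB of memory per Tesla M40) are assumptions drawn from NVIDIA's published specifications rather than from this article, and the function name is purely illustrative.

```python
# Back-of-the-envelope check of the Big Sur aggregate figures.
# ASSUMPTION: per-GPU values come from NVIDIA's published Tesla M40 specs
# (about 7 TFLOPS peak single precision, 12 GB memory), not from the article.
GPUS_PER_SERVER = 8          # stated in the article
PEAK_TFLOPS_PER_GPU = 7.0    # assumed peak single-precision throughput
MEMORY_GB_PER_GPU = 12       # assumed memory per card


def aggregate_capacity(gpus, tflops_per_gpu, memory_gb_per_gpu):
    """Return (total peak TFLOPS, total memory in GB) for one server."""
    return gpus * tflops_per_gpu, gpus * memory_gb_per_gpu


total_tflops, total_memory_gb = aggregate_capacity(
    GPUS_PER_SERVER, PEAK_TFLOPS_PER_GPU, MEMORY_GB_PER_GPU
)
print(f"Peak throughput: {total_tflops:.0f} TFLOPS")
print(f"Total GPU memory: {total_memory_gb} GB")
```

Under these assumptions the totals come to 56 TFLOPS and 96 GB, consistent with the "nearly 60 TF" and 96GB cited above. These are theoretical peaks; sustained training throughput would be lower.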

“We’ve been working (with Facebook) on Big Sur since the middle of last year,” Ian Buck, vice president of accelerated computing at NVIDIA, told EnterpriseTech, “giving them early access to the M40 and having our systems engineers help them fit the M40s inside of their platform. Putting that much horsepower in a single server is a challenging task. They’ve designed quite an elegant solution that can fully power and cool that number of GPUs and deliver maximum performance.”

Facebook is the first company to adopt the new NVIDIA accelerator, which will be used to perform the incredibly compute-intensive task of training deep neural networks to see, read, hear and understand specified objects, documents, images and videos.

“It is really just a whole other layer of optimization, performance and efficiency for doing the kind of numerical calculations necessary for machine learning,” said Buck of the Tesla M40. “It’s a full-performance GPU, it’s our fastest individual GPU, with the memory size needed for these workloads.”

The ultimate objective is to improve the ability of computers to comprehend and interact with people, whether by serving as personal assistants, such as Google Now and Apple’s Siri, or by recommending products and services that Facebook members and Amazon shoppers may want to purchase. According to Buck, about 100 million images must be fed to a network before it can handle a task as simple as recognizing a cat or a dog with 90 to 95 percent accuracy.

Ian Buck

“In many cases the training doesn’t work the first time,” he said. “The network will get to about 70 percent accuracy and then it plateaus. Then the data scientists figure out why it’s underperforming, they’ll change some of the neural pathways, train it again, and work their way up that way to get to 90-95, to get to production accuracy.”

As deep learning networks have grown larger and more complex, big players like Facebook have resorted to designing their own systems in partnership with vendors like NVIDIA.

“At Facebook, we've made great progress thus far with off-the-shelf infrastructure components and design,” said Serkan Piantino and Kevin Lee, Facebook engineers who wrote a blog post about Big Sur. “We've developed software that can read stories, answer questions about scenes, play games and even learn unspecified tasks through observing some examples. But we realized that truly tackling these problems at scale would require us to design our own systems.”

They said that, in addition to delivering improved performance, “Big Sur is far more versatile and efficient than the off-the-shelf solutions in our previous generation.

“While many high-performance computing systems require special cooling and other unique infrastructure to operate, we have optimized these new servers for thermal and power efficiency, allowing us to operate them even in our own free-air cooled, Open Compute standard data centers.”

With the announcement of Big Sur, Facebook said it is more than tripling its investment in GPU hardware to use neural networks in support of R&D efforts.

Big Sur was built to the standards of the Open Compute Project, the open-source hardware consortium started by Facebook in 2011 that has attracted support from Apple, Microsoft, HP and other vendors, along with major financial services companies, such as Fidelity and Citibank. This means the hardware design specs for Big Sur are open to AI developers in academia and business. Facebook said Big Sur was built with the intent of using the Tesla M40 as its processing engine, but the system is qualified to support a range of PCIe cards.

“This is the first server project that’s been open sourced to encourage the server community to optimize their software for this server architecture,” said Buck. “It’s a trend we’re seeing in AI, everyone competing with the community to get their platform to be the preferred platform for innovation. And the good news for us is they’re all using NVIDIA GPUs.”

EnterpriseAI