Advanced Computing in the Age of AI | Friday, April 19, 2024

Yahoo, NASCAR Intrigued by Spectra DS3 Object Storage for Tape 

Tape is not dead by any stretch of the imagination at the largest data centers of the world. But it could very well be that the bell is tolling for traditional tape backup software.

Tape library maker Spectra Logic hosted its Forever Data 2013 conference in Denver, Colorado this week, and EnterpriseTech was on hand to learn all about the company's new DS3 specification to add support for sequential storage to Amazon Web Services' S3 object storage protocol. We also got a peek at Spectra's forthcoming BlackPearl appliance, which implements that DS3 specification to interface with its tape libraries, thus allowing modern Web protocols to move files in and out of the archive.

We also spent some time talking to executives from Yahoo and NASCAR Productions, which have very large data archives on tape, about how the combination of the DS3 protocol and the BlackPearl appliance might be useful in their shops.

Spectra is calling its new approach to archiving "deep storage." According to the company's chief marketing officer, Molly Rector, it is targeting the usual suspects in the life sciences, media and entertainment, and oil and gas industries, as well as hyperscale Web providers and enterprises building private clouds and big Hadoop clusters. Deep storage also has applicability to federal and state governments, she says, as well as to traditional active archive customers across all industries that rely on a mix of disk and tape storage and have automated the movement of hot and cold data across them.

The problem that Spectra has tackled is simple to describe but not so simple to solve. Systems and now even tape libraries have file systems to organize their data. File systems are great when people want to share and edit data concurrently, but their nested architecture makes it difficult to search for data. Basically, you have to know where something is as well as what it is called to find it. With modern Web applications, you often just want to get a bit of data – a video clip or a snippet of audio, for instance – and either download it or make use of it out on the Web inside an application. Moreover, you usually have lots of such files, and the tree structure of a file system becomes too cumbersome.

Object stores, like Amazon's Simple Storage Service, or S3, were created to get around the limitations of file systems and allow large numbers of objects to be housed in what is essentially a flat file system. You give something a name, give the object store a bunch of metadata to describe what it is, and now you have a simple way to dump large numbers of files into that object store while still being able to find a particular file quickly thanks to the metadata.
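To make the idea concrete, here is a minimal sketch of a flat, metadata-driven store. It is a toy in-memory model and not how S3 is implemented; the class `FlatObjectStore`, its methods, and the object names and metadata keys are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    name: str
    data: bytes
    metadata: dict = field(default_factory=dict)


class FlatObjectStore:
    """Toy in-memory object store: one flat namespace, no directories."""

    def __init__(self):
        self._objects = {}  # object name -> StoredObject

    def put(self, name, data, **metadata):
        # Store (or overwrite) an object under its name with descriptive metadata.
        self._objects[name] = StoredObject(name, data, dict(metadata))

    def get(self, name):
        # Direct lookup by name; no path traversal is involved.
        return self._objects[name].data

    def find(self, **criteria):
        # Return the names of objects whose metadata matches every criterion.
        return sorted(
            obj.name
            for obj in self._objects.values()
            if all(obj.metadata.get(k) == v for k, v in criteria.items())
        )


store = FlatObjectStore()
store.put("clip-0001", b"...video bytes...", kind="video", event="race-12")
store.put("clip-0002", b"...audio bytes...", kind="audio", event="race-12")
store.put("clip-0003", b"...video bytes...", kind="video", event="race-13")

video_clips = store.find(kind="video")
race_12_video = store.find(kind="video", event="race-12")
```

Nothing in the lookup depends on knowing where an object lives; the name retrieves it and the metadata finds it, which is the property the file system tree lacks.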

The S3 specification has been opened up by AWS so other cloud providers and application writers can adhere to it, and it uses simple HTTP commands, in the style known as Representational State Transfer, or REST for short, to do object manipulation. These commands include GET, PUT, POST, and DELETE, and the important thing is that they can be scripted with any number of popular programming languages.
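As a rough illustration of how little scripting is involved, the sketch below builds, but does not send, HTTP requests for a single object using Python's standard library. The endpoint, bucket, and object names are hypothetical, and a real S3-compatible service would also require signed authentication headers, which are omitted here.

```python
import urllib.request

# Hypothetical endpoint; a real S3-compatible service would also require
# signed authentication headers, which this sketch omits.
ENDPOINT = "https://objectstore.example.com"


def build_object_request(method, bucket, key, body=None):
    """Build (but do not send) an HTTP request for one object."""
    url = f"{ENDPOINT}/{bucket}/{key}"
    return urllib.request.Request(url, data=body, method=method)


put_req = build_object_request("PUT", "archive", "clip-0001", b"video bytes")
get_req = build_object_request("GET", "archive", "clip-0001")
del_req = build_object_request("DELETE", "archive", "clip-0001")

# Show the verb and URL each request would use.
for req in (put_req, get_req, del_req):
    print(req.get_method(), req.full_url)
```

The same URL identifies the object in every case; only the HTTP verb changes with the operation.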

The problem with S3 is that it only speaks to random access storage, like flash memory and disk drives, and it has no concept of sequential storage like a tape drive and its removable cartridges. So Spectra created a superset of the S3 specification called DS3 that adds Bulk PUT and Bulk GET commands to write and read a large number of objects. The DS3 spec also can take an S3-style "bucket" of objects and map it to a collection of tape cartridges that can be treated as a group and ejected if they need to be moved to an offsite archiving facility or, in the media business, so they can be sent off for post-production work.
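One way to see why a bulk command matters for sequential media: if the archive knows up front every object that is wanted, it can mount each cartridge once and read it front to back instead of seeking back and forth. The sketch below illustrates that planning step only. It is not taken from the DS3 specification; the catalog, the cartridge barcodes, and the function `plan_bulk_get` are hypothetical.

```python
from collections import defaultdict

# Hypothetical catalog: object name -> (cartridge barcode, position on tape).
CATALOG = {
    "clip-0001": ("TAPE-A", 700),
    "clip-0002": ("TAPE-B", 120),
    "clip-0003": ("TAPE-A", 50),
    "clip-0004": ("TAPE-B", 30),
    "clip-0005": ("TAPE-A", 400),
}


def plan_bulk_get(names, catalog):
    """Group requested objects by cartridge and order them by tape position."""
    by_cartridge = defaultdict(list)
    for name in names:
        cartridge, position = catalog[name]
        by_cartridge[cartridge].append((position, name))
    # Each cartridge is listed once, with its objects in front-to-back order.
    return {
        cartridge: [name for _, name in sorted(entries)]
        for cartridge, entries in sorted(by_cartridge.items())
    }


plan = plan_bulk_get(
    ["clip-0001", "clip-0002", "clip-0003", "clip-0004", "clip-0005"], CATALOG
)
```

Issued as five separate GETs in the order requested, the same reads would bounce between two cartridges and rewind within each; the plan above needs two mounts and one forward pass per tape.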


The DS3 specification does not speak directly to tape. You still need an intermediary to do the translating, and that is what the BlackPearl appliance does. This is an off-the-shelf X86 server with four flash drives that takes objects streaming off the production systems and stages them to be archived onto Spectra's tape libraries. Internally, the appliance uses the ZFS file system, developed by Sun Microsystems many years ago, to manage objects. It has InfiniBand or Ethernet links to production machines and Fibre Channel or SAS links to the tape libraries, which run the LTFS file system.

David Trachy, senior director of emerging storage technologies at Spectra, says the BlackPearl appliance has enough oomph to drive four tape drives in a Spectra library at the same time. If you need more throughput, you add more appliances. (Incidentally, the appliance can also be used to import data from one drive and format, say LTO5, and export it to a newer drive and format, say LTO6 or TS1140, all in the background.)
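The scaling rule is simple arithmetic, sketched below under the assumption that each appliance keeps exactly four drives busy, as Trachy describes. The function name `appliances_needed` is hypothetical, and real sizing would also depend on network and drive throughput, which this ignores.

```python
import math

DRIVES_PER_APPLIANCE = 4  # figure quoted by Spectra in the article


def appliances_needed(tape_drives):
    """Minimum number of appliances to drive the given number of tape drives."""
    if tape_drives < 0:
        raise ValueError("tape_drives must be non-negative")
    # Round up: a partially used appliance still has to be installed.
    return math.ceil(tape_drives / DRIVES_PER_APPLIANCE)
```

A library with ten drives streaming at once would therefore need three appliances, with the third only half used.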

The DS3 specification is out now, and Spectra is hoping that many companies will write client drivers for their various applications that adhere to it. The BlackPearl Appliance is in beta testing and will be generally available in December. The plan is to bundle the appliance with the Spectra libraries for a nominal fee.

One of the early testers for the DS3 protocol and the BlackPearl appliance is Yahoo. The Internet giant has three main data centers, which are located in Omaha, Nebraska; Seattle, Washington; and Lockport, New York (outside of Buffalo, where the air is cool and the electricity from Niagara Falls is cheap). Like other hyperscale Web operators, Yahoo doesn't like to throw its data away, but the truth is a lot of data does end up in the bit bucket. Part of the problem, explains Kevin Graham, principal storage architect at Yahoo, is that there is no easy way to get data out of production systems and into the tape archive and then back out again.

Today at Yahoo, data is pulled out of production systems and dropped onto disk arrays, and then traditional tape backup software is used to push it out to the tape libraries. Each of those data centers has over 50,000 tape cartridges – some sitting in Spectra Logic libraries, some sitting in ones made by the few other vendors in the market. These libraries have hundreds of petabytes of data – Graham is forbidden to say precisely how much – and they would perhaps have even more if it were not so tough to move data from production systems to archive.

"Unfortunately, backups are not useful," says Graham. "Archive is what we want. Backups are what we are stuck with."

While the LTFS file system added to tape drives is helpful in that it makes tape look a bit more like disk and therefore a little easier to organize and search, you are still coping with a file system and all of the limitations that come with that. What Yahoo really wants is object-level storage for tape. Given its scale, with billions and billions of objects and perhaps trillions in its future as it gives Yahoo users 1 TB Flickr photo and 1 TB Yahoo Mail accounts, it doesn't want to wait for a tape backup to run before it starts pushing objects out to tape. Objects should be created in Yahoo apps and pushed out to the archive as soon as possible, particularly if Yahoo wants to be a trusted archiver of our lives.

"This is going to be really disruptive," Graham says of the combination of the DS3 protocol and the BlackPearl appliance. "We have a system and it is real, and we are fleshing out the use cases."

One of those early use cases is archiving datasets in Hadoop to tape and then using the DS3 protocol to make it easy for data scientists on the Yahoo Labs team, who are constantly refining algorithms to make sure we see the right ads and content on Yahoo sites, to get that data back out of the archive.

Yahoo has clusters with a combined capacity of 100,000 X86 cores driving its Hadoop applications. These machines chew on an enormous amount of data, but there is no way all of the data that Yahoo generates can be stored online in its many Hadoop clusters. So it gets pushed out to tape using backup software. (Graham can't name names on what software Yahoo uses.)

To get the data back into a system, Yahoo has to restore an archive to disk and then monkey around with it to get the portion of the data set that a data scientist wants. This is inefficient and slow. With a RESTful interface, data scientists could search the tape archive's metadata for the chunks of data they want and issue a Bulk GET command to retrieve them to a Hadoop cluster. Apart from the time it takes to get that data out of the archive, which could be days to weeks depending on the size of the data sets, the Yahoo Labs team would not be able to tell that they are not pulling the files off Dropbox.

Yahoo has been instrumental in helping Spectra create a Hadoop connector for the DS3 protocol, and Spectra hopes to find members of the Hadoop community who will help get the connector adopted in the upstream code so that it eventually becomes part of the commercial Hadoop distributions.

By making the Hadoop connector free and embedding the BlackPearl appliance in its tape libraries, Spectra has set itself up to sell more tape libraries as Hadoop takes off. The hyperscale Web operators are not as big a part of the company's business today as the traditional HPC and media and entertainment markets. But as Hadoop is adopted by more and more large enterprises, and they succumb to the idea that they, too, can't throw any data away, this business could soon rival HPC and media/entertainment as a driver of Spectra's library business.

Driving down costs
