The Hyperscale Lessons Of Healthcare.gov
The last six months have been a hair-graying experience for everybody involved in the Healthcare.gov rollout. While it ultimately managed to meet its goals and enroll 7.1 million people, the disastrous start and near implosion of the website behind President Obama's signature project can provide hyperscale professionals with lessons on what went right, what went wrong, and what never to do again.
When Healthcare.gov ground to a stop soon after launching in early October 2013, MarkLogic CEO Gary Bloom was worried. "We were very concerned," Bloom tells EnterpriseTech. "Whenever you have a production system that struggles when it goes live, you have to be concerned. If you're not, then something is wrong with you as a vendor."
The Silicon Valley company had been working with other contractors on the project for years. Its role was to provide the database that powered two key parts of the Healthcare.gov website: the Federally Facilitated Marketplace (FFM), the core element of the website, where citizens can shop and apply for health insurance through Medicaid and state and private carriers; and the Data Services Hub, which acts as a data exchange with the Internal Revenue Service, Social Security Administration, and other agencies to verify eligibility for certain programs.
MarkLogic's database was picked by the government because of its flexible, schema-less approach to data storage. While the initial roll-out would only be in the hundreds of terabytes, the database would be called on to process thousands of transactions per second in a fully ACID manner, and eventually to grow into the hundreds of petabytes range. Bloom, who formerly ran Oracle's database business, said it might be one of "the most complex data integration challenges in the history of IT."
The Centers for Medicare and Medicaid Services (CMS), the agency tasked with building and running the website, picked two main systems integrators to implement Healthcare.gov: CGI Federal, the American subsidiary of a Montreal-based information technology firm, which was picked to program, configure, and install MarkLogic on standard Intel servers for the FFM; and Quality Software Services Inc. (QSSI), a systems integrator based in Columbia, Maryland, which was picked to handle the Data Services Hub as well as the Enterprise Identity Management System (EIDM) layer, which would be implemented on an Oracle database and infrastructure running on an Exadata appliance.
Soon after launch, it became apparent that the website's troubles were not the result of a massive influx of visitors, as the Obama Administration initially claimed. Instead, it appeared that there was something technically wrong with the website itself. Considering that President Obama's campaign so deftly outmaneuvered his political rivals when it came to using social media in the 2008 and 2012 campaigns – and because of the highly politicized nature of the Affordable Care Act ("Obamacare") that the website was supposed to implement – the stumble garnered national news attention.
Then the finger-pointing began. Blame was heaped on everybody involved, ranging from Health and Human Services Secretary Kathleen Sebelius and the President himself to the technology providers and systems integrators. MarkLogic took some heat too. According to a New York Times story, former CGI workers said the "unfamiliar nature" of MarkLogic's NoSQL database, and the fact that it handles data differently from traditional relational systems like Oracle's 11g and IBM's DB2, slowed the work.
CGI's unfamiliarity with MarkLogic definitely contributed to problems, says MarkLogic's senior vice president of global technical services Jon Bakke. "CGI tried to build elements of the exchange as if they were going to build it on a relational database," Bakke tells EnterpriseTech. "That does not gel with how we would normally build a MarkLogic system."
According to Bakke, MarkLogic officials repeatedly warned CGI about the dangers of using relational database constructs in a non-relational, document-oriented NoSQL data store, but the company did not listen. CGI apparently used a coding technique with the FFM that would have been commonplace on a traditional three-tier relational system, but which proved to be bulky and slow on MarkLogic's NoSQL database.
"The way that a lot of the code was developed for the application was using automated modeling tools," he says. "They configured their modelers to [generate middle-tier Java objects] in a three-tier style architecture. Essentially, they created a layer of abstraction that didn't need to be there."
The public looks at Healthcare.gov as a website. But in actuality, it is a complex enterprise system that happens to have a Web interface, and the coding technique selected by CGI simply wasn't appropriate for an application of this scale. "When you let code gen tools do that, they oftentimes don't perform well," Bakke says. "They'll perform well for 1,000 users, just not 100,000 users, because there's so much overhead built-in."
MarkLogic had trained 60 to 70 CGI employees in the years leading up to the launch of the website, and those developers were proficient in the MarkLogic technology, Bakke says. But a high rate of turnover at CGI meant that none of the trained developers were working for the firm when the website went live on October 1. At that point, the biggest problems revolved around project management and a lack of requisite skills.
"In retrospect, had we challenged CGI a little more, we might have had an impact on that," Bakke says. "But there were so many application developers moving in so many directions that, in the end, the sheer management of the personalities and their ability to congeal around one way forward ended up to be the primary reason they weren't successful."
After the problems appeared and the Obama Administration made fixing them a high priority, MarkLogic increased the number of its workers on the project from about six to about 35, and CGI's role diminished. The MarkLogic workers stripped out unnecessary layers of Java code and cut page response times from about 1.5 seconds to less than one-third of a second, Bakke says.
At this point – in late October and early November – your chances of hearing the phrase "MarkLogic is down" decreased significantly in and around CMS. However, the website as a whole still had problems, which led the Obama Administration to call in a "tech surge" to fix it. On October 24, the system integrator QSSI was put in charge of the website, and White House CTO Todd Park subsequently brought in several experts, including Michael Dickerson, a site reliability engineer at Google who worked on Obama's election campaign.
There are few people who know all the details of how Healthcare.gov was salvaged. But from what Bakke related to EnterpriseTech – and with the knowledge that hindsight is 20/20 – it wouldn't have taken a brilliant technician (which Dickerson admittedly is) to see where the problems were and start fixing them.
For starters, there were hardware problems galore. Terremark, the Verizon subsidiary that was hired to implement and host the servers and storage arrays, chose to use three different NetApp filers, Bakke says. "In standard best practice, you would normally expect to see a homogenized implementation across the storage and compute tiers, where all the servers are identical, where all the storage is identical," he says. "They chose not to do that or didn't have enough time. I don't know which." Terremark was fired for the poor performance and will be replaced by Hewlett-Packard.
The implementation of the network was also botched, according to Bakke. Initially, the network was rated for about 4 Gb/sec of throughput, but was throttled back to 1 Gb/sec. When the network was expanded to 60 Gb/sec, it was limited to 15 Gb/sec of throughput. Although there was spare capacity on the network, CGI was forced to run tests during the day on the production system because there was no way to separate the production network from secondary ones. "They had the capacity. It was just not available, and there was nobody really to explain why," he says. "It was just improperly done."
The MarkLogic database itself was configured to run on 48 nodes equipped with Red Hat's Enterprise Linux in VMware virtual machines installed across 12 to 15 Intel-based servers, Bakke says. For whatever reason, some of those VMs were running on VMware vSphere version 4.1 while others were running on vSphere 5.1. "When you're running in a datacenter and see that level of inconsistency in the most obvious aspects, you have to believe the networking, the cabling, the hardware. . . all these things are inconsistently done in an untested configuration," he says.
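The kind of inconsistency Bakke describes is cheap to detect once an inventory of the cluster exists. The following is a minimal sketch in Python, not anything the project is known to have used: the node names and the small sample inventory are hypothetical, and only the two vSphere version numbers come from Bakke's account.

```python
from collections import defaultdict

def group_by_version(inventory):
    """Group node names by the hypervisor version each one reports."""
    groups = defaultdict(list)
    for node, version in inventory.items():
        groups[version].append(node)
    return dict(groups)

def is_homogeneous(inventory):
    """True when every node reports the same hypervisor version."""
    return len(set(inventory.values())) <= 1

# Hypothetical sample: node names are invented for illustration.
sample = {
    "ml-node-01": "4.1",
    "ml-node-02": "5.1",
    "ml-node-03": "5.1",
    "ml-node-04": "4.1",
}

if not is_homogeneous(sample):
    for version, members in sorted(group_by_version(sample).items()):
        print(f"vSphere {version}: {', '.join(members)}")
```

A check like this would only flag the most visible layer of drift; as Bakke notes, mismatched hypervisor versions tend to hint at inconsistencies elsewhere in the stack.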
MarkLogic continues to work on the Healthcare.gov website. It is working closely with Accenture, which was brought in to replace CGI, and continues to work closely with QSSI, Avalon Consulting, and HTC Consulting. The FFM portion of the website has posted an uptime number greater than 99 percent since late November, and has not lost any data. The Data Services Hub, which MarkLogic worked closely on with QSSI, was developed in the proper NoSQL fashion from the start, and has not had any issues, Bakke says.
Unfortunately, the Healthcare.gov website has still experienced problems. According to Bakke, the downtime experienced on Monday, March 31, was related to the Oracle-based EIDM, which he says has been down 60 to 70 times over the course of the six-month project. Apparently, the EIDM was coded to stop accepting applications at midnight on March 30 instead of March 31, and nobody caught the error in time, he says. As a result, CMS had to dump the queue of website visitors in the system several times on Monday before the system could be fixed and rebooted. "I think they're loath to acknowledge what would have happened if they chose it for the database," he says.
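A cutoff error of this sort often comes from the ambiguity of "midnight": a deadline written as midnight on the last day can be read as the first instant of that day rather than the last. The Python sketch below illustrates that general pitfall only; it is not the EIDM's actual logic, the function names are hypothetical, the year 2014 is inferred from the October 2013 launch, and time zones are ignored.

```python
from datetime import date, datetime, time, timedelta

# Assumed deadline: the last day on which applications are accepted.
LAST_DAY = date(2014, 3, 31)

def cutoff_ambiguous(last_day):
    """Reads 'midnight on the last day' as 00:00 at the START of that day."""
    return datetime.combine(last_day, time.min)

def cutoff_exclusive(last_day):
    """First instant AFTER the last day; accept while now < cutoff."""
    return datetime.combine(last_day + timedelta(days=1), time.min)

def is_open(now, cutoff):
    """Applications are accepted strictly before the cutoff."""
    return now < cutoff

monday_morning = datetime(2014, 3, 31, 9, 0)
print(is_open(monday_morning, cutoff_ambiguous(LAST_DAY)))  # False
print(is_open(monday_morning, cutoff_exclusive(LAST_DAY)))  # True
```

Expressing the deadline as an exclusive bound on the following day removes the ambiguity, and a boundary test for the final day is the sort of check that would catch the error before it reached production.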
Despite the hiccups, the MarkLogic team is counting Healthcare.gov as an unqualified success. "One thing I was never concerned about was the ability of MarkLogic to scale to handle the millions of users that will have to use the site over the course of time," Bloom says. "This is a project that could not have been done successfully in relational technology. It's an absolutely high-water proof point of what NoSQL technology can do."
Government contractors are not generally known for their efficiency and adherence to deadlines. But the lack of testing, oversight, communication, and basic enterprise IT common sense in the initial Healthcare.gov roll-out may turn out to be a case study in how not to implement a hyperscale Web application. And the final setup may show how to do it right.