Advanced Computing in the Age of AI | Thursday, March 28, 2024

CFEngine Pushes Scalability, Adds Enterprise Features 

The CFEngine system management tool has been around for more than two decades, and after bringing in a new management lineup last year, the company bearing the same name that controls it is ramping up its efforts to push CFEngine into more enterprises.

The launch of CFEngine 3.6.0 this week, which the company characterizes as a major release, includes the usual scalability enhancements that allow for more devices to be brought under control of CFEngine. The update also includes other features that enterprise customers need, such as more sophisticated reporting capabilities and a new management dashboard. The domain-specific language that is used to create the policies that govern how CFEngine manages hardware and software has also been radically simplified, allowing for these policies to be created with a lot less code.

CFEngine is also hinting to customers about how it will branch out from systems and their software stacks to network devices and take control of these as well. In an increasingly converged datacenter, tools have to be able to manage servers, storage, and networking as well as the system and application software that makes use of all three.

Mark Burgess, who is co-founder and CTO at CFEngine, was a post-doctoral fellow of the Royal Society at Oslo University more than two decades ago when he cobbled together a tool to manage a cluster of workstations in the Department of Theoretical Physics. The tool was open sourced under a GNU GPL license in 1993 and rode up alongside the dot-com boom to popularity among the first wave of Internet companies, which were looking for tools to manage their large-scale distributed infrastructure.

CFEngine is written in C and does not depend on Ruby, Python, Perl, or other languages, which is one reason why some companies like it. The software has a central controller, called a hub, which feeds data out to agents on the servers. Rather than having the hub constantly push data to the agents on the server nodes, CFEngine has each agent pull the whole policy for the infrastructure, and the agents have enough smarts to figure out which parts apply to them individually. In this manner, if the central CFEngine hub goes down, all of the servers in the infrastructure still know what they are supposed to do and can do it.
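The pull model described above can be sketched in a few lines of Python. This is an illustration only, not CFEngine's implementation: the names `Promise`, `Hub`, `Agent`, and the class labels such as "webserver" are all hypothetical. It shows each agent copying the complete policy, selecting the parts that match its own classes, and carrying on from its cached copy when the hub is unreachable.

```python
from dataclasses import dataclass, field


@dataclass
class Promise:
    """One policy entry, guarded by the class of host it applies to."""
    applies_to: str   # hypothetical class label, e.g. "any" or "webserver"
    action: str       # human-readable description of the desired state


@dataclass
class Hub:
    """Holds the full policy and only answers pull requests."""
    policy: list

    def serve_policy(self):
        # Every agent receives the same, complete policy.
        return list(self.policy)


@dataclass
class Agent:
    name: str
    classes: set
    cached_policy: list = field(default_factory=list)

    def pull(self, hub):
        """Refresh the local copy; keep the old one if the hub is down."""
        if hub is not None:
            self.cached_policy = hub.serve_policy()

    def applicable(self):
        """Select the promises that match this host's own classes."""
        return [p.action for p in self.cached_policy
                if p.applies_to in self.classes]


hub = Hub([
    Promise("any", "ntp running"),
    Promise("webserver", "nginx installed"),
    Promise("database", "postgres installed"),
])
web = Agent("web01", {"any", "webserver"})
web.pull(hub)    # normal pull
web.pull(None)   # simulated hub outage: the cached policy is kept
print(web.applicable())  # ['ntp running', 'nginx installed']
```

Because the filtering happens on the agent, the hub in this sketch needs no per-host state, which is one plausible reason a pull design tolerates hub outages.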

"Traditional monitoring tools are forensic science," explains Mahesh Kumar, vice president of marketing at the company. "They come in after the fact to try to piece things together to try to figure out what went wrong. CFEngine is a little different in that we build the blueprint for the IT real estate, which embodies the desired state of the applications, and we can see what things are not performing to that blueprint."

More than 10 million servers in the world are managed by CFEngine today, which is around a quarter of all of the machines installed, depending on whose estimate of the server base you use. Around 10,000 companies in at least 100 countries are using CFEngine.

The largest CFEngine installation has more than 200,000 servers under management. CFEngine has a mix of enterprise and hyperscale customers, with Deutsche Telekom, Goldman Sachs, Chevron, Intel, AMD, State Farm, JPMorganChase, and DirecTV being the marquee enterprise accounts and LinkedIn, eBay, and Salesforce.com being the big names in the hyperscale base. The US Navy and the Department of Energy are also big users. EnterpriseTech told you last fall about how LinkedIn had hacked a Redis NoSQL data store onto CFEngine to make it possible to search for all of the relevant system metrics across its fleet of 30,000 servers before applying patches; LinkedIn manages more than 5,000 servers per administrator. One unnamed big banking customer has implemented CFEngine on tens of thousands of nodes and integrated the tool with its application source code control system. It is able to manage the tier three infrastructure systems in its clusters at a ratio of 10,000 servers per admin, and in the tier two application layer, where the issues are more complex, it is getting a ratio of 500 servers per admin. (More detail about this difference was not available.)

With the CFEngine 3.6.0 update, the configuration management tool is able to scale across a larger number of nodes. With CFEngine 3.5.0 last year, a single instance of the CFEngine hub was able to handle the configuration policies of a few thousand servers, each with its own agent to pull policy updates from the hub. With the 3.6.0 update, CFEngine has been able to easily push that up to around 8,000 nodes, says Kumar, whose company spun up that many nodes on Amazon Web Services just to prove the point. For those who have even larger scale server farms to manage, CFEngine 3.6.0 has the ability to federate multiple hubs together.

The company thinks it can push it even further, in part thanks to the simplified domain-specific language that is part of the release. This new language can create a policy with a 90 percent reduction in code. The update also has support for JSON so CFEngine can be integrated with other tools. For instance, JSON could be used to link CFEngine to a ticketing system such as ServiceNow, allowing actions to be triggered on systems once a service ticket in ServiceNow has been approved and updating ServiceNow once the changes have been made. CFEngine 3.6.0 also has improved file templating based on Mustache (so named because it uses a lot of curly braces). Policies that were written for CFEngine 3.5.0 still work in 3.6.0.
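To give a feel for how JSON data and curly-brace templates fit together, here is a minimal Python sketch. It is not CFEngine code and covers only a small subset of Mustache: plain `{{name}}` variable tags, with no sections, partials, or escaping. The function name `render` and the sample keys (`host`, `server.port`) are made up for illustration.

```python
import json
import re

# Matches {{name}} or {{ dotted.name }}; only variable tags are handled here.
TAG = re.compile(r"\{\{\s*([A-Za-z_][A-Za-z0-9_.]*)\s*\}\}")


def render(template, data):
    """Replace each {{name}} tag with the matching value from data.

    Dotted names such as {{server.port}} walk nested dictionaries.
    Unknown names render as an empty string, as they do in Mustache.
    """
    def lookup(match):
        value = data
        for key in match.group(1).split("."):
            if not isinstance(value, dict) or key not in value:
                return ""
            value = value[key]
        return str(value)

    return TAG.sub(lookup, template)


# JSON as it might arrive from another tool (hypothetical payload).
payload = json.loads('{"host": "web01", "server": {"port": 8080}}')
template = "listen {{server.port}};\nserver_name {{ host }};\n"
print(render(template, payload), end="")
# listen 8080;
# server_name web01;
```

The appeal of this style is that the template stays free of logic: whatever produces the JSON decides the values, and the template only says where they go.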

One aspect of scale is how fast you can identify the machines that need to be updated, and the other is how quickly they can then be updated. As an example of speed, Kumar says that when the Heartbleed security bug in OpenSSL hit in April, CFEngine customers could patch tens of thousands of servers in a matter of hours. Another measure of speed is how many changes you make per day. Online marketer HubSpot makes over 200 changes each day to its stack running on the AWS cloud. You have to automate that; you simply cannot do it by hand. This speed part of scale is one that all companies – large and small – are wrestling with. Even if you don't have tens of thousands of servers, you do need to be more agile and that means making more changes to software more often.

With the latest CFEngine, the link between the hub and agents has been made more efficient, which frees up compute and network capacity between the two. The change is a simple one: the 3.6.0 hub now sends only the changed data from policies over the network, rather than the whole file, each time an agent pulls.
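The article does not say how CFEngine detects what has changed, so the following Python sketch is only one common way to do it, under the assumption that changes are tracked per policy file by comparing content digests. The function names and the `.cf` file names are hypothetical.

```python
import hashlib


def digest(content):
    """Return a SHA-256 hex digest of a policy file's text."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def changed_files(hub_files, agent_digests):
    """Return only the hub files whose content differs from the agent's copy.

    hub_files: mapping of file name to current content on the hub.
    agent_digests: mapping of file name to the digest the agent last stored.
    """
    return {name: content for name, content in hub_files.items()
            if agent_digests.get(name) != digest(content)}


def removed_files(hub_files, agent_digests):
    """Return the names the agent still holds but the hub has dropped."""
    return sorted(set(agent_digests) - set(hub_files))


hub_files = {"web.cf": "nginx v2", "ntp.cf": "ntp v1"}
agent_digests = {
    "web.cf": digest("nginx v1"),   # stale copy on the agent
    "ntp.cf": digest("ntp v1"),     # already up to date
    "old.cf": digest("retired"),    # no longer on the hub
}
print(changed_files(hub_files, agent_digests))  # {'web.cf': 'nginx v2'}
print(removed_files(hub_files, agent_digests))  # ['old.cf']
```

In this sketch an agent that is already current transfers nothing but digests, which is where the network saving would come from.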

[Figure: CFEngine block diagram]

The new CFEngine has some architectural tweaks as well. The tool now includes a new portal that has alerts and dashboards to make it easier for a human brain to absorb the state of thousands of machines at the same time. The backend in the CFEngine hub now includes a PostgreSQL database for storing the policies relating to thousands of managed operating systems (which can be servers, desktops, mobile devices, whatever). And as LinkedIn has done, CFEngine now includes a Redis key-value store that holds all of the hardware and software configuration data about each system, allowing for super-fast searches so admins can identify what machines need to be patched. This is the labor-intensive part of the admin job that CFEngine has not done well in the past.
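The idea behind such an inventory search can be shown with an in-memory index in Python. This is a stand-in for a key-value store, not a description of how CFEngine or LinkedIn lays out data in Redis; the `Inventory` class and the host names are hypothetical. The OpenSSL versions are used only as sample data, with 1.0.1e and 1.0.1f standing in for releases affected by Heartbleed and 1.0.1g for the fixed one.

```python
from collections import defaultdict


class Inventory:
    """In-memory stand-in for a key-value store of host configuration data."""

    def __init__(self):
        # (package, version) -> set of host names reporting that version
        self._by_package = defaultdict(set)
        # host name -> last reported {package: version} mapping
        self._by_host = {}

    def report(self, host, packages):
        """Record a host's packages, replacing whatever it reported before."""
        for key in self._by_host.get(host, {}).items():
            self._by_package[key].discard(host)
        self._by_host[host] = dict(packages)
        for key in packages.items():
            self._by_package[key].add(host)

    def hosts_with(self, name, versions):
        """Return the sorted hosts running any of the given versions."""
        found = set()
        for version in versions:
            found |= self._by_package.get((name, version), set())
        return sorted(found)


inv = Inventory()
inv.report("web01", {"openssl": "1.0.1f"})
inv.report("web02", {"openssl": "1.0.1g"})
inv.report("db01", {"openssl": "1.0.1e"})
print(inv.hosts_with("openssl", ["1.0.1e", "1.0.1f"]))  # ['db01', 'web01']

inv.report("web01", {"openssl": "1.0.1g"})  # web01 gets patched
print(inv.hosts_with("openssl", ["1.0.1e", "1.0.1f"]))  # ['db01']
```

Each lookup touches only the keys for the versions asked about, so the cost of finding the machines to patch does not grow with the number of hosts that are already fine.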

Like many commercializers of open source tools, CFEngine has what is called an open core model, where the core product is open source and free as a community edition and add-on features and extensions are available in an enterprise edition. The latter is sold with a subscription support contract that costs under $100 per node – how much under, Kumar would not say. Volume discounts apply for a three-year contract and as the number of nodes increases. The enterprise edition has APIs exposed that integrate into CMDB tools and Git code repositories as well as the new Mission Portal user interface and real-time dashboards; the priced version also has pre-built packages for AIX and Solaris Unixes as well as Windows. (Only the enterprise edition can edit registry files in Windows and link into Windows service management and the NTFS file system permissions.) If you need to scale to thousands of nodes, you will have to buy the enterprise edition; it is not clear how many nodes the community edition can scale across.

Looking ahead, CFEngine has its eye on the networking stack. The company is working with a networking vendor, which Kumar cannot name, to manage network gear. "Anything that you can do in terms of server management, you should be able to do with network management, even if the use cases are different," he says. "Managed networks configure themselves dynamically based on actions, assets, time, and any arbitrary condition. And CFEngine could configure or blacklist services, report inventory or compliance, define variables, and find changes in the network. Anything that I can put down as a policy – this switch can only connect to this particular network, and as soon as it deviates, CFEngine will catch that."

In the past year, a new management team has been brought in to do a better job commercializing CFEngine, and the expansion out to networking is part of the plan. In 2013, revenues were up 70 percent; the company is privately held and has 40 employees. In 2011, CFEngine secured $5.5m in Series A financing from Ferd Capital and moved its headquarters to Palo Alto, California. If growth continues in 2014, it is reasonable to expect some more investors in subsequent funding rounds and a further build out of capability of the CFEngine tool. Many of the biggest users of the tool have a vested interest in seeing it improve and scale, and they have the money to make it happen, too.

EnterpriseAI