InkaBinka Moves From Cloud To Moonshot For Launch
The premise behind a new startup called InkaBinka, which launched this month and which is only a year old, is that none of us have the time to keep up with what is going on in the world. Or more precisely, we do not have time to wade through thousands of words in multiple stories to distill the salient facts of a piece of news in business, politics, technology, or entertainment.
And to that end, InkaBinka has come up with a different kind of news aggregation service that finds those salient facts by combing through multiple news sources and presents them simply in four bullet points with imagery that its co-founders say helps the human brain retain that data better than sitting through what InkaBinka disparagingly calls “the long read.”
Whether or not you think long form journalism is dead, just napping, or having a bit of a resurgence, the co-founders behind InkaBinka are not so much interested in having that argument as they are in perfecting natural language processing algorithms that can get InkaBinka off the ground. Founders Kevin McGushion, who is CEO, and Chris Brahmer, who is COO, want to prove they can automate the process of providing useful summary information on a wide variety of topics from myriad text and imagery sources culled from the Internet. News aggregation is merely the first implementation of their software, which could be used for all kinds of summarization in different contexts.
The idea behind InkaBinka is simple enough, and it has some echoes of Twitter, which sounded downright silly to many people at first. What can you say that is meaningful in 140 characters? Enough to start a war, as it turns out. (It would not be surprising for Twitter or Facebook or another Web giant to snap up InkaBinka if the service, which launched two weeks ago, takes off.) InkaBinka chews on the content on the Internet and compresses it down to four bullet points of 425 words, plus some still imagery, that can be consumed by most readers in about 20 seconds.
Other news aggregators, such as Google News, Zite, Pulse, and Flipboard, all link to the longer stories out there on the Web, but InkaBinka wants to pull out the important bits and present them. (It has to do so in such a way as to not violate copyrights, too, which seems a bit tricky.) This is a day many journalists have been worrying about, particularly after Watson beat the human champs on the Jeopardy! game show a few years back. But all is not yet lost. InkaBinka still needs journos and the publications that support them as grist for its mill. For now at least. But, at some point, the Internet will write stories about itself, and companies like InkaBinka are going to help move that process along.
The interesting bits for EnterpriseTech, of course, are the technology choices that a young startup like InkaBinka makes as it braces to go from a few server nodes to hyperscale. Services like Facebook, Twitter, and LinkedIn have all ridden up the hyperscale hockey stick, and it is important to make technology choices that allow for fast growth if a service takes off. (No one wants to be showing users of a popular service the Fail Whale, as Twitter did when its systems crashed in the early years.) Interestingly, InkaBinka started out on several different clouds during its development stage, and after playing on the clouds the startup has opted to set up its own datacenter and use Moonshot hyperscale systems from Hewlett-Packard to run its applications.
The back-end of the InkaBinka system is fairly complex, says Brahmer, and is based on a mix of Windows and Linux platforms.
“From the get-go, we just agreed that we were going to just use the right tool for the job, dictated by whatever workload and what we are trying to accomplish. We are not married to any particular vendor or software platform. We need to make the service really fast and it needs to be able to support a ton of users, and hopefully this thing takes off and by all accounts it looks like we are heading in that direction.”
The InkaBinka service polls about 1,000 news sites across the world, and every two minutes it pulls in anything that has changed on those sites for processing. By doing keyword searches against this mountain of data, the InkaBinka service can figure out what is, in fact, the news. Meaning the new stuff that people care about most.
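To make the polling step concrete, here is a minimal sketch of one way such a loop could work: fingerprint each fetched page so that only changed content moves on, then count how many distinct sources mention each keyword. This is not InkaBinka's code. The function names, the tiny stop-word list, and the in-memory `seen` dictionary are all assumptions made for illustration.

```python
import hashlib
import re
from collections import Counter

# Hypothetical stop-word list; a real service would use a much larger one.
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "on", "for"}


def changed_pages(pages, seen):
    """Return the pages whose content differs from the previous poll.

    pages: dict mapping URL -> page text fetched in this poll.
    seen:  dict mapping URL -> SHA-256 digest from earlier polls (updated in place).
    """
    changed = {}
    for url, text in pages.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if seen.get(url) != digest:
            seen[url] = digest
            changed[url] = text
    return changed


def keyword_source_counts(pages):
    """Count how many distinct pages mention each non-stop-word keyword."""
    counts = Counter()
    for text in pages.values():
        # Use a set so a page counts once per keyword, however often it repeats it.
        words = set(re.findall(r"[a-z']+", text.lower()))
        counts.update(words - STOP_WORDS)
    return counts
```

Keywords that suddenly appear across many sources in the same two-minute window would be candidates for "the news" in this simplified picture.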
Sitting right behind those app front ends back in the InkaBinka datacenter in Marina Del Rey, California is a set of Nginx Web servers and load balancers. Nginx has quickly become the Web server of choice for the hyperscale elite, including Facebook, Yandex, Netflix, Hulu, Box, Dropbox, Groupon, WordPress, and a slew of others. Over 146 million Web sites are powered by Nginx, which is open source with for-fee extensions and support, and the upstart Web server has surpassed the open source Apache Web server recently. Nginx bypassed Microsoft’s IIS Web server in terms of site count back in 2012.
The natural language processing algorithms at InkaBinka that cull through news articles to pull out the key ideas are written for Windows servers in C#. Many of these algorithms are homegrown, but InkaBinka is making use of AlchemyAPI, a text mining and semantic analysis suite, to chew through the news to find the nuggets to make bullets. The back-end data store is not SQL Server, but Couchbase.
“I have been a Microsoft SQL Server guy during my whole career, and I was setting out to build InkaBinka on SQL Server,” explains Brahmer. “I started looking for a caching layer, starting with Memcached, which is extremely popular. What I found out is that CouchDB and Memcached had a baby and it is called Couchbase. Couchbase is written in Erlang with some C components, but what makes it superfast is how it shards data across a cluster. It is beautifully designed.”
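The sharding Brahmer describes can be pictured with a deliberately simplified sketch: hash each document key to one of a fixed number of buckets (Couchbase calls them vBuckets), and keep a map from bucket to node. The bucket count of 1,024, the plain CRC32-modulo hash, and the round-robin map below are illustrative assumptions. The real client libraries and cluster manager handle the details differently.

```python
import zlib

NUM_VBUCKETS = 1024  # assumed bucket count, for illustration only


def build_vbucket_map(nodes):
    """Assign each vBucket to a node round-robin (illustrative only)."""
    return [nodes[i % len(nodes)] for i in range(NUM_VBUCKETS)]


def node_for_key(key, vbucket_map):
    """Hash the key to a vBucket, then look up which node owns that bucket."""
    vbucket = zlib.crc32(key.encode("utf-8")) % len(vbucket_map)
    return vbucket_map[vbucket]
```

Because a client can compute the owning node directly from the key, a read or write goes straight to that node with no central lookup, which is one reason this style of design scales by simply adding nodes.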
The service replicates data stored in Couchbase to an Elasticsearch search engine, which is open source and which is one of several such search tools based on the Lucene search engine. (Elasticsearch is also integrated into the Cloudera, Hortonworks, and MapR Hadoop distributions to allow for searching of data pumped into their underlying file systems.) Like Couchbase, Elasticsearch is designed to scale easily by adding nodes to a cluster; this scalability is built in rather than something that has to be cobbled together by techies. Equally important, because of the way data is sharded across the Elasticsearch cluster, the more nodes you have, the faster the queries run: each node performs a portion of each query and gives its results back to a master node.
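The query pattern described here, in which every node searches its own slice of the data and one node merges the partial results, is often called scatter-gather. The following is a toy sketch of that pattern only, with in-memory dictionaries standing in for shards and a naive term count standing in for relevance scoring. It does not use or imitate the Elasticsearch API, and all names are hypothetical.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def search_shard(shard, term):
    """Score each document in one shard by how often the term appears."""
    hits = []
    for doc_id, text in shard.items():
        score = text.lower().split().count(term.lower())
        if score:
            hits.append((score, doc_id))
    return hits


def scatter_gather(shards, term, top_n=3):
    """Query every shard in parallel, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        partials = list(pool.map(lambda s: search_shard(s, term), shards))
    merged = [hit for part in partials for hit in part]
    # The merging ("master") step keeps only the best-scoring hits overall.
    return heapq.nlargest(top_n, merged)
```

With more shards, each `search_shard` call scans less data, which is the intuition behind queries getting faster as nodes are added.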
Couchbase is good for moving data in and out of a system, and while it has some search functionality and some MapReduce capability, Elasticsearch, says Brahmer, is superior for really fast searching. The integration between Couchbase and Elasticsearch was done by InkaBinka’s own coders. The company’s own crawlers grab news stories, dump them into Elasticsearch for indexing, and then pass off the results to the natural language processing stack to be chewed on. The images that are used are pulled from Google image searches.
The algorithms that InkaBinka has come up with sort through this mountain of text (generated by untold numbers of journalists and bloggers), try to reckon what the news is at any given moment, and also take a stab at generating the four bullet points for a particular story.
“The difficulty here is coming up with those four bullets,” says McGushion. “Every writer has a different style of writing, but there are obviously patterns that we have noticed and there are a number of ways of identifying facts in a story. We have also noticed a basic writing structure for stories, and once we have those facts, we can do a quick Google search for an image. Depending on the writing style, and our algorithms are constantly evolving, we are about 50 to 70 percent accurate in what we create. We could turn it loose at that point and say that’s good enough, but because people read the news for investment information and medical information, we put eyes on it and an editor determines if the bullets are true until we can refine the tool a little bit more. But it is pretty surprising that this is automatable.”
Any time InkaBinka was selecting a piece of software, Brahmer asked the question: Who else is using the software and do they scale a lot further than InkaBinka will in its early years? As it turns out, online game maker Zynga chose Couchbase as the backend for Farmville, and that was enough of an endorsement given all of the other technical attributes of Couchbase. InkaBinka looked at MongoDB, which is fast and which also scales well, but in the end decided that Couchbase was better suited for its applications.
Interestingly, like Zynga, InkaBinka has also decided to go with its own datacenters and systems instead of using the public cloud. And there are good reasons for that: InkaBinka does not want to deal with the unpredictable latencies of the public cloud, and it has tried a number of them as the service was being developed.
Like other young startups, InkaBinka started out building its service on Amazon Web Services. Then Brahmer and McGushion started showing off the service to a number of companies, and got the attention of Microsoft, which had a special program for startups called BizSpark Plus that gave InkaBinka access to the Azure cloud for free for up to three years. So InkaBinka jumped from AWS to Azure. After a few months, McGushion had a chance to demo the InkaBinka service to some higher-ups at Hewlett-Packard, and HP suggested that InkaBinka move to the HP Cloud (which it did last October) and also that it take a look at the Moonshot hyperscale servers for running the service.
“Up until that point, we had no interest or desire – we did not even think it was feasible – to run on our own hardware because of the funding issue,” says Brahmer. But HP sent out a team of techies, including Dwight Barron, who is the architect of the Moonshot system, and InkaBinka saw a good fit. “The scale model for our stack is completely horizontal. If I need more database, I just add more nodes. If I need more search, I just add more nodes. If I need more of Nginx or anything else, the answer is always add a few more to it. Moonshot allows for InkaBinka to scale exactly the way the software was designed to.”
So the obvious question is, why can’t InkaBinka do that on the HP Cloud, Microsoft Azure, or Amazon Web Services? As it turns out, as InkaBinka was ramping up with an increasing number of beta users, it could just fire up new virtual machines to scale out its workloads. But there were latency issues between the nodes. “It became a performance issue,” says Brahmer. “A number of parts of our system, but Couchbase in particular, is extremely intolerant of any drops in latency or any kind of spikes. Couchbase really wants communication that is really fast. Clouds can’t guarantee that.”
At the time InkaBinka moved to the HP Cloud last fall, the m300 server cartridges for the Moonshot 1500 enclosure were not yet shipping. These are the server nodes that InkaBinka decided it wanted to run its code on, and they have a single “Avoton” eight-core Atom C2000 processor on each node, in this case with 32 GB of main memory.
The “Gemini” Moonshot 1500 enclosure can have up to 45 cartridges in its 4.3U of space. The server cartridges snap in from the top, as do two Ethernet switch modules for linking server nodes in the enclosure to the outside world. The backplane in the Moonshot chassis allows for the server cartridges to be linked to each other in a 2D torus topology without the need for an internal switch. This backplane is also used to link server cartridges to storage cartridges, which come with disk or flash drives. The chassis has 7.2 Tb/sec of bandwidth, which is plenty for internode connections as well as links to storage. This 2D torus is used to link three nodes in a north-south configuration (like an n-tier application) or fifteen nodes in an east-west configuration (like a more traditional parallel cluster or cloud).
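For readers unfamiliar with torus networks, the sketch below shows how neighbors are found in a 2D torus, where each row and each column wraps around on itself. It assumes, purely for illustration, that the 45 cartridge slots are arranged as a 3 x 15 grid numbered row by row, which is one reading of the three-node and fifteen-node groupings mentioned above. The actual wiring of the Moonshot backplane may differ.

```python
ROWS, COLS = 3, 15  # assumed layout: 3 x 15 = 45 cartridge slots


def torus_neighbors(slot):
    """Return the north, south, west and east neighbors of a slot index."""
    if not 0 <= slot < ROWS * COLS:
        raise ValueError("slot out of range")
    row, col = divmod(slot, COLS)
    return {
        # Modulo arithmetic provides the wraparound that makes a grid a torus.
        "north": ((row - 1) % ROWS) * COLS + col,
        "south": ((row + 1) % ROWS) * COLS + col,
        "west": row * COLS + (col - 1) % COLS,
        "east": row * COLS + (col + 1) % COLS,
    }
```

In this model every cartridge has exactly four direct neighbors, so nodes can talk to adjacent nodes without passing through a central switch.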
InkaBinka got rolling with a 15-node starter kit, which had server nodes and some spinning disk. As the site was getting close to launch, InkaBinka bought a fully loaded Moonshot 1500 enclosure with 45 of the m300 cartridges.
“I have my own cloud, and I am in complete control and I am not subject to any overhead from anyone else,” Brahmer says. That may sound old school in this cloudy era, particularly with companies like Netflix dumping their own IT for AWS, but again, it is not like Google and Yahoo and Facebook and Zynga – and Amazon for that matter – don’t have their own datacenters with their own machines aligned to their own code. You don’t have to be a big company to want to do your own IT, but if you plan to be a hyperscale service, as InkaBinka plans, then maybe it is best to learn how to do that early rather than later.
So how many users can this initial infrastructure support? Brahmer tells his partner McGushion and EnterpriseTech he has no idea, really. “It is very hard to measure, once you get to certain multiples of ten. With ten users or a hundred users, that’s nothing. But the system starts to behave differently under certain kinds of load, and sure we can run load tests, but those only give you an idea. We have not projected what 45 nodes with SSDs can do, but it is a ton of users. I have to think that this will support at least a million users.”
That is not a million concurrent users, of course, but a million users in its database and probably something on the order of many tens of thousands of concurrent users. Now, what about extrapolating up to the size of something like Facebook, just for fun? The social network has 1.28 billion users at the moment, and if InkaBinka had to scale up to that size, it would take around 1,000 Moonshot enclosures, loaded up with SSDs and Avotons, to handle the load. (This all assumes linear scaling, of course.) This is what HP and InkaBinka alike hope will happen. Time will tell.
In setting up its own datacenter, InkaBinka discovered that the network pipe coming into the facility was not even big enough to saturate one Moonshot enclosure, so it upgraded to a 10 Gb/sec trunk that is wired directly into the enclosure. The software that serves up the pages at InkaBinka is lightweight, and the images are served up through the content delivery service on the HP Cloud (now called Helion). So the whole setup is meant to be lean and mean. In the event that there is a spike in traffic, InkaBinka can burst out to Helion.