Managing 30B Bid Requests, 1.5B Users per Day in (near) Real Time
At the beating heart of the mobile advertising industry is what Ellen DeGeneres calls “our ADD culture.” It’s predicated on delivering up-to-date content in real time while people are doing, or trying to do, something else: checking a stock quote, reading a news headline, looking at Facebook messages. Mobile ad campaigns must be fast, responsive and personalized. Lag time means lost opportunities, dollars and customers.
Manage, a Mountain View company founded in 2011, is in what’s called “programmatic mobile marketing and advertising” and helps internet companies like Uber, Wish, and Amazon manage global mobile marketing campaigns by buying real-time programmatic inventory to drive users to mobile applications. The company reaches more than 1.5 billion users across hundreds of thousands of mobile apps, driving clicks and installs while also analyzing data and optimizing campaigns toward “post-install engagement,” such as purchases, registrations or moves played within a game.
“We deal with a lot of data,” Kai Sung, Manage’s CTO and co-founder, told EnterpriseTech, noting that Manage generates a terabyte of data per day. “We process more than 30 billion bid requests daily to figure out which impressions we want to buy on behalf of our advertisers. And we need to build machine learning models to predict probability of people clicking on the ad, installing the app and converting into some downstream event, like a purchase or a first ride or some registration event.”
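To put those daily totals in perspective, the figures Sung cites can be converted into average per-second rates. The sketch below is back-of-envelope arithmetic only: it assumes traffic is spread evenly over the day (real bid traffic is likely burstier) and treats a terabyte as 10^12 bytes, and the variable names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# Figures quoted in the article; "more than 30 billion" is taken as a floor.
bid_requests_per_day = 30_000_000_000
# Assumption: one terabyte is counted as 10**12 bytes.
log_bytes_per_day = 10**12

# Averages only; peak rates are not stated in the article.
avg_requests_per_second = bid_requests_per_day / SECONDS_PER_DAY
avg_log_bytes_per_second = log_bytes_per_day / SECONDS_PER_DAY

print(f"{avg_requests_per_second:,.0f} bid requests per second on average")
print(f"{avg_log_bytes_per_second / 1e6:.1f} MB of log data per second on average")
```

On these assumptions the platform evaluates roughly 350,000 bid requests every second and writes a little under 12 MB of log data per second, which gives a sense of why a two-hour reporting delay was a problem.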
It’s a world where quick insight is everything: real time, or as close thereto as possible, along with data analytics for uncovering value and ad campaign efficiencies. End users of the Manage platform use a dashboard to visualize data and track engagement and impressions on mobile ads.
Manage started off several years ago using MySQL, but ran into scaling issues as data volume grew. So the company moved to Hadoop to power its underlying statistics pipeline, combined with Hive for data deduplication, summarization, query and analysis, and Kafka for real-time data feeds. But data took a glacial two hours to reach the Manage dashboard.
“We noticed Hive was slow, even though it could handle the amount of traffic we deal with,” Sung said. “Our pipeline was delayed for a couple of hours. We needed something that could provide us more fresh data for reporting and also for our analytics team to run ad hoc queries.”
He also said that Hadoop “is kind of a beast for us, especially with a small engineering team, there are a lot more components involved,” noting that Manage has about 30 employees.
So the company began a search for a faster database platform better suited to its time-sensitive operational requirements. An engineer on the Manage staff suggested the company look at MemSQL, which positions itself in the streaming analytics space as a high-performance, in-memory database that combines the scalability of distributed systems with the familiarity of SQL.
“There’s a free version we could download and start prototyping on our own,” Sung said. Soon Manage and MemSQL were working together on the company’s requirements, and within six months Manage had signed a contract and was in production.
According to Sung, the biggest impact of MemSQL has been to reduce the delay in the “freshness of our data” from two hours to 10 to 15 minutes. “From the time we generate a bid to the time we’re able to see it in our reports and run analytics on it is much faster, so we can react to changes in marketplace faster,” he said.
Manage logs 1TB of data daily, including bid responses, impressions, clicks, installs and events across advertising campaigns. The team uses MemSQL Streamliner, an Apache Spark solution, to first stream log data from Apache Kafka, then store it in the MemSQL columnstore for further processing. As new data arrives, the pipeline de-duplicates the data and aggregates it into various summary tables within MemSQL.
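The de-duplicate-then-aggregate step can be illustrated with a small in-memory sketch. This is not Manage's code and it does not call Kafka, Spark or MemSQL; the event fields (`event_id`, `campaign_id`, `event_type`) and the function name `ingest_batch` are hypothetical, chosen only to show how repeated log records can be dropped while new ones are folded into a running summary.

```python
from collections import Counter


def ingest_batch(events, seen_ids, summary):
    """Fold one batch of log events into a running summary, skipping duplicates.

    events   -- iterable of dicts with hypothetical keys
                'event_id', 'campaign_id' and 'event_type'
    seen_ids -- set of event ids already processed (updated in place)
    summary  -- Counter keyed by (campaign_id, event_type) (updated in place)
    """
    for event in events:
        event_id = event["event_id"]
        if event_id in seen_ids:
            continue  # the same log record was delivered more than once
        seen_ids.add(event_id)
        summary[(event["campaign_id"], event["event_type"])] += 1
    return summary


seen_ids = set()
summary = Counter()

# Two small batches standing in for data arriving from a stream.
batch_1 = [
    {"event_id": "e1", "campaign_id": "c1", "event_type": "impression"},
    {"event_id": "e2", "campaign_id": "c1", "event_type": "click"},
    {"event_id": "e1", "campaign_id": "c1", "event_type": "impression"},  # duplicate
]
batch_2 = [
    {"event_id": "e2", "campaign_id": "c1", "event_type": "click"},  # duplicate
    {"event_id": "e3", "campaign_id": "c2", "event_type": "install"},
]

ingest_batch(batch_1, seen_ids, summary)
ingest_batch(batch_2, seen_ids, summary)

for (campaign_id, event_type), count in sorted(summary.items()):
    print(campaign_id, event_type, count)
```

In a production pipeline the set of seen ids and the summary counts would live in database tables rather than in process memory, but the shape of the work is the same: each incoming batch updates the summaries incrementally instead of recomputing them from the full log.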
Along with fresher data in the statistics pipeline, MemSQL enables Manage’s engineering team to summarize raw event data in various MemSQL table dimensions to create personalized campaign performance reports for customers and query data directly through the Manage Reporting API, a capability that was previously not available. “Because of the way MemSQL stores data in columnar formats it allows our analytics team to run queries slicing the data into different dimensions. Those queries used to have a lot longer latency using Hive and MySQL, so it works well for our workload.
“We’ve built a highly scalable, real-time data pipeline that ingests and summarizes data as fast as we produce it,” said Sung. “Our analytics team is able to run ad-hoc queries on log-level data within seconds.” He added that MemSQL is able “to ingest data and leverage memory as well as disk, so a big part of it is their ability to store in columnar format where it compresses the data a lot better. We don’t have to iterate as many summary tables because we can just keep it as one table with a lot of columns.”
He also said adding and removing nodes is relatively straightforward, and that because MemSQL is compatible with MySQL, “we only have to make minimal changes to our applications because it just uses straight SQL.”
Sung said he views MemSQL “as a potential replacement for what we currently do on Hadoop. It’s less infrastructure and you have the capability to ingest the data, you can store it scalably, and they also package in Spark processing. We haven’t quite moved some of our Spark jobs onto MemSQL, but that’s something we’d like to do next, migrate our machine learning jobs to run on top of the Spark instance for MemSQL.”