
Microservices for Big Data Environments 


With all the deep discussions around microservices these days, it can seem like there’s a lot to consider before embracing a microservices approach to your enterprise data architecture. You are probably concerned about complex frameworks, deep monitoring requirements, extensive integrations between new compute engines, and other potential complexities.

But actually, using microservices is a natural step in how you build and deploy technology solutions, particularly when dealing with big data. After all, microservices encapsulate many concepts that we already know, such as componentization, modularity, reusability, service-oriented architectures and enterprise service buses. The difference today is that real-time big data processing has created a clearer context in which microservices are a logical paradigm.

Think about how we deal with big data, which is typically processed as well-defined pipelines in which transformations are performed in sequential order. Those pipelines can branch in different directions so you can process your data in separate, parallel ways. This pipeline paradigm aligns well with microservices architectures.

In addition, real-time processing in big data initially drove the emergence of modern publish-subscribe technologies, such as Apache Kafka. These technologies make up the foundation for delivering data through pipelines as “streams,” which are simply ordered lists of data records. Interestingly, they do not necessarily have to involve real-time processing, as event streams are relevant for many types of data environments. This means that nearly any data architecture should make use of streams. And in fact, a stream-oriented architecture is typically made up of microservices.
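To make the stream idea concrete, here is a rough sketch of what reading a stream looks like with Apache Kafka’s Python client (kafka-python). The broker address and the topic name "trades.raw" are placeholders for this article, not part of any particular product:

    # Read an ordered stream of records from a publish-subscribe system.
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        "trades.raw",                        # hypothetical topic of raw records
        bootstrap_servers="localhost:9092",
        group_id="stream-reader",
        auto_offset_reset="earliest",
    )

    # Records arrive in order within each partition; payloads are opaque bytes.
    for message in consumer:
        print(message.offset, message.value)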

Microservices-based Deployment


A good example of an environment that benefits from a microservices architecture is a stock trade analysis system. This is a prototypical high-performance big data analytics environment. Some key characteristics include: 1) a very high rate of data ingestion, well into thousands of transactions per second, 2) a set of well-defined, parameterized queries, and 3) numerous outputs/destinations for the data, including those that support queries you have not yet defined. These characteristics are nicely addressed with microservices.

First, perhaps the most obvious advantage in this example is that microservices enable simplified data ingestion, which is especially important in high-velocity environments. This entails a simple service that does little more than insert data points into a publish-subscribe system. This service does not need to understand the record payload, so adding new data sources is easy and requires minimal code modifications. Also, a key advantage here is the ability to scale out to handle higher loads. Rather than have a single application handle all ingestion into a single stream, multiple services can be run in parallel to load records into parallel streams, thus spreading the load across multiple hardware servers.
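As a rough sketch, such an ingestion service can be little more than the following, again using the kafka-python client; the topic name, broker address, and the ingest() entry point are illustrative assumptions:

    # Forward raw payloads into a stream without interpreting them.
    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers="localhost:9092")

    def ingest(payload: bytes, source_id: str) -> None:
        # The payload stays opaque, so new data sources need no parsing here.
        # Keying by source spreads records across partitions (and servers),
        # and more instances of this service can run in parallel for higher load.
        producer.send("trades.raw", key=source_id.encode("utf-8"), value=payload)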

Second, if you know some of the specific queries you will run, you can model the data to allow faster lookups. In this example, you might have bid/ask data from any number of senders that are directed to specific receivers. You might ask questions such as, “What bids/asks did sender X send in the past 5 minutes?” and “What bids/asks did receiver Y receive in the past 60 seconds?”

To support such queries at high speed, a microservice can read data from the publish-subscribe system, parse out the senders and receivers, and write the record into sender-specific and receiver-specific streams. This essentially load balances the input data across many streams, and thus many hardware servers, to enable faster querying. Ideally, this microservice will write the new records in a standardized format, so that if it needs to be updated to handle new incoming data formats, the downstream microservices do not have to be updated. As additional data formats are loaded into the system, a new microservice can be added to the system to do the work without affecting other parts of the system.
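Here is a rough sketch of that routing step, assuming JSON payloads with "sender" and "receiver" fields; the field and topic names are illustrative rather than a fixed schema:

    # Parse raw records and rewrite them, in one standard format, into
    # sender-specific and receiver-specific streams.
    import json
    from kafka import KafkaConsumer, KafkaProducer

    consumer = KafkaConsumer("trades.raw", bootstrap_servers="localhost:9092",
                             group_id="router")
    producer = KafkaProducer(bootstrap_servers="localhost:9092",
                             value_serializer=lambda r: json.dumps(r).encode("utf-8"))

    for message in consumer:
        record = json.loads(message.value)      # incoming formats may vary
        standardized = {                        # one shape for all downstream services
            "sender": record["sender"],
            "receiver": record["receiver"],
            "type": record.get("type", "bid"),
            "price": record.get("price"),
            "ts": message.timestamp,
        }
        # Per-sender and per-receiver streams spread the load across servers
        # and make time-windowed lookups by sender or receiver fast.
        producer.send("by-sender." + standardized["sender"], standardized)
        producer.send("by-receiver." + standardized["receiver"], standardized)
        # A consolidated copy for archival and batch consumers (used below).
        producer.send("trades.standardized", standardized)

A query such as “what did sender X send in the past 5 minutes?” then only needs to scan the recent tail of one sender-specific stream rather than the entire data set.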

Third, a set of microservices can be run to handle other analytics requirements. For example, if you want a historical view of all bids/asks, you can have microservices write the data to a database for batch processing. Since speed is no longer a key requirement in this task, all records can be saved in a standard format that is ideal for large-scale querying, especially with SQL.
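A rough sketch of that archival service follows; sqlite3 is used only to keep the example self-contained, and any SQL-capable store would serve the same role:

    # Persist every standardized record for large-scale, SQL-friendly querying.
    import json
    import sqlite3
    from kafka import KafkaConsumer

    db = sqlite3.connect("trades.db")
    db.execute("""CREATE TABLE IF NOT EXISTS trades
                  (sender TEXT, receiver TEXT, type TEXT, price REAL, ts INTEGER)""")

    consumer = KafkaConsumer("trades.standardized", bootstrap_servers="localhost:9092",
                             group_id="archiver",
                             value_deserializer=lambda v: json.loads(v))

    for message in consumer:
        r = message.value
        db.execute("INSERT INTO trades VALUES (?, ?, ?, ?, ?)",
                   (r.get("sender"), r.get("receiver"), r.get("type"),
                    r.get("price"), r.get("ts")))
        db.commit()   # speed is not critical here, so per-record commits are acceptable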

Of course, this is only the start. More capabilities around advanced computations could be added relatively easily. Suppose you want to analyze the stream to understand patterns or detect anomalies, and then send out alerts in real time. Or what if you wanted to test various machine learning algorithms in parallel, and measure the effectiveness of each on the same exact data? With this architecture, you simply add new services onto the pipeline so that all the other systems continue to run while you handle additional processing.
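As a rough sketch of how such an addition stays isolated, the service below joins the pipeline as its own consumer group and flags price outliers with a simple rolling statistic; the threshold, window size, and topic names are illustrative choices, not prescribed ones:

    # A new, independent service: re-read the standardized stream and emit alerts.
    import json
    from collections import deque
    from statistics import mean, stdev
    from kafka import KafkaConsumer, KafkaProducer

    consumer = KafkaConsumer("trades.standardized", bootstrap_servers="localhost:9092",
                             group_id="anomaly-detector",   # separate group: existing services are untouched
                             value_deserializer=lambda v: json.loads(v))
    producer = KafkaProducer(bootstrap_servers="localhost:9092",
                             value_serializer=lambda r: json.dumps(r).encode("utf-8"))

    window = deque(maxlen=500)               # recent prices
    for message in consumer:
        price = message.value.get("price")
        if price is None:
            continue
        if len(window) > 30:
            mu, sigma = mean(window), stdev(window)
            if sigma > 0 and abs(price - mu) > 4 * sigma:
                producer.send("alerts.anomalies",
                              {"price": price, "mean": mu, "stdev": sigma})
        window.append(price)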

It’s understandable if you are concerned about the different components of a new application architecture. But you’ll find that as you think in terms of big data, the evolution will be straightforward. Frankly, the new data environments we face today are pushing us toward microservices anyway. We know that the modularity and reusability of microservices are beneficial, and we also get the agility and adaptability we need to get the most out of big data. We have to keep adjusting to the changes we see in business and in our data, and preparing for those changes with microservices is a great way to move forward with your data architecture.

Dale Kim is senior director, Industry Solutions, MapR.
