- As companies become more automated, and their business processes become more automated, we end up with many applications talking to one another. This is a humongous shift in system design, as it’s no longer about helping human users: the work is done in a fully automated fashion by machines.
- In traditional databases, data is passive and queries are active: the data passively waits for something to run a query. In a stream processor, data is active and the query is passive: the trigger is the data itself. The interaction model is fundamentally different.
- In modern applications, we need the ability to query passive data sets and get answers to users’ actions, but we also need active interaction, where data is pushed as an event stream to different subscribing services.
- We need to rethink what a database is, what it means to us, and how we interact with both the data it contains and the event streams that connect it all together.
At QCon London in March 2020, I gave a talk on why both stream processors and databases remain necessary from a technical standpoint and explored industry trends that make consolidation likely in the future. These trends map onto common approaches from active databases like MongoDB to streaming solutions like Flink, Kafka Streams, or ksqlDB.
I work at Confluent, the company founded by the creators of Apache Kafka. These days, I work in the Office of the CTO. One of the things we did last year was to look closely at the differences between stream processors and databases. This led to a new product called ksqlDB. With the rise in popularity of event-streaming systems and their obvious relationship to databases, it’s useful to compare how the different models handle data that’s moving versus data that is stationary. Maybe more importantly, there is clear consolidation happening between these fields. Databases are becoming increasingly active, emitting events as data is written, and stream processors are increasingly passive, providing historical queries over datasets they’ve accumulated.
You might think this kind of consolidation at a technical level is an intellectual curiosity, but if you step back a little it really points to a more fundamental shift in the way that we build software. Marc Andreessen, now a venture capitalist in Silicon Valley, has an excellent way of putting this: “software is eating the world”. Investing in software companies makes sense simply because both individuals and companies consume more software over time. We buy more applications for our phones, we buy more internet services, companies buy more business software, etc. This software makes our world more efficient. That’s the trend.
But this idea that we as an industry merely consume more software is a shallow way to think about it. A more profound perspective is that our businesses effectively become more automated. It’s less about buying more software and more about using software to automate our business processes, so the whole thing works on autopilot — a company is thus “becoming software”, in a sense. Think Netflix: their business started with sending DVDs via postal mail to customers, who returned the physical media via post. By 2020, Netflix had become a pure software platform that allows customers to watch any movie immediately with the click of a button. Let’s look at another example of what it means for businesses to “become software”.
Think about something like a loan-processing application for a mortgage someone might get for their home. This is a business process that hasn’t changed for a hundred years. There’s a credit officer, there’s a risk officer, and there’s a loan officer. Each has a particular role in this process, and companies write or purchase software that makes the process more efficient. The purchased software helps the credit officer do her job better, or helps the risk officer do his job better. That’s the weak form of this analogy: we buy more software to help these people do a better job.
These days, modern digital companies don’t build software to help people do a better job: they build software that takes humans completely out of the critical path. Continuing the example, today we can get a loan application approved in only a few seconds (a traditional mortgage will take maybe a couple of weeks, because it follows that older manual process). This uses software, but it’s using software in a different way. Many different pieces of software are chained together into a single, fully automated process. Software talks to other software, and the resulting loan is approved in a few seconds.
So, software makes our businesses more efficient, but think about what this means for the architecture of those systems. In the old world, we might build an application that helps a credit officer do her job better, say, using a three-tier architecture. The credit officer talks to a user interface, then there’s some kind of back-end server application running behind that, and a database, and that all helps the person do risk analysis, or whatever it might be, more efficiently.
As companies become more automated, and their business processes become more automated, we end up with many applications talking to one another: software talking to software. This is a humongous shift in system design as it’s no longer about helping human users, it’s about doing the work in a fully automated fashion.
From Monoliths to Event-Driven Microservices: the evolution of software architecture
We see the same thing in the evolution of software architecture: monolith to distributed monolith to microservices (see Figure 1 below). There is something special about event-driven microservices, though. Event-driven microservices don’t talk directly to the user interface. They automate business processes rather than responding to users clicking buttons and expecting things to happen. So in these architectures there is a user-centric side and a software-centric side. As architectures evolve from left to right in Figure 1, the “user” of one piece of software is more likely to be another piece of software, rather than a human — so the evolution of software architecture also correlates with Marc Andreessen’s observation.
Figure 1: The evolution of software systems.
Modern architectures and the consequences of traditional databases
We’ve all used databases. We write a query, and send it to a server somewhere with lots of data on it. The database answers our questions. There’s no way we’d ever be able to answer such data-centric questions by ourselves. There’s simply too much data for our brains to parse. But with a database, we simply send a question and we can get an answer. It’s wonderful. It’s powerful.
The breadth of database systems available today is staggering. Something like Cassandra lets us store a huge amount of data for the amount of memory the database is allocated; Elasticsearch is different, providing a rich, interactive query model; Neo4j lets us query the relationship between entities, not just the entities themselves; things like Oracle or PostgreSQL are workhorse databases that can morph to different types of use case. Each of these platforms has slightly different capabilities that make it more appropriate to a certain use case, but at a high level, they’re all similar. In all cases, we ask a question and wait for an answer.
This hints at an important assumption all databases make: data is passive. It sits there in the database, waiting for us to do something. This makes a lot of sense: the database, as a piece of software, is a tool designed to help us humans — whether it’s you or me, a credit officer, or whoever — interact with data.
But if there’s no user interface waiting, if there’s no one clicking buttons and expecting things to happen, does it have to be synchronous? In a world where software is increasingly talking to other software, the answer is: probably not.
One alternative is to use event streams. Stream processors allow us to manipulate event streams in much the same way that databases manipulate data held in files. That’s to say: stream processors are built for active data, data that is in motion, and they’re built for asynchronicity. But anyone who has used a stream processor probably recognizes that it doesn’t feel much like a traditional database.
In a traditional database interaction, the query is active and the data is passive. Clicking a button and running the query is what makes things happen. The data passively waits for us to run that query. If we’re off making a cup of tea, the database doesn’t do anything. We have to initiate that action ourselves.
Figure 2: The database versus stream processing.
In a stream processor, it’s the other way around. The query is passive. It sits there running, just waiting for events to arrive. The trigger isn’t someone clicking a button and running the query, it’s the data itself — an event emitted by some other system or whatever it might be. The interaction model is fundamentally different. So what if we combined them together?
Event Streams are the key to solving the data issues of modern distributed architectures
With that in mind, I want to dig a little deeper into some of the fundamental data structures used by stream processors, and compare how those relate to databases. Probably the most fundamental relationship here is the one between events, streams and tables.
Events are the building block and, conceptually, they are a simple idea: a simple recording of something that happened in the world at a particular point in time. So, an event could be somebody placing an order, paying for a pair of trousers, or moving a rook in a chess game. It could be the position of your phone or another kind of continuous event stream.
Individual events combine to form streams, and here the picture changes further. An event stream represents how some variable changes over time: the position of your phone as you drive, the progress of your order, or the moves in a game of chess. All are exact recordings of what changed in the world. By comparison, a database table represents a snapshot of the current state at a single point in time. You have no idea what happened previously!
In fact, a stream closely relates to the idea of event sourcing, a programming paradigm that stores data in a database as events, with the objective of retaining all of this information. If we want to derive our current state, i.e. a table, we can do so by replaying the event streams.
Figure 3: We can think of chess as a sequence of events or as records of positions.
Chess is a good analogy for this. Think of a database table as describing the position of each piece. The position of each of those pieces tells me the current state of a game. I can store that somewhere and reload it if I want to. But that’s not the only option: representing a chess game as events gives quite a different result. We store the sequence of moves, starting from the well-known opening position that all chess games share. The current position of the board at any point in the game can then be derived by applying all subsequent moves to the opening position.
Note that the event-based approach contains more information about what actually happened. Not only do we know the positions at one point in the game, we also know how the game unfolded. We can, for example, determine whether we arrived at a position on the board through a brilliant move of one player versus a terrible blunder of the opponent.
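The chess example above is event sourcing in miniature, and can be sketched in a few lines of Python. This is a simplified illustration — the move format and piece names are invented for brevity, not taken from any real event-sourcing library:

```python
# Event sourcing in miniature: derive current state by replaying events.
# The move format (piece, from_square, to_square) is a simplification.

def replay(initial_positions, moves):
    """Fold a stream of move events over the initial state to get the
    current board: a snapshot (table) derived from the event stream."""
    board = dict(initial_positions)
    for piece, src, dst in moves:
        assert board.get(src) == piece, f"{piece} is not on {src}"
        del board[src]          # the piece leaves its old square...
        board[dst] = piece      # ...and occupies the new one
    return board

# A fragment of the opening position (two white pieces, for brevity).
opening = {"e2": "P", "g1": "N"}

# The event stream: every move that happened, in order.
moves = [("P", "e2", "e4"), ("N", "g1", "f3")]

current = replay(opening, moves)
# The table (current positions) is derivable from the stream,
# but the stream also records *how* we got here.
```

The snapshot is always recoverable from the events, while the reverse is not true — which is exactly the extra information the event-based approach retains.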
Now we can think of an event stream as a special type of table. It’s a particular type of table that doesn’t exist in most databases. It’s immutable. We can’t change it. And it’s append-only — we can only insert new records to it. By contrast, traditional database tables let us update and delete as well as insert. They’re mutable.
If I write a value to a stream, it is going to live there forever. I can’t go back and change some arbitrary event, so a stream is more like an accounting ledger with double-entry bookkeeping: everything that ever happened is recorded in the ledger. When we think about data in motion, there is another good reason for taking this approach. Events are typically published to other parts of the system or the company, so by the time you want to change an event, it might already have been consumed by somebody else. So, unlike in a database, where you can simply update the row in question, when data is moving you need to create a compensating event that propagates the change to all listeners. All in all, this makes events a far better way to represent data in a distributed architecture.
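The ledger analogy can be made concrete with a small sketch. Here the log is a plain Python list treated as append-only, and the account/amount events are invented for illustration — a real system would publish to something like a Kafka topic instead:

```python
# An append-only log: corrections are new events, not in-place updates.
# The account/amount event shape is illustrative only.

log = []  # the immutable, append-only event stream

def append(event):
    log.append(event)  # insert-only: no update, no delete

def balance(account):
    """Readers derive state by folding over the full log, so a
    compensating event reaches every consumer naturally."""
    return sum(amount for acct, amount in log if acct == account)

append(("alice", 100))   # the original (wrong) payment event
append(("alice", -100))  # compensating event: cancel it out
append(("alice", 42))    # the corrected value
```

Because the correction is itself an event, any downstream consumer that already saw the original payment will also see the compensation, and everyone converges on the same answer.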
Streams and Tables: two sides of the same coin?
So events provide a different type of data model — but counterintuitively, internally, a stream processor actually uses tables, much like a database. These internal tables are analogous to the temporary tables traditional databases use to hold intermediate results, but unlike temporary tables in a database, they are not discarded when the query is done. Because streaming queries run indefinitely, there is no reason to discard these tables.
Figure 4: Stream processors use tables internally to hold intermediary results, but data in and out is all represented by streams.
Figure 4 depicts how a stream processor might conduct credit scoring. There’s a payment stream for accounts on the left. A credit-scoring function computes each user’s credit score and keeps the result in a table. This table, which is internal to the stream processor, is much smaller than the input stream. The stream processor continues to listen to the payments coming in, updating the credit scores in the internal table as it does so, and outputting the results via another event stream. So, the thing to note is, when we use the stream processor, we create tables, we just don’t really show them to anybody. It’s purely an implementation detail. All the interaction is via the event streams.
This leads to what is known as the stream/table duality, where the stream represents history: every single event, or state change, that has happened to the system. This can be collapsed into a table via some function, be it a simple “group by key” or a complex machine learning routine. That table can be turned back into an event stream by “listening” to it. Note that, most of the time, the output stream won’t be the same as the input, as the processing is usually lossy, but if we keep the original input stream in Kafka we can always rewind and regenerate.
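Both directions of the duality fit in one small sketch: a stream of payments is folded into a table via a “group by key” sum, and every update to that table is emitted as a changelog stream. The payment events here are made up for illustration:

```python
# Stream/table duality: a stream folds into a table, and every table
# update can be emitted back out as a changelog stream.

def to_table_and_changelog(stream):
    table = {}       # current value per key: the "materialized view"
    changelog = []   # the stream of state changes the table produces
    for key, value in stream:
        table[key] = table.get(key, 0) + value  # simple "group by key" sum
        changelog.append((key, table[key]))     # "listening" to the table
    return table, changelog

payments = [("bob", 10), ("ann", 5), ("bob", 7)]
table, changelog = to_table_and_changelog(payments)
# table     == {"bob": 17, "ann": 5}
# changelog == [("bob", 10), ("ann", 5), ("bob", 17)]
```

Note that the changelog is not the same as the input stream — the sum is lossy, just as described above — but replaying the original input always regenerates both the table and the changelog.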
So, we have this kind of duality: streams can go to tables, tables can go back to streams. In technologies like ksqlDB these two ideas are blended together: you can do streaming transformations from stream to stream, and you can also create tables — or “materialized views” as they are sometimes called — and query those like a regular database. Some database technologies like MongoDB and RethinkDB are blending these concepts together too, but they approach the problem from the opposite direction.
How Stream Processors and Databases differ
A stream processor performs some operations just like a database does. For example, a stream-table join (see Figure 5) is algorithmically very similar to an equi-join in a database. As events arrive, the corresponding value is looked up in the table via its primary key.
Figure 5: Joining a stream with a table.
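The stream-table join reduces to a point lookup per event. A minimal sketch, with a hypothetical customers table and payment events invented for illustration:

```python
# Stream-table join: as each event arrives, look up the corresponding
# row in the table by primary key, much like an equi-join in a database.

customers = {1: "Bob", 2: "Ann"}  # the table, keyed by primary key

def stream_table_join(payments, table):
    for customer_id, amount in payments:  # events as they arrive
        name = table.get(customer_id)     # point lookup by key
        if name is not None:
            yield (name, amount)          # the enriched output event

enriched = list(stream_table_join([(1, 99), (2, 25)], customers))
# enriched == [("Bob", 99), ("Ann", 25)]
```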
However, joining two streams together using a stream-stream join is very different because we need to join events as they arrive, as depicted in Figure 6 below.
As the events move into the stream processor, they are buffered inside an index while the stream processor looks for the corresponding event in the other stream. It also uses each event’s timestamp to determine the order in which to process them, so it doesn’t matter which event arrives first. This is quite different from a traditional database.
Figure 6: Joining two streams.
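A stream-stream join can be sketched as follows. This is a deliberately simplified model — real stream processors also expire old buffer entries after a retention window, which is omitted here, and the merged-feed event shape is invented for illustration:

```python
# Stream-stream join (simplified): buffer each arriving event in an
# index for its stream, and probe the other stream's buffer for a match.
# Window retention/expiry is omitted to keep the sketch short.

def stream_stream_join(events):
    """`events` is a merged feed of (stream_name, key, timestamp, value)
    in arrival order, which may differ from timestamp order."""
    buffers = {"left": {}, "right": {}}
    out = []
    for stream, key, ts, value in events:
        other = "right" if stream == "left" else "left"
        buffers[stream].setdefault(key, []).append((ts, value))
        for other_ts, other_value in buffers[other].get(key, []):
            # Order the pair by timestamp, so arrival order doesn't matter.
            pair = sorted([(ts, value), (other_ts, other_value)])
            out.append((key, pair[0][1], pair[1][1]))
    return out

events = [
    ("right", "order1", 2, "shipped"),  # arrives first...
    ("left",  "order1", 1, "paid"),     # ...but happened earlier
]
joined = stream_stream_join(events)
# joined == [("order1", "paid", "shipped")]
```

Even though the “shipped” event arrived first, the output pairs the events in timestamp order, which is the behavior described above.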
Stream processors have other features which databases don’t have. They can correlate events in time, typically using the concept of windows. For example, we can use a time window to restrict which messages contribute to a join operation. Say we want to aggregate a stream of temperature measurements using windows that each span five minutes of time to compute a five-minute temperature average. It’s very hard to do that inside a traditional database in a realtime way.
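The five-minute temperature average described above can be sketched with a tumbling window keyed by its start time. The reading format is invented for illustration; real stream processors maintain these windows incrementally and continuously rather than over a finished list:

```python
# A five-minute tumbling window average. Each reading is
# (timestamp_seconds, temperature); the window key is the start of
# the five-minute interval the reading falls into.

def windowed_average(readings, window_seconds=300):
    windows = {}  # window start time -> (running sum, count)
    for ts, temp in readings:
        start = ts - (ts % window_seconds)   # which window this falls in
        s, n = windows.get(start, (0.0, 0))
        windows[start] = (s + temp, n + 1)
    # Because windows stay keyed by start time, a late or out-of-order
    # reading simply updates its (older) window retrospectively.
    return {start: s / n for start, (s, n) in windows.items()}

readings = [(10, 20.0), (200, 22.0), (310, 21.0)]
averages = windowed_average(readings)
# averages == {0: 21.0, 300: 21.0}
```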
A stream processor can also do more advanced correlations. For example, a session window is a difficult thing to implement in a database. A session has no defined length; it lasts for some period of time and dynamically ends after a period of inactivity. Bob is looking for trousers on our website, maybe he buys some, and then goes away — that’s a session. A session window allows us to detect and isolate that amorphous period.
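The session idea — no fixed length, closed by inactivity — can be sketched for a single user. The 30-second gap and the timestamps are illustrative choices, not defaults of any particular stream processor:

```python
# Session windows: group one user's events into sessions that end
# after a period of inactivity (the gap), so sessions have no fixed
# length.

def sessionize(timestamps, gap=30):
    """Split a user's sorted event timestamps into sessions: a new
    session starts whenever the silence since the previous event
    exceeds the inactivity gap."""
    sessions = []
    current = []
    for ts in timestamps:
        if current and ts - current[-1] > gap:
            sessions.append(current)  # the gap closed this session
            current = []
        current.append(ts)
    if current:
        sessions.append(current)
    return sessions

# Bob browses, pauses for a while, then comes back: two sessions.
sessions = sessionize([0, 10, 25, 100, 110])
# sessions == [[0, 10, 25], [100, 110]]
```

This is exactly the amorphous period a session window isolates: the boundaries are determined by the data, not by the clock.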
Another unique property of stream processors is their ability to handle late and out-of-order data. For example, they retain old windows which can be updated retrospectively if late or out of order data arrives. Traditional databases can be programmed to do similar types of queries, but not in a way that yields either accurate or performant realtime results.
The benefits of a hybrid, stream-oriented database
So by now, it should be clear that we have two very different types of query. In stream processing, we have the notion of a push query: a query that pushes data out of the system. Traditional databases have the notion of a pull query: ask a question and get an answer back. What’s interesting to me is the hybrid world that sits between them, combining both approaches. We can send a select statement and get a response. At the same time, we can also listen to changes on that table as they happen. We can have both interaction models.
Figure 7: A materialized view, defined by a stream processor, which the second application interacts with either via a query or via a stream.
So, I believe we are in a period of convergence, where we will end up with a unified model that straddles streams and tables, handles both the asynchronous and synchronous and provides users with an interaction model that is both active and passive.
There is, in fact, a unified model for describing all of this, which we can create using something familiar like SQL, albeit with some extensions — a single model that rules them all. A query can run from the start of time to now, from now until the end of time, or from the start of time to the end of time.
Figure 8: The unified interaction model.
“Earliest to now” is just what a regular database does. The query executes over all rows in the database and terminates when it has covered them all. “Now to forever” is what stream processors do today. The query starts on the next event it receives and continues running forever, recomputing itself as new data arrives and creating new output records. “Earliest to forever” is a less-well-explored combination but is a very useful one. Say we’re building a dashboard application. With “earliest to forever”, we can run a query that loads the right data into the dashboard, and then continues to keep it up to date.
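The “earliest to forever” behavior can be sketched as a generator: first replay the historical data (the pull part), then stay subscribed and keep yielding updates (the push part). The running count stands in for whatever query the dashboard runs, and the finite “live” list is a stand-in for an unbounded stream:

```python
# "Earliest to forever" as a generator: scan history first, then keep
# consuming live events, recomputing the result as each one arrives.

def earliest_to_forever(history, live):
    count = 0                  # the stand-in query: a running count
    for _event in history:     # earliest .. now: scan existing data
        count += 1
        yield count
    for _event in live:        # now .. forever: stay subscribed
        count += 1
        yield count

results = list(earliest_to_forever(["a", "b"], ["c"]))
# results == [1, 2, 3]: the dashboard is loaded, then kept up to date
```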
There is one last subtlety: are historical queries event-based (all moves in the chess game) or snapshots (the positions of the pieces now)? This is the other option a unified model must include.
Put this all together and we get a universal model that combines stream processing and databases as well as a middle ground between them: queries that we want to be kept up to date. There is work in progress to add such streaming SQL extensions to the official ANSI SQL standard, but we can try it out today in technologies like ksqlDB. For example, here is the SQL for push (“now to forever”) and pull (“earliest to now”) queries.
Figure 9: Push and pull queries in the unified interaction model.
While stream processors are becoming more database-like, the reverse is also true: databases are becoming more like stream processors. We see this in active databases like MongoDB, Couchbase, and RethinkDB. They don’t have the same primitives for processing event streams, handling asynchronicity, or handling time and temporal correlations, but they do let us create streams from tables, and, compared to tools like ksqlDB, they are better at performing pull queries, since they approach the problem from that direction.
So, whether you come at this from the stream-processing side or the database side, there is a clear drive towards a middle ground. I think we’re going to see a lot more of it.
The database: is it time for a rethink?
If you think back to where we started, Andreessen’s observation of a world being eaten by software, this suggests a world where software talks to software; where user interfaces, the software that helps you and me, are a smaller part of the whole package. We see this today in all manner of businesses across ride sharing, finance, automotive — it’s coming up everywhere.
This means two things for data. Firstly, we need data tooling that can handle both the asynchronous and the synchronous. Secondly, we also need different interaction models. Models which push results to us and chain stages of autonomous data processing together.
For the database this means the ability to query passive data sets and get answers to questions users have. But it also means active interactions that push data to different subscribing services. On one side of this evolution are active databases: MongoDB, Couchbase, RethinkDB, etc. On the other are stream processors: ksqlDB, Flink, Hazelcast Jet. Whichever path prevails, one thing is certain: we need to rethink what a database is, what it means to us, and how we interact with both the data it contains and the event streams that connect modern businesses together.
About the Author
Ben Stopford is Lead Technologist, Office of the CTO at Confluent (a company that backs Apache Kafka). He has worked on a wide range of projects from implementing the latest version of Kafka’s replication protocol to assessing and shaping Confluent’s strategy. He is the author of the book Designing Event-Driven Systems, O’Reilly, 2018.