Please pledge your support if you find this newsletter useful. I’m not planning to introduce paid-only posts anytime soon, but I’d appreciate some support from the readers. Thank you!
Recently, I’ve heard arguments like “Our solution doesn’t require Kafka” or “You can do data streaming without Kafka” from several vendors. I wanted to zoom in and understand if a streaming platform like Kafka is still needed nowadays (spoiler alert: it’s not going anywhere).
Which Kafka?
But first of all, ask people what they mean when they say Kafka: it could be Apache Kafka itself or another tool that supports the Kafka API.
Historically, Apache Kafka was perceived as a fairly complicated system that was hard to deploy and maintain. Using Kafka in a small or medium-sized company was often considered to be overengineering (“of course there is a simpler way to do X!”). I guess it might still be the case, but things have changed drastically.
Modern Apache Kafka doesn’t need ZooKeeper, Redpanda doesn’t need the JVM, and WarpStream doesn’t even need any disks or state to manage. There are also Apache Pulsar, AutoMQ, Buf, and probably other solutions that support the Kafka API. Many have fully managed offerings.
So, let’s forget about the “Kafka is really hard to run” argument for a second.
Streaming databases
Coming back to the original statements, I frequently hear them in the context of streaming databases. For example, when using Materialize, you can get really strong consistency AND latency guarantees when connecting directly to Postgres. If all you need is to offload a complicated query or report from your read replica, then the Postgres & Materialize combo is sufficient. I do feel like it can be a powerful way to build user-facing data products: ingest data from an operational database, transform it into the right shape, and serve API requests from the underlying storage. OH: “That microservice should have been a SQL query” is still my favourite talk on this topic.
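To make that concrete, here’s a minimal sketch of the Postgres & Materialize pattern. Materialize speaks the Postgres wire protocol, so a regular Postgres driver works; the connection string, the `orders` table, and the view definition are hypothetical, and the source setup (whose syntax varies between Materialize versions) is omitted.

```python
# A minimal sketch, not a production setup. The connection string, the "orders"
# table and the view below are hypothetical.
import psycopg2

conn = psycopg2.connect("postgresql://materialize@localhost:6875/materialize")
conn.autocommit = True

with conn.cursor() as cur:
    # An incrementally maintained view over data ingested from the operational
    # Postgres (source creation omitted; it depends on the Materialize version).
    cur.execute("""
        CREATE MATERIALIZED VIEW IF NOT EXISTS daily_revenue AS
        SELECT order_date, SUM(amount) AS revenue
        FROM orders
        GROUP BY order_date
    """)

    # The serving layer reads precomputed results instead of re-running the
    # heavy query against a read replica on every request.
    cur.execute("SELECT * FROM daily_revenue ORDER BY order_date DESC LIMIT 7")
    for row in cur.fetchall():
        print(row)

conn.close()
```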
Or perhaps you just want to move your operational data from MySQL to Snowflake. It seems like Change Data Capture (CDC) tools (e.g. Debezium) have made operational databases really popular as a data source. Even five years ago, application or clickstream events were a much more favoured data source. But nowadays, thanks to reliable CDC tools, many companies want to see their operational data in their data lake or data warehouse first.
Also, some application events could be translated into CDC streams thanks to the Outbox pattern.
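For illustration, here’s a rough sketch of the Outbox pattern: the application commits the business change and the event row in the same database transaction, and a CDC tool like Debezium then turns inserts into the outbox table into a Kafka stream. The table and column names are made up for the example.

```python
# A sketch of the Outbox pattern with hypothetical table and column names.
import json
import uuid

import psycopg2

conn = psycopg2.connect("postgresql://app@localhost:5432/shop")

with conn:  # one transaction: either both writes commit or neither does
    with conn.cursor() as cur:
        cur.execute("UPDATE orders SET status = %s WHERE id = %s", ("PAID", 42))
        cur.execute(
            """
            INSERT INTO outbox (id, aggregate_type, aggregate_id, event_type, payload)
            VALUES (%s, %s, %s, %s, %s)
            """,
            (
                str(uuid.uuid4()),
                "order",
                "42",
                "OrderPaid",
                json.dumps({"order_id": 42, "status": "PAID"}),
            ),
        )

conn.close()
```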
So, to summarize, streaming databases can act as a bridge between operational data (thanks to their CDC capabilities), which can be messy and require normalization, and the primary data lake or data warehouse.
The best data hub
But despite the increased popularity of operational data, a typical enterprise data platform consists of many more data sources: not just application or clickstream events but also third-party data, SaaS vendor data, legacy application data, and more.
When I worked at Activision, we ingested game data from game consoles (like PlayStation), mobile devices, and our partners. The game telemetry we received could contain anything: a match summary, a marketplace purchase, GPU temperature readings, or even binary crash dump data.
A messaging or streaming platform is still the best way to do data integration. Enterprise Integration Patterns is 20 years old, but it’s still incredibly relevant, and I doubt things will drastically change in the next 20 years. Yes, perhaps we’ll see data lakehouses play a larger role, but they won’t be able to cover realtime use cases (without evolving a lot).
Also, a streaming platform like Kafka is still one of the best ways to quickly and reliably ingest large volumes of structured or unstructured data.
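As a rough illustration, here’s what ingesting heterogeneous telemetry through a plain Kafka producer might look like, using the confluent-kafka Python client; the broker address, the topic name, and the event shapes are assumptions for the sketch.

```python
# A sketch of ingesting mixed telemetry through one producer. The broker address,
# the "game-telemetry" topic and the event shapes are assumptions for the example.
import json

from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

events = [
    {"type": "match_summary", "match_id": "m-123", "duration_s": 540},
    {"type": "gpu_telemetry", "device": "console", "gpu_temp_c": 71},
]

for event in events:
    # Keying by event type keeps per-type ordering within a partition.
    producer.produce(
        "game-telemetry",
        key=event["type"],
        value=json.dumps(event).encode("utf-8"),
    )

producer.flush()
```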
Finally, a streaming platform like Kafka is great at dataset reuse. Consumers are very cheap. The same topic could be used for writing data to the data lake, powering an OLAP database, hydrating a cache, populating a search index, and emitting user notifications.
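Here’s a small sketch of that reuse: two independent consumer groups reading the same topic, each tracking its own offsets. The topic and group names are illustrative, and in practice each group would run as its own service.

```python
# A sketch of fan-out via consumer groups: two groups read the same topic
# independently, each with its own committed offsets. Names are illustrative.
from confluent_kafka import Consumer

def make_consumer(group_id: str) -> Consumer:
    return Consumer({
        "bootstrap.servers": "localhost:9092",
        "group.id": group_id,            # a separate group gets its own offsets
        "auto.offset.reset": "earliest",
    })

lake_writer = make_consumer("data-lake-sink")
cache_hydrator = make_consumer("cache-hydrator")

for consumer in (lake_writer, cache_hydrator):
    consumer.subscribe(["game-telemetry"])

# Both consumers see the same records; they just do different things with them.
for name, consumer in (("lake", lake_writer), ("cache", cache_hydrator)):
    msg = consumer.poll(timeout=5.0)
    if msg is not None and msg.error() is None:
        print(name, msg.value())
    consumer.close()
```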
To be honest, I feel a bit silly repeating these features (like I’m a salesperson trying to sell Kafka). The industry should’ve learned the power of reusable data streams by now, but I guess it’s good to be reminded of it once in a while.
Verdict
If you just need to move your relational database data into a data warehouse, you don’t necessarily need a streaming platform like Kafka. However, any decently sized enterprise data platform would benefit from one.
Events
I’ll be speaking at Redpanda’s Streamfest online conference in early December. You can register here.
For a lot of smaller companies, streaming data usually comes from JavaScript tracking on websites or webhooks from SaaS tools, and tools like Segment or RudderStack tend to cover those needs out of the box. They handle web events and system events directly from an app—something that would often run through Kafka at larger companies. With features like HTTP web servers, JS tracking, debugging UIs, and batch exports to data warehouses, these tools work pretty well without requiring a full data platform setup.
I’m surprised Confluent or Redpanda haven’t targeted smaller analytics teams more. If they offered a higher-level option that reveals complexity only as organizations grow, I could see Kafka working for smaller teams, with advanced features available when they’re needed.
Maybe there’s just not enough money in that market, so building a product mainly as a lead-gen tool rather than a direct revenue driver might be too much of a distraction. Still, I wouldn’t be surprised if Confluent eventually looks to capture this segment by buying RudderStack.
Nice read! Would you use a Kafka-like solution for event sourcing to get the replayability?