<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Data Streaming Journal]]></title><description><![CDATA[Thoughts about data streaming, stream processing, and real-time data pipelines. ]]></description><link>https://www.streamingdata.tech</link><image><url>https://substackcdn.com/image/fetch/$s_!_lyL!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7327acf9-537c-46af-b976-cc28d07ae7e5_627x580.jpeg</url><title>Data Streaming Journal</title><link>https://www.streamingdata.tech</link></image><generator>Substack</generator><lastBuildDate>Wed, 29 Apr 2026 21:44:17 GMT</lastBuildDate><atom:link href="https://www.streamingdata.tech/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Yaroslav Tkachenko]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[streamingdata@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[streamingdata@substack.com]]></itunes:email><itunes:name><![CDATA[Yaroslav Tkachenko]]></itunes:name></itunes:owner><itunes:author><![CDATA[Yaroslav Tkachenko]]></itunes:author><googleplay:owner><![CDATA[streamingdata@substack.com]]></googleplay:owner><googleplay:email><![CDATA[streamingdata@substack.com]]></googleplay:email><googleplay:author><![CDATA[Yaroslav Tkachenko]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Can Kafka Queues Make Consumers Faster?]]></title><description><![CDATA[Kafka Queues promise consumer scaling beyond partitions, but benchmarks show share consumers still lag far behind standard Kafka 
consumers.]]></description><link>https://www.streamingdata.tech/p/can-kafka-queues-make-consumers-faster</link><guid isPermaLink="false">https://www.streamingdata.tech/p/can-kafka-queues-make-consumers-faster</guid><dc:creator><![CDATA[Yaroslav Tkachenko]]></dc:creator><pubDate>Mon, 27 Apr 2026 15:45:07 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/abe88b1d-fdfc-4838-a38d-3b29d5011358_1802x1114.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>For a while, messaging systems like RabbitMQ, ActiveMQ, and AWS SQS have been used as a foundational component of many enterprise architectures. <a href="https://www.enterpriseintegrationpatterns.com/">Enterprise Integration Patterns</a> is still one of my favourite books. I had my first exposure to messaging systems many years ago: I worked at a startup where we used a combination of ActiveMQ, Camel, and Akka to build event-driven microservices. </p><p>Then Apache Kafka became popular. The Distributed Log is a more generic abstraction, but many people have tried using it as a message queue, not always successfully. </p><p>Finally, Apache Kafka 4.x introduced Queues as a first-class citizen! And Queues are officially GA in Kafka 4.2.0. </p><p>Many more knowledgeable engineers have already extensively written about this:</p><ul><li><p><a href="https://www.morling.dev/blog/kip-932-queues-for-kafka/">Let&#8217;s Take a Look at... KIP-932: Queues for Kafka!</a></p></li><li><p><a href="https://www.confluent.io/blog/kafka-queue-semantics-share-consumer-ga/">Queues for Apache Kafka&#174; Is Here: Your Guide to Getting Started in Confluent</a></p></li><li><p><a href="https://rion.io/2026/02/02/surviving-the-streaming-dungeon-with-kafka-queues/">Surviving the Streaming Dungeon with Kafka Queues</a></p></li></ul><p>And the official <strong><a href="https://cwiki.apache.org/confluence/display/KAFKA/KIP-932%3A+Queues+for+Kafka">KIP-932: Queues for Kafka</a></strong>. 
</p><p>Instead of repeating what others have already covered, I wanted to explore one idea that immediately came to mind once I learned about the Queues.</p><h3>What&#8217;s in It for Data Streaming Engineers?</h3><p>Queues are a great building block if you&#8217;re building a workflow orchestrator, a task scheduler, event-driven microservices, etc. </p><p>But what about data streaming projects? Will Queues help with Spark, Flink, Kafka Connect, and similar technologies? </p><p>I thought they could be applicable in two ways:</p><ul><li><p>Helping to build complex state machines. For example, if your stream-processing engine needs to perform a lot of work per message (e.g., calling an external endpoint, enriching data, performing conditional lookups), the ability to acknowledge or reject and reprocess messages is really valuable. Ideally, you also want to use the Dead Letter Queue (DLQ): it&#8217;s coming to Kafka as a first-class citizen too. </p></li><li><p>Scaling consumers beyond the partition count. The parallelism of Kafka consumers is generally limited by the partition count: you can&#8217;t have more instances consuming data than the number of partitions. There are some workarounds (like multi-threaded consumers), but they&#8217;re quite complex and not supported in popular OSS projects like Spark and Flink. Queues seemed to offer a solution: a share group (a new Queues feature) can have more instances than partitions, and the work will be distributed accordingly. </p></li></ul><p>I focus on the latter next. It seems great in theory, but does it actually work? 
I decided to test it.</p><h3>The Benchmark</h3><ul><li><p>3-node Apache Kafka 4.2.0 cluster deployed using Strimzi Operator 0.51.0 on m8i.xlarge machines in AWS.</p></li><li><p>A single topic with <strong>4 partitions</strong> and 400M records; each record is a JSON payload with a 1KB average size.</p></li><li><p>Two Kafka consumer apps that read data from the topic (using the Bytes deserializer, so no JSON deserialization overhead) as fast as possible, perform some conditional logging, but nothing else (essentially a no-op). </p><ul><li><p>The Standard Consumer app just uses the regular KafkaConsumer.</p></li><li><p>The Share Consumer app uses the new KafkaShareConsumer with <strong>implicit acknowledgement</strong> and <strong>batch_optimized</strong> share.acquire.mode.</p></li><li><p>Both consumers have the standard high-throughput tuning (increased fetch.max.bytes, fetch.max.wait.ms, max.partition.fetch.bytes, etc.)</p></li></ul></li></ul><p>Kafka brokers were configured with:</p><pre><code>group.coordinator.rebalance.protocols: "classic,consumer,share"
share.coordinator.state.topic.replication.factor: 3
share.coordinator.state.topic.min.isr: 2
group.share.partition.max.record.locks: 10000
group.share.record.lock.duration.ms: 15000
group.share.delivery.count.limit: 10</code></pre><p>As I understand, group.share.partition.max.record.locks is especially important for high throughput: we&#8217;d like to have as many records as possible in flight. 10000 is the maximum value allowed.</p><p>I ran several experiments:</p><ul><li><p>Standard consumer (baseline): 1 instance and 4 instances. Can&#8217;t go higher because of the partition limit.</p></li><li><p>Share consumer: 1 instance, 4 instances, 8 instances. The goal was to compare with the baseline and observe the throughput increase with more instances.</p></li></ul><p>The logic of the share consumer looked like this (a bit simplified):</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;java&quot;,&quot;nodeId&quot;:&quot;22a40084-184e-4a26-9079-9da921e666ca&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-java">AtomicBoolean running = new AtomicBoolean(true);
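
// Sketch of the consumer configuration: the share-specific settings described
// above (implicit acknowledgement, batch_optimized acquire mode). The bootstrap
// address is a placeholder, and the high-throughput fetch tuning is omitted.
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "benchmark-share-group");
props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
props.put("share.acknowledgement.mode", "implicit");
props.put("share.acquire.mode", "batch_optimized");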

try (KafkaShareConsumer&lt;byte[], byte[]&gt; consumer = new KafkaShareConsumer&lt;&gt;(props)) {
    consumer.subscribe(List.of(topic));

    long consumed = 0L;

    while (running.get()) {
        ConsumerRecords&lt;byte[], byte[]&gt; records = consumer.poll(pollTimeout);
        consumed += records.count();

        // logging the number of consumed messages and the processing rate
        // every 5 seconds
    }
}</code></pre></div><h3>Results</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Z-JR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4421857-6afa-4557-b8e5-8b1d7afcd696_1270x468.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Z-JR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4421857-6afa-4557-b8e5-8b1d7afcd696_1270x468.png 424w, https://substackcdn.com/image/fetch/$s_!Z-JR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4421857-6afa-4557-b8e5-8b1d7afcd696_1270x468.png 848w, https://substackcdn.com/image/fetch/$s_!Z-JR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4421857-6afa-4557-b8e5-8b1d7afcd696_1270x468.png 1272w, https://substackcdn.com/image/fetch/$s_!Z-JR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4421857-6afa-4557-b8e5-8b1d7afcd696_1270x468.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Z-JR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4421857-6afa-4557-b8e5-8b1d7afcd696_1270x468.png" width="728" height="268.27086614173226" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f4421857-6afa-4557-b8e5-8b1d7afcd696_1270x468.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:468,&quot;width&quot;:1270,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:75101,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.streamingdata.tech/i/188944425?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4421857-6afa-4557-b8e5-8b1d7afcd696_1270x468.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Z-JR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4421857-6afa-4557-b8e5-8b1d7afcd696_1270x468.png 424w, https://substackcdn.com/image/fetch/$s_!Z-JR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4421857-6afa-4557-b8e5-8b1d7afcd696_1270x468.png 848w, https://substackcdn.com/image/fetch/$s_!Z-JR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4421857-6afa-4557-b8e5-8b1d7afcd696_1270x468.png 1272w, https://substackcdn.com/image/fetch/$s_!Z-JR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff4421857-6afa-4557-b8e5-8b1d7afcd696_1270x468.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg 
role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>As you can see, the standard consumer was able to achieve almost 1M rec/s throughput with 4 instances. </p><p>The throughput of the new share consumer is dramatically lower. We can definitely see the improvement when adding more instances, but the difference is still brutal: even eight instances of the share consumer are slower than a single standard one.</p><h2>Analysis and Conclusion</h2><p>So, why is the share consumer performance so different? I&#8217;m not a KIP-932 expert; I might be missing something (in which case, I welcome feedback!). But, as far as I understand:</p><ul><li><p>Queues add additional overhead. 
They&#8217;re designed around the idea of <em>acquisition locks</em>, and the locks are managed on the broker side.</p></li><li><p>Most importantly, even though you can have multiple instances <em>receiving</em> the data on the consumer side, the <em>fetching</em> is still pretty much bottlenecked by the number of partitions:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UmIC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1382b4d6-438f-4a17-aac6-af6a7ca9b479_1510x292.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UmIC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1382b4d6-438f-4a17-aac6-af6a7ca9b479_1510x292.png 424w, https://substackcdn.com/image/fetch/$s_!UmIC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1382b4d6-438f-4a17-aac6-af6a7ca9b479_1510x292.png 848w, https://substackcdn.com/image/fetch/$s_!UmIC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1382b4d6-438f-4a17-aac6-af6a7ca9b479_1510x292.png 1272w, https://substackcdn.com/image/fetch/$s_!UmIC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1382b4d6-438f-4a17-aac6-af6a7ca9b479_1510x292.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UmIC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1382b4d6-438f-4a17-aac6-af6a7ca9b479_1510x292.png" width="1456" height="282" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1382b4d6-438f-4a17-aac6-af6a7ca9b479_1510x292.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:282,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:116738,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.streamingdata.tech/i/188944425?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1382b4d6-438f-4a17-aac6-af6a7ca9b479_1510x292.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!UmIC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1382b4d6-438f-4a17-aac6-af6a7ca9b479_1510x292.png 424w, https://substackcdn.com/image/fetch/$s_!UmIC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1382b4d6-438f-4a17-aac6-af6a7ca9b479_1510x292.png 848w, https://substackcdn.com/image/fetch/$s_!UmIC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1382b4d6-438f-4a17-aac6-af6a7ca9b479_1510x292.png 1272w, https://substackcdn.com/image/fetch/$s_!UmIC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1382b4d6-438f-4a17-aac6-af6a7ca9b479_1510x292.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Perhaps throughput could be improved if the fetch-from-follower optimization is implemented by the share groups, but it seems quite challenging (the state will have to be 
distributed). </p></li></ul><p>Apache Kafka Queues is a very powerful abstraction, but as far as I can see, it won&#8217;t simply eliminate partitions as a scaling bottleneck for your consumers. At least, not yet. </p><div><hr></div><h3>Other Posts</h3><p>I recently wrote <a href="https://thenewstack.io/postgres-iceberg-cdc-benchmarks/">Postgres to Iceberg in 13 minutes: How Supermetal compares to Flink, Kafka Connect, and Spark</a> and <a href="https://streamacademy.io/tutorial/flink-kafka-consumer-offsets-with-the-state-processor-api/">Apache Flink: Reading and Modifying Kafka Consumer Offsets Using the State Processor API</a>, which you may find interesting.</p><div><hr></div><h3>Advanced Apache Flink</h3><p><a href="https://streamacademy.io/course/advanced-apache-flink/">Advanced Apache Flink</a> is an on-demand course focused on Flink internals, production deployment best practices, and advanced patterns.</p>]]></content:encoded></item><item><title><![CDATA[Advanced Apache Flink On-Demand Course]]></title><description><![CDATA[A few months ago, I announced the Advanced Apache Flink Bootcamp, a deeply technical training focused on Apache Flink internals, production deployment, and advanced patterns.]]></description><link>https://www.streamingdata.tech/p/advanced-apache-flink-on-demand-course</link><guid isPermaLink="false">https://www.streamingdata.tech/p/advanced-apache-flink-on-demand-course</guid><dc:creator><![CDATA[Yaroslav Tkachenko]]></dc:creator><pubDate>Mon, 23 Mar 2026 14:02:37 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/a9b78853-08cb-4eb7-89c0-22870312c2d1_705x397.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>A few months ago, I <a href="https://www.streamingdata.tech/p/announcing-data-streaming-academy">announced</a> the <strong>Advanced Apache Flink Bootcamp</strong>, a deeply technical training focused on Apache Flink internals, production deployment, and advanced patterns.</p><p>I ran 
the first cohort in January and also had the chance to conduct several corporate training sessions using the same program. I&#8217;ve received a lot of great feedback and have iterated on and polished the content.</p><p>Today, I&#8217;d like to announce the self-paced, on-demand version of the training!</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;9de4ac31-b10c-4f62-9168-bb2316cf2b51&quot;,&quot;duration&quot;:null}"></div><p>Check the <strong><a href="https://streamacademy.io/course/advanced-apache-flink/">Advanced Apache Flink course</a></strong>. Expect more courses in the future!</p>]]></content:encoded></item><item><title><![CDATA[Benchmarking CDC Tools: Supermetal vs Debezium vs Flink CDC]]></title><description><![CDATA[Classic "Big Data" stack vs Rust & Arrow.]]></description><link>https://www.streamingdata.tech/p/benchmarking-cdc-tools</link><guid isPermaLink="false">https://www.streamingdata.tech/p/benchmarking-cdc-tools</guid><dc:creator><![CDATA[Yaroslav Tkachenko]]></dc:creator><pubDate>Tue, 17 Feb 2026 15:02:33 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/06b93b75-2d94-465f-a944-0a8a929df3c4_1200x886.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>If you&#8217;ve been following my writing recently, you likely know I&#8217;m very bullish on the <a href="https://www.streamingdata.tech/p/streaming-and-the-rad-stack">Rust &amp; Apache Arrow data stack</a>. </p><p>So when I first learned about <a href="https://www.supermetal.io/">Supermetal</a> (almost 2 years ago), I got really excited! Finally, some innovation in the Change Data Capture (CDC) space. </p><p>Recently, Supermetal <a href="https://www.supermetal.io/blog/kafka-target">announced support for Kafka sinks</a>, making it a competitor to open-source tools like <a href="https://debezium.io/">Debezium</a> and <a href="https://github.com/apache/flink-cdc">Flink CDC</a>. 
</p><p>I worked with the Supermetal team to run a series of independent benchmarks. This post summarizes my findings. </p><blockquote><p>DISCLOSURE: this work was sponsored by Supermetal. I executed all benchmarks myself in my AWS account. All numbers and findings are shared as-is.</p></blockquote><h3>About CDC Tools</h3><h4>Supermetal</h4><p>Supermetal is a new CDC tool implemented using Rust and Apache Arrow. It&#8217;s very easy to use: it can be deployed as a single binary. Get the trial version <a href="https://trial.supermetal.io/">here</a> (includes 1000 hours of free sync). </p><p>It scales well <em>vertically</em> (by leveraging available CPU cores). It supports many popular databases and data warehouses. It doesn&#8217;t rely on Kafka or any kind of orchestrator: data can be delivered directly from a source to a sink (with optional object storage buffering). You can check its architecture <a href="https://docs.supermetal.io/docs/main/concepts/architecture/">here</a>.</p><p>Supermetal supports both live (e.g. reading from a replication slot) and snapshotting modes. Snapshots are always parallelizable.</p><p>Supermetal can be configured using the <a href="https://docs.supermetal.io/docs/main/quickstart/">built-in UI</a>. A <a href="https://docs.supermetal.io/docs/api/">REST API</a> is also available. Finally, a JSON config file in the same format as the API can be used; this is what I chose, as it&#8217;s a better fit for containerized workloads. The config file just describes sources and sinks.</p><h4>Debezium</h4><p>Debezium is likely the most popular CDC tool in the world. It&#8217;s implemented in Java and typically deployed as a connector in a Kafka Connect cluster. This means it relies on Kafka: CDC data is first ingested into a set of Kafka topics, and then can be delivered to sinks via another connector. </p><p>It supports pretty much all relational databases and some non-relational ones.</p><p>Debezium supports both live and snapshotting modes as well. 
An important architectural detail: Debezium connectors (at least the most popular ones, such as MySQL and Postgres) can only be deployed as single-task connectors in the Kafka Connect cluster. Snapshotting can be parallelized by increasing the number of snapshot threads (a relatively new feature). </p><p>The Debezium connector was deployed with a simple, flat config file. </p><h4>Flink CDC</h4><p>Flink CDC originally started as a collection of Flink CDC sources; nowadays, it&#8217;s a fully-fledged data integration framework. It&#8217;s also implemented in Java, using Flink as the engine.</p><p>Flink CDC supports both live and snapshotting modes as well. For live mode, it mostly relies on Debezium, since Debezium can be deployed in embedded mode, which doesn&#8217;t require a Kafka Connect cluster. For snapshotting, Flink CDC uses a custom implementation that heavily relies on the <a href="https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface">Flink Source API</a>. Most notably, this is the only one of the three implementations that allows&nbsp;<em>horizontal</em>&nbsp;scaling of the snapshotting stage: input chunks (a range of a Postgres table, for example) can be processed in&nbsp;<em>parallel</em>&nbsp;across <em>different</em> TaskManager nodes. </p><p>In the case of the Flink CDC framework, you can use YAML-based declarative pipelines, but since I used it as a source connector, I needed to implement a pipeline programmatically. </p><h3>Test Setup</h3><p>As you can guess, I had a pretty trivial (and common) goal: replicate data from Postgres to Kafka.</p><p>I used the TPC-H dataset with a scale factor (SF) of 50. If you&#8217;re not familiar with it, it consists of 8 tables of different sizes. With SF=50, the largest table (lineitem) has 300M rows, the second-largest (orders) has 75M rows, and so forth. 
</p><p>On the infra side, I had:</p><ul><li><p>AWS RDS Aurora Postgres 16, 48 ACUs (increased to 96 later).</p></li><li><p>AWS MSK with 3 express.m7g.xlarge brokers.</p></li><li><p>AWS EKS 1.34 using m8i.xlarge nodes (4 CPU cores, 16 GB RAM). </p><ul><li><p>All workloads (Supermetal agent, Kafka Connect node, Flink TaskManager) used a single node pretty much exclusively (configured to request 3.5 CPU cores and 13 GB RAM). Flink TaskManager used 4 task slots.</p></li></ul></li></ul><p>Regarding versions:</p><ul><li><p>Latest Supermetal build (provided by the Supermetal team as a Docker image).</p></li><li><p>Flink CDC 3.5.0 with Flink 1.20 deployed using Flink Kubernetes Operator 1.13.</p></li><li><p>Debezium 3.4.1.Final with Kafka Connect 4.1.1 deployed using Strimzi Operator 0.50.0.</p></li></ul><h3>Generated Data</h3><p>All three tools generated Kafka topics with JSON records. By default, Supermetal uses Debezium envelope schema, and I was able to confirm that it&#8217;s actually <em>identical</em> to what Debezium emits, not just payload fields, but message keys and headers too.</p><p>Flink CDC provides a standard JsonDebeziumDeserializationSchema for obtaining Debezium records as JSON, but you need to implement a Kafka serializer yourself. The serializer I implemented produced the same Kafka message payloads, but <strong>I skipped the message keys and headers</strong>, which likely somewhat affected the rates you see below. </p><p>Finally, I spot-checked data across topics and didn't observe any data loss. </p><h3>Snapshotting Mode</h3><p><strong>I primarily wanted to test snapshotting performance</strong>; I expected to see the most drastic differences there. I also tested live mode, but skipped Flink CDC for it (since it essentially wraps Debezium, so performance would be roughly the same or lower). 
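</p><p>For illustration, the programmatic Flink CDC pipeline mentioned earlier can be sketched roughly like this (a sketch based on the Flink CDC incremental-source builder API; hostnames, credentials, and the Kafka sink are placeholders, and error handling is omitted):</p><pre><code>PostgresSourceBuilder.PostgresIncrementalSource&lt;String&gt; source =
    PostgresSourceBuilder.PostgresIncrementalSource.&lt;String&gt;builder()
        .hostname(postgresHost)       // placeholder connection details
        .port(5432)
        .database("tpch")
        .schemaList("public")
        .tableList("public.lineitem") // plus the other TPC-H tables
        .username(postgresUser)
        .password(postgresPassword)
        .slotName("flink_cdc")
        .decodingPluginName("pgoutput")
        // emits Debezium-style JSON records
        .deserializer(new JsonDebeziumDeserializationSchema())
        .build();

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.fromSource(source, WatermarkStrategy.noWatermarks(), "Postgres CDC Source")
    .sinkTo(kafkaSink); // Kafka sink with the custom serializer described above
env.execute("postgres-to-kafka");</code></pre><p>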
</p><p>Ok, let&#8217;s explore the benchmarks now!</p><h4>Supermetal</h4><p>The baseline run with the default configuration looked like this:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5_eU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cd38c2b-b782-4b1e-920f-6314d88e2e52_2048x734.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5_eU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cd38c2b-b782-4b1e-920f-6314d88e2e52_2048x734.png 424w, https://substackcdn.com/image/fetch/$s_!5_eU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cd38c2b-b782-4b1e-920f-6314d88e2e52_2048x734.png 848w, https://substackcdn.com/image/fetch/$s_!5_eU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cd38c2b-b782-4b1e-920f-6314d88e2e52_2048x734.png 1272w, https://substackcdn.com/image/fetch/$s_!5_eU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cd38c2b-b782-4b1e-920f-6314d88e2e52_2048x734.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5_eU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cd38c2b-b782-4b1e-920f-6314d88e2e52_2048x734.png" width="1456" height="522" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2cd38c2b-b782-4b1e-920f-6314d88e2e52_2048x734.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:522,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5_eU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cd38c2b-b782-4b1e-920f-6314d88e2e52_2048x734.png 424w, https://substackcdn.com/image/fetch/$s_!5_eU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cd38c2b-b782-4b1e-920f-6314d88e2e52_2048x734.png 848w, https://substackcdn.com/image/fetch/$s_!5_eU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cd38c2b-b782-4b1e-920f-6314d88e2e52_2048x734.png 1272w, https://substackcdn.com/image/fetch/$s_!5_eU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2cd38c2b-b782-4b1e-920f-6314d88e2e52_2048x734.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>It finished in <strong>72 minutes</strong> with <strong>174 MB/s</strong> peak throughput and <strong>105 MB/s</strong> average throughput.</p><p>Supermetal team recommended to also test with:</p><ul><li><p>Disabled intra-table chunking (parallel_snapshots_enabled = false). For Kafka sinks, this improves throughput since Kafka partitions are the bottleneck, not table parallelism. This is typically not needed for sinks like data warehouses.</p></li><li><p>Producer pool size equal to the number of input tables (8). </p></li></ul><p>Another run with the updated configuration finished in <strong>60 minutes</strong> with <strong>275 MB/s</strong> peak throughput and <strong>123 MB/s</strong> average throughput. 
Spoiler: this is the best result I saw!</p><h4>Flink CDC</h4><p>The baseline run with the default configuration looked like this:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8JVr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7278f4d5-c573-4f4e-a2c6-77e79b817a1f_2048x737.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8JVr!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7278f4d5-c573-4f4e-a2c6-77e79b817a1f_2048x737.png 424w, https://substackcdn.com/image/fetch/$s_!8JVr!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7278f4d5-c573-4f4e-a2c6-77e79b817a1f_2048x737.png 848w, https://substackcdn.com/image/fetch/$s_!8JVr!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7278f4d5-c573-4f4e-a2c6-77e79b817a1f_2048x737.png 1272w, https://substackcdn.com/image/fetch/$s_!8JVr!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7278f4d5-c573-4f4e-a2c6-77e79b817a1f_2048x737.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8JVr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7278f4d5-c573-4f4e-a2c6-77e79b817a1f_2048x737.png" width="1456" height="524" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7278f4d5-c573-4f4e-a2c6-77e79b817a1f_2048x737.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:524,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8JVr!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7278f4d5-c573-4f4e-a2c6-77e79b817a1f_2048x737.png 424w, https://substackcdn.com/image/fetch/$s_!8JVr!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7278f4d5-c573-4f4e-a2c6-77e79b817a1f_2048x737.png 848w, https://substackcdn.com/image/fetch/$s_!8JVr!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7278f4d5-c573-4f4e-a2c6-77e79b817a1f_2048x737.png 1272w, https://substackcdn.com/image/fetch/$s_!8JVr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7278f4d5-c573-4f4e-a2c6-77e79b817a1f_2048x737.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>It finished in <strong>210 minutes</strong> with <strong>29 MB/s</strong> peak throughput and <strong>22 MB/s</strong> average throughput. That&#8217;s a pretty big difference compared to Supermetal, but it&#8217;s important to establish a baseline. Can we improve it?</p><p>The first obvious optimization to try was tweaking the Kafka Producer configuration: using a <strong>linger.ms</strong> of <strong>100</strong> and a <strong>batch.size</strong> of <strong>1000000</strong>. These are the values Supermetal uses by default, so it&#8217;s only fair to set the same ones here. But this change <em>didn&#8217;t show any performance gains</em>; I offer a possible explanation below.</p><p>Another thing I decided to try was scaling the job horizontally. I added <strong>three</strong> more TaskManagers, which increased total parallelism from <strong>4</strong> to <strong>16</strong>. This led to an almost linear improvement in throughput: I was able to consistently achieve <strong>84 MB/s</strong> after ramping up.
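</p><p>For reference, the producer tweak mentioned earlier comes down to just two settings. Here&#8217;s a minimal sketch (the broker address is a placeholder; with the stock Kafka clients, these properties are passed in when the producer is created):</p>

```java
import java.util.Properties;

public class ProducerTuning {
    // linger.ms makes the producer wait up to 100 ms to fill a batch;
    // batch.size allows batches of up to ~1 MB per partition.
    // Larger, fuller batches mean fewer requests and higher throughput.
    static Properties tunedProducerConfig() {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // placeholder
        props.setProperty("linger.ms", "100");
        props.setProperty("batch.size", "1000000");
        return props;
    }

    public static void main(String[] args) {
        Properties props = tunedProducerConfig();
        System.out.println(props.getProperty("linger.ms") + "/" + props.getProperty("batch.size"));
    }
}
```

<p>The trade-off is latency: a batch may sit in the accumulator for up to 100 ms before being sent, which is perfectly acceptable for a bulk snapshot.</p><p>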
But, of course, it also means additional infrastructure. </p><p>Another optimization I tried separately was increasing fetch size (how many rows a connector polls at once) and chunk/split size (how many rows are logically grouped together for processing). I increased the fetch size to <strong>5000</strong> (from the default 1024) and chunk/split size to <strong>50000</strong> (from the default 8096). This led to <strong>54 MB/s</strong>.</p><h4>Debezium</h4><p>The baseline run with the default configuration looked like this:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eeco!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F837b4d7a-75c8-48bc-aa25-4920094a381a_2048x634.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eeco!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F837b4d7a-75c8-48bc-aa25-4920094a381a_2048x634.png 424w, https://substackcdn.com/image/fetch/$s_!eeco!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F837b4d7a-75c8-48bc-aa25-4920094a381a_2048x634.png 848w, https://substackcdn.com/image/fetch/$s_!eeco!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F837b4d7a-75c8-48bc-aa25-4920094a381a_2048x634.png 1272w, https://substackcdn.com/image/fetch/$s_!eeco!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F837b4d7a-75c8-48bc-aa25-4920094a381a_2048x634.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!eeco!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F837b4d7a-75c8-48bc-aa25-4920094a381a_2048x634.png" width="1456" height="451" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/837b4d7a-75c8-48bc-aa25-4920094a381a_2048x634.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:451,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!eeco!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F837b4d7a-75c8-48bc-aa25-4920094a381a_2048x634.png 424w, https://substackcdn.com/image/fetch/$s_!eeco!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F837b4d7a-75c8-48bc-aa25-4920094a381a_2048x634.png 848w, https://substackcdn.com/image/fetch/$s_!eeco!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F837b4d7a-75c8-48bc-aa25-4920094a381a_2048x634.png 1272w, https://substackcdn.com/image/fetch/$s_!eeco!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F837b4d7a-75c8-48bc-aa25-4920094a381a_2048x634.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>It finished in <strong>170 minutes</strong> with <strong>74 MB/s</strong> peak throughput and <strong>43 MB/s</strong> average throughput. Again, much slower than Supermetal, and comparable to Flink CDC.</p><p>I tried the same obvious optimization of tweaking the Kafka Producer configuration: using a <strong>linger.ms</strong> of <strong>100</strong> and a <strong>batch.size</strong> of <strong>1000000</strong>. This time, it was beneficial: I was able to nearly double the throughput and consistently achieve <strong>100 MB/s</strong> after ramping up.</p><p>Another optimization I tried separately was increasing the number of snapshot threads from <strong>1</strong> to <strong>4</strong>. This also gave a nice boost, reaching <strong>70 MB/s</strong>.
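</p><p>As a sketch, parallel snapshotting comes down to a single Debezium connector option. The relevant fragment of a connector config might look like this (the connector name is illustrative, and connection details are omitted):</p>

```json
{
  "name": "pg-benchmark-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "snapshot.max.threads": "4"
  }
}
```

<p>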
Increasing further to <strong>8</strong> threads didn&#8217;t help though.</p><p>I also tried combining the tweaked Kafka Producer config and 4 snapshot threads, but it made things <strong>worse than the baseline</strong>. Too much contention, I guess.</p><h3>Live Mode</h3><p>The data generator I used was able to emit the same TPC-H data consistently at a given rate. It only wrote data to the top two tables: lineitem and orders.</p><h4>Debezium</h4><p>With the default config, Debezium could keep up with&nbsp;<strong>15k ops/s</strong>. Things didn&#8217;t look good at <strong>30k ops/s</strong>: the replication lag started growing.</p><p>I applied the same optimization and increased the producer&#8217;s <strong>batch.size</strong> and <strong>linger.ms</strong>. That made it possible to sustain <strong>30k ops/s</strong>. The replication slot lag remained around 800 MB.</p><p>Unfortunately, I couldn&#8217;t get my data generator to write more than ~35k ops/s, even after scaling the Postgres database up to 96 ACUs. This is likely possible with a different Postgres setup, but I was happy with the numbers at that point.</p><p>I think live Debezium connector throughput could be improved further by increasing the <strong>max.batch.size</strong> and <strong>max.queue.size</strong> config options, but I couldn&#8217;t test it. Also, it&#8217;s likely possible to lower the replication slot lag by reducing the flush time on the Kafka Connect side. </p><h4>Supermetal</h4><p>Supermetal was able to keep up with <strong>35k ops/s</strong> using the default config. The replication slot lag stayed below 100 MB.</p><p>When it comes to live data, throughput is just one consideration. Latency can be just as important. Measuring it wasn&#8217;t the goal of my benchmark, but just looking at the replication lag, Supermetal seems very promising.
Maybe we&#8217;ll have a part 2?</p><h3>Other Notes</h3><p>Supermetal used the CPU more efficiently (typically ~50% of the allocated 4 cores), while Debezium and Flink CDC mostly stayed around 25%.</p><p>Supermetal also used less memory (2 GB), whereas Debezium and Flink CDC consumed much more (8&#8211;10 GB). </p><h3>Analysis &amp; Conclusion</h3><h4>Parallelization</h4><p>If you compare the three graphs again, you&#8217;ll notice that Debezium and Flink CDC demonstrated the same behaviour: reaching a certain throughput level and roughly staying there for the duration of the test. This was true regardless of the optimizations I applied (they only affected the rate). </p><p>Supermetal behaved differently: you can see a big jump at the beginning, where it processes most of the tables in parallel, and then the throughput decreases as only the biggest table (lineitem) remains. </p><p>I think it means Supermetal can parallelize processing more efficiently, so it can likely achieve better throughput with larger tables.</p><h4>Why is Flink CDC the Slowest? </h4><p>Flink CDC seemed to be the slowest option, but why is that? It has a sophisticated snapshotting mechanism, but it still appears to be slower than Debezium. </p><p>Flink was designed to run at a large scale. There is definitely some overhead that affects workloads at a smaller scale: a distributed system almost always pays a coordination cost compared to a single-node architecture (which is why systems like DuckDB are almost always faster than Spark on small data, for example).</p><p>But I think the slowdown can, at least in part, be attributed to the amount of JSON serialization/deserialization Flink CDC performs.
Here&#8217;s the CPU flamegraph I took:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!F04g!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18a90b1a-6e06-479c-ad21-535f727acd2e_3802x1824.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!F04g!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18a90b1a-6e06-479c-ad21-535f727acd2e_3802x1824.png 424w, https://substackcdn.com/image/fetch/$s_!F04g!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18a90b1a-6e06-479c-ad21-535f727acd2e_3802x1824.png 848w, https://substackcdn.com/image/fetch/$s_!F04g!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18a90b1a-6e06-479c-ad21-535f727acd2e_3802x1824.png 1272w, https://substackcdn.com/image/fetch/$s_!F04g!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18a90b1a-6e06-479c-ad21-535f727acd2e_3802x1824.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!F04g!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18a90b1a-6e06-479c-ad21-535f727acd2e_3802x1824.png" width="1456" height="699" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/18a90b1a-6e06-479c-ad21-535f727acd2e_3802x1824.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:699,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1539958,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.streamingdata.tech/i/187903110?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18a90b1a-6e06-479c-ad21-535f727acd2e_3802x1824.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!F04g!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18a90b1a-6e06-479c-ad21-535f727acd2e_3802x1824.png 424w, https://substackcdn.com/image/fetch/$s_!F04g!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18a90b1a-6e06-479c-ad21-535f727acd2e_3802x1824.png 848w, https://substackcdn.com/image/fetch/$s_!F04g!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18a90b1a-6e06-479c-ad21-535f727acd2e_3802x1824.png 1272w, https://substackcdn.com/image/fetch/$s_!F04g!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F18a90b1a-6e06-479c-ad21-535f727acd2e_3802x1824.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>The purple color highlights JSON serialization/deserialization, which takes up almost a third of the CPU time. In stateless streaming systems, data serialization/deserialization is generally the slowest part. This is especially evident here, since the data goes through several serialization/deserialization roundtrips (first in Debezium, then in Flink). And Jackson, the widely used Java JSON library, is not the fastest option out there.
If you&#8217;re dealing with JSON in a high-performance Java system, check out <a href="https://github.com/alibaba/fastjson2">fastjson2</a> and <a href="https://github.com/simdjson/simdjson-java">simdjson-java</a>.</p><h4>I Expect an Even Bigger Difference With Transformations</h4><p>Supermetal leverages the highly optimized columnar Apache Arrow format, which will likely deliver even better performance when a transformation (such as a filter or projection) is involved, thanks to low-level compute kernels. </p><p>Supermetal doesn't support transformations yet, but it's on the roadmap. I'd love to re-run the benchmark when it ships. I believe the columnar data layout will make a huge difference!</p><h4>Summary</h4><p>Supermetal clearly delivers the best performance. Debezium and Flink CDC, once optimized, can get close. At the same time, Supermetal makes much better use of its allocated resources (CPU and memory). </p><p>If you still think that <a href="https://github.com/rewrite-bigdata-in-rust/RBIR">Rewrite Bigdata in Rust</a> is just hype, maybe reconsider. I think we&#8217;ll see more tools purpose-built to run very efficiently on modern hardware. </p><p>If you need to optimize Debezium or Flink CDC pipelines, look into tuning the Kafka Producer configuration and consider the best way to parallelize snapshots.</p><p>It&#8217;s worth highlighting that Flink CDC has a working horizontally scalable snapshotting mechanism. If you&#8217;re willing to throw more <s>money</s> compute at the problem, you can likely achieve a level of performance similar to (or higher than) Supermetal&#8217;s.</p><p>Horizontal scalability may seem like a must-have, but your database will eventually become the bottleneck anyway. Also, you can go really far by logically sharding your CDC workloads by table: I could&#8217;ve deployed 8 Supermetal agents or 8 Debezium tasks, one for each table.
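</p><p>To sketch that sharding idea (the table names are from TPC-H; the connector naming scheme is hypothetical, and connection details are omitted), each connector instance gets its own <strong>table.include.list</strong> and replication slot:</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class ShardedCdc {
    // One connector config per table: each instance snapshots and streams
    // a single table, so the work parallelizes across deployments.
    static List<Map<String, String>> shardByTable(List<String> tables) {
        List<Map<String, String>> configs = new ArrayList<>();
        for (String table : tables) {
            configs.add(Map.of(
                    "name", "pg-cdc-" + table, // hypothetical naming scheme
                    "connector.class", "io.debezium.connector.postgresql.PostgresConnector",
                    "slot.name", "cdc_" + table, // each shard needs its own replication slot
                    "table.include.list", "public." + table));
        }
        return configs;
    }

    public static void main(String[] args) {
        List<String> tpchTables = List.of("lineitem", "orders", "partsupp", "part",
                "customer", "supplier", "nation", "region");
        for (Map<String, String> config : shardByTable(tpchTables)) {
            System.out.println(config.get("name") + " -> " + config.get("table.include.list"));
        }
    }
}
```

<p>Note that each instance holds its own replication slot open, so the database-side overhead grows with the number of shards.</p><p>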
That&#8217;s definitely more painful to manage at scale, though.</p><p>I also have to mention the difference in developer experience: having a single binary and not relying on operators, clusters, and piles of configuration and glue code felt amazing!</p>]]></content:encoded></item><item><title><![CDATA[Why Apache Flink Is Not Going Anywhere]]></title><description><![CDATA[Not in the next few years.]]></description><link>https://www.streamingdata.tech/p/why-apache-flink-is-not-going-anywhere</link><guid isPermaLink="false">https://www.streamingdata.tech/p/why-apache-flink-is-not-going-anywhere</guid><dc:creator><![CDATA[Yaroslav Tkachenko]]></dc:creator><pubDate>Wed, 03 Dec 2025 14:03:26 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/39a7c3af-72cf-4198-be04-20b81954d9c9_600x443.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Astrologers proclaim the month of criticizing Apache Flink<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>, I thought to myself recently. In the last few months, I have read many posts on social media, company blogs, and newsletters that attack Flink from different angles. So, today, I&#8217;d like to reply with this statement: Flink is not going anywhere. </p><h3>What Critics Say</h3><h4>Flink Is Too Complex</h4><p>&#8220;Flink is a complex distributed system, it&#8217;s a mistake to start with it&#8221;, &#8220;Watermarking and windowing are so complex&#8221;, &#8220;Stateful Flink pipelines are so hard to manage, RocksDB is a PITA and requires tons of tuning&#8221;, &#8220;Schema and state evolution are impossible to get right&#8221;, &#8220;There are very few real low-latency use cases&#8221;, etc.</p><p>I could go on, but you get the gist. </p><p>First of all, complexity is not something you can measure. You can&#8217;t say that tool A is 37% more complex than tool B.
Something that is complex for one individual/team/organization is not necessarily complex for another, and vice versa. People often cite Postgres as an example of a simple tool that does the job. However, I believe most of them don&#8217;t mean you should actually set up and manage a Postgres cluster yourself, because managing Postgres, especially after reaching a certain scale, can be quite painful. Even managed solutions like AWS RDS don&#8217;t fully shield you from that.</p><p>Second, you need to differentiate between essential and accidental complexity. Accidental complexity comes from a lack of experience, over-engineering, frequent pivots, and tight deadlines. But essential complexity exists no matter what we do (that&#8217;s why it&#8217;s called <em>essential</em>). For example, everyone agrees that Kubernetes is a very complex technology. But I&#8217;d argue that most of the complexity is essential: reliable infrastructure and application management are just really hard by definition. </p><p>I believe the same applies to Flink. Why? Supporting a variety of data streaming use cases (from data integration to streaming ETL to real-time analytics to Complex Event Processing) is quite hard. It&#8217;s also one of the most popular Apache projects, with thousands of contributors who run Flink in many different scenarios. When a project stays that popular for that long, it inevitably reduces its accidental complexity over time; otherwise, it&#8217;d be impossible to keep adding new features.</p><p>But, most importantly, you need to look at what Flink is being compared with. When many critics say that Flink is complex, they compare it with a <strong>proprietary, managed, paid</strong> service they offer. It&#8217;s easy to make that statement because you have:</p><ul><li><p>On one side, an open-source technology that you need to deploy and manage yourself. </p></li><li><p>On the other side, <em>a small surface area</em> of a managed service.
</p></li></ul><p>And when it comes to operating Flink, you can&#8217;t avoid infrastructure management, SRE practices, observability, and CI/CD tooling. But the claims that it&#8217;s really hard to manage are outdated, to say the least. You can choose from many vendors (all major clouds, Confluent, Ververica). Also, the Flink Kubernetes Operator makes it <a href="https://github.com/apache/flink-kubernetes-operator/blob/main/examples/basic.yaml">very easy</a> to deploy and manage Flink in a Kubernetes environment. </p><p>There are also claims that Flink requires a &#8220;cluster setup&#8221; and that it&#8217;s &#8220;centralized&#8221;, and that you therefore need a specialized team to manage it. I guess this stems from the fact that Flink has a JobManager component which communicates with many TaskManagers (or &#8220;workers&#8221;). But with modern infrastructure management tools like the Flink Kubernetes Operator, all of this is mostly abstracted from you. And managing Flink becomes almost as easy as managing a pool of stateless web applications. I said &#8220;almost&#8221; because every Flink pipeline is stateful in some way (e.g., storing Kafka consumer group offsets), so it requires an extra step to take or recover from a savepoint, etc. But it can be easily automated.</p><p>Another claim I&#8217;ve seen: Flink is complex because it guarantees exactly-once delivery. However, this is a connector-level property: some connectors offer at-least-once, some exactly-once, and some both. Exactly-once <em>can</em> complicate your end-to-end system significantly (e.g. the Kafka sink uses Kafka transactions), but you can almost always fall back to at-least-once.</p><p>Finally, I&#8217;d like to add this: <em>of course,</em> you shouldn&#8217;t use Flink for simple use cases that don&#8217;t require it. And <em>of course,</em> you don&#8217;t need to use <em>all</em> the features that Flink provides.
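</p><p>To give a sense of the &#8220;very easy&#8221; claim above: a complete operator-managed deployment fits in one small manifest. A sketch (the image, version, and jar path are illustrative; see the linked example for a canonical one):</p>

```yaml
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: basic-example
spec:
  image: flink:1.20          # illustrative Flink version
  flinkVersion: v1_20
  flinkConfiguration:
    taskmanager.numberOfTaskSlots: "2"
  serviceAccount: flink
  jobManager:
    resource:
      memory: "2048m"
      cpu: 1
  taskManager:
    resource:
      memory: "2048m"
      cpu: 1
  job:
    jarURI: local:///opt/flink/examples/streaming/StateMachineExample.jar
    parallelism: 2
    upgradeMode: stateless
```

<p>The operator then handles job submission and upgrades, including the savepoint step mentioned above when a savepoint-based upgrade mode is used.</p><p>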
Also, Flink is a great, versatile tool; however, sometimes it&#8217;s more efficient to use it alongside something else. For example, arguably, many aggregation-related use cases can be better supported by a modern OLAP database like ClickHouse. But in some cases, it makes total sense to use windowed aggregations in Flink: ML feature engineering and sessionization come to mind. As usual, the answer to &#8220;should I use this or that&#8221; is not binary; it requires planning and research<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>. </p><h4>Nobody Needs Realtime</h4><p>Data streaming is usually associated with low latency, as opposed to batch processing. &#8220;Low latency&#8221; can mean many different things in different contexts, but let&#8217;s assume that in the case of Flink, we&#8217;re talking about sub-100ms processing latency. Many critics say that very few use cases actually demand latency like this; therefore, Flink is not needed. </p><p>And this is a completely wrong take. </p><p>Data streaming is not about latency. It&#8217;s definitely helpful to have <em>relatively</em> low latency, but that can mean 100 milliseconds, 1 second, or even 10 seconds in most scenarios. </p><p>Data streaming <strong>semantics</strong> and <strong>continuous processing</strong> are the important parts. The ability to process data streams with transformations that leverage continuous processing is extremely powerful. For example, I can design my data streams as changelogs and enjoy end-to-end upsert semantics that automatically correct data. I wrote about it here:</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;8edeb273-eef3-4434-b1d6-813dc7d5cf35&quot;,&quot;caption&quot;:&quot;The idea of changelog topics or data streams is not new: stream-table duality advertised by Kafka Streams was popularized quite a bit in 2018 (paper, blog).
Related concepts have been mentioned regularly in the past 30 years (the earliest paper I could find is from 1992). And nowadays, you can see it everywhere, not just in Kafka Streams:&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Changelog vs Append-Only Data Streams&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:5669707,&quot;name&quot;:&quot;Yaroslav Tkachenko&quot;,&quot;bio&quot;:&quot;Data Streaming Advocate.&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7327acf9-537c-46af-b976-cc28d07ae7e5_627x580.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2023-08-07T16:04:10.132Z&quot;,&quot;cover_image&quot;:null,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://www.streamingdata.tech/p/changelog-vs-append-only-data-streams&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:135723081,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:3,&quot;comment_count&quot;:0,&quot;publication_id&quot;:1787299,&quot;publication_name&quot;:&quot;Data Streaming Journey&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!_lyL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7327acf9-537c-46af-b976-cc28d07ae7e5_627x580.jpeg&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p>Modern data streaming semantics have evolved from decades-old messaging semantics. <a href="https://www.enterpriseintegrationpatterns.com/">Enterprise Integration Patterns</a> is a book released 20+ years ago, and I still recommend it today. 
This book explains that messaging is the best approach for integrating different systems, something that data systems still struggle with today. It also describes <a href="https://www.enterpriseintegrationpatterns.com/patterns/messaging/">patterns</a> that became the foundation for many of the tools we use today, including Kafka and Flink.</p><p>So, yes, we rarely need 50ms processing latency. But we need reliable ways to process data streams, and stream-processing tools like Flink are, arguably, the best answer we have.</p><h4>Look at the Confluent Earnings Report!</h4><p>For some reason, people think looking at how much Confluent makes by selling Flink is a good indicator of&#8230; I don&#8217;t really know what. Flink adoption? Popularity? Growth? Not sure. But it&#8217;s being used <em>widely</em> as an argument to support their claims. </p><div class="pullquote"><p>Context: Confluent <a href="https://investors.confluent.io/static-files/0ee39fd5-5a96-432c-b6d3-f2d5cf2c5aaf">reported</a> its Flink ARR (Annual Recurring Revenue) for Q2 of 2025 to be $10M, stating 3x growth from the previous quarter. Q3 2025 <a href="https://investors.confluent.io/static-files/a2cf1818-192e-48c0-bcc5-2b85b1be3b6d">report</a> shows another 70% growth. So, the full picture so far, as far as I understand it: $3M (Q1) &#8594; $10M (Q2) &#8594; $17M (Q3).</p></div><p>Is it a low number? I don&#8217;t know. What are we comparing it with? </p><p>I&#8217;m confident that most of the data infrastructure companies launched in the past few years that compete with Flink aren&#8217;t even remotely close to that ARR number. But sure, public company vs startups, not a fair comparison.</p><p>Anyway, whatever number Confluent reports is pretty much irrelevant to Apache Flink, <em>the technology</em>. 
Here&#8217;s why:</p><ul><li><p>Confluent Cloud doesn&#8217;t support the DataStream API, which is still the most popular and widely used way to author Flink pipelines.</p></li><li><p>Confluent Cloud doesn&#8217;t support any Flink connectors except the Kafka ones (or, to be correct, Confluent Kafka ones: you can&#8217;t use Confluent Flink without Confluent Kafka). This, again, eliminates a huge number of reasons to use Flink in the first place (e.g. compared to Kafka Streams). The ability to connect to arbitrary data stores, and not just Kafka, is really powerful.</p></li><li><p>Their SQL / Table API Cloud offering seemed to target the wrong audience: you had to write Flink SQL in a browser in a notebook-like environment. This is not what most engineers would want to use. This is not a Flink problem; this is a product problem. Thankfully, they&#8217;re addressing it now with IDE extensions (and the API is getting more mature).</p></li></ul><h4>The &#8220;I Don&#8217;t Know Java&#8221; Argument Is Outdated</h4><p>Don&#8217;t want to write Java? Just use Flink SQL.</p><p>The majority of the new data processing technologies tend to favour SQL. Many of them also implement Incremental View Maintenance (IVM) semantics: reusing previously computed results to avoid full re-processing. </p><p>And it&#8217;s quite common to see this comparison: &#8220;look at the complex Flink Java code on one side and this simple SQL query on the other side&#8221;. Of course, the SQL query typically looks more concise. But Flink SQL has been around for <em>years</em>; you should really compare with that!</p><p>Flink SQL has its own problems, of course. State evolution is still a big problem. But it&#8217;s generally a tough problem for any system: very few IVM tools can actually <em>reuse</em> internal state when the query <em>is modified</em>.</p><p>Other than handling state evolution, Flink SQL (and the Table API) is mature and stable. Netflix has 10000+ Flink SQL pipelines. 
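</p><p>To make that comparison concrete, here is roughly what a typical streaming aggregation looks like in Flink SQL: per-user revenue over one-minute event-time windows. The table, topic, and column names below are invented for illustration:</p><pre><code class="language-sql">-- Hypothetical Kafka-backed table; names and connector options are illustrative.
CREATE TABLE orders (
  user_id BIGINT,
  amount DECIMAL(10, 2),
  order_time TIMESTAMP(3),
  WATERMARK FOR order_time AS order_time - INTERVAL '5' SECOND
) WITH (
  'connector' = 'kafka',
  'topic' = 'orders',
  'properties.bootstrap.servers' = 'localhost:9092',
  'format' = 'json'
);

-- Per-user revenue over 1-minute tumbling event-time windows.
SELECT user_id, window_start, window_end, SUM(amount) AS revenue
FROM TABLE(TUMBLE(TABLE orders, DESCRIPTOR(order_time), INTERVAL '1' MINUTE))
GROUP BY user_id, window_start, window_end;</code></pre><p>The equivalent DataStream job would need an explicit watermark strategy, a keyed window operator, and a custom aggregate function - exactly the Java boilerplate that tends to get screenshotted in those comparisons.</p><p>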
Shopify heavily invests in Flink SQL. LinkedIn uses Flink SQL to unify streaming and batch queries. </p><p>I believe Flink SQL can cover over 70% of common Flink use cases. If you&#8217;re writing a new Flink pipeline, look at Flink SQL first (then the Table API, and only then consider using the DataStream API).</p><p>I agree that trying to bridge databases and stream-processing systems is the right direction. Adopting the SQL-first approach makes sense. You get declarative pipelines, you leverage query optimizations, you get fully schematized datasets, etc. But you cannot, and should not, discard Flink just because it&#8217;s not a database-native tool: Flink SQL gets you really close. </p><h3>Flink Is Ubiquitous</h3><p>No matter what critics say, Flink adoption is really strong. Pretty much all Big Tech companies use Flink (it&#8217;s a long list to type, but just to name a few: Apple, Airbnb, Alibaba, Booking, DoorDash, eBay, Instacart, LinkedIn, Lyft, Netflix, Pinterest, Reddit, Salesforce, Shopify, Stripe, Uber). Name the top 10 companies in pretty much any vertical, and most of them would use Flink. </p><p>You could say that it happens because of historical reasons or due to inertia. Look at OpenAI, then<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>: they could&#8217;ve chosen any bleeding-edge tech for their data processing needs, but they ended up using Flink. There are plenty of startups that choose Flink too. My favourite example is <a href="https://www.grepr.ai/">Grepr</a>: you won&#8217;t find Flink mentioned anywhere except the job postings, yet it&#8217;s at the core of the product. </p><p>I also don&#8217;t see many examples of companies switching from Flink to other technologies. 
Of course it happens, but it&#8217;s quite rare (or I just don&#8217;t hear about this, so let me know).</p><h3>Flink Future: Process Table Functions (PTFs) and Disaggregated Storage</h3><p>I believe that Process Table Functions (PTFs) will take Flink to another level. </p><p>PTFs give access to state and timer primitives inside the Table API. So it&#8217;s possible to implement low-level, DataStream-style operators and seamlessly integrate them into your SQL logic. </p><p>Just <a href="https://nightlies.apache.org/flink/flink-docs-release-2.1/docs/dev/table/functions/ptfs/#greeting-with-memory">look at this example</a>. No, seriously, look at it and come back.</p><p>It&#8217;s amazing, isn&#8217;t it? If you scroll a bit further, you&#8217;ll find the example with timers. </p><p>I&#8217;m not aware of any other data processing technology that exposes such powerful primitives as state and timers to user-defined functions (UDF). And not just exposes, but does it so elegantly. Most UDFs out there are simple, stateless bits of logic. But with PTFs, it&#8217;s possible to implement any kind of stateful operator (join, aggregation, etc.) optimized for your specific use case. </p><p>I can even imagine a PTF marketplace in the future!</p><p>Disaggregated state backend storage introduced in Flink 2.x is another major initiative that I&#8217;m excited about. Large stateful Flink pipelines can be hard to manage, especially when it comes to recovery. Flink follows the popular trend of shifting local storage to object storage, which makes it possible to have relatively cheap pipelines with a lot of state without thinking about local storage.</p><p>Disaggregated storage is still rough around the edges, but the upcoming Flink 2.2 release should address some of the issues. Even though it&#8217;s still marked as experimental, I&#8217;d consider it for some production workloads after 2.2 is out.</p><h3>Conclusion</h3><p>I believe Flink has a very bright future. 
I made <a href="https://irontools.dev/blog/announcement/">a big bet</a> on that. </p><p>You could say I am incentivized to present Flink in the best light possible. But even if it&#8217;s the case, there is nothing wrong with that - I&#8217;m sure my readers are capable of making their own decisions. I&#8217;ve been writing repeatedly about the&nbsp;<a href="https://www.streamingdata.tech/p/streaming-and-the-rad-stack">RAD stack</a>&nbsp;(Rust, Arrow, DataFusion) as the next iteration of&nbsp;data processing tooling. I truly believe that Rust, Arrow, DataFusion and related projects will slowly replace Flink, Spark and a bunch of other &#8220;classic&#8221; Big Data tools. Not in the next few years, though. </p><p>Flink has a lot to offer. That&#8217;s why I&#8217;m working on <strong><a href="https://streamacademy.io/course/advanced-apache-flink/">an advanced Flink bootcamp</a></strong> and training - I&#8217;d like to spread my knowledge as much as I can. This bootcamp will be great for you if you already have some Flink knowledge, but you&#8217;d like to become an expert. </p><p>Finally, some recommendations to the critics: </p><ul><li><p>Do better than forming echo chambers. We can learn from each other.</p></li><li><p>If you criticize Flink, try to be specific and offer an alternative solution. 
It&#8217;s easy to criticize without offering anything.</p></li><li><p>Vague statements like &#8220;Flink is complex&#8221; and &#8220;nobody needs realtime&#8221; are not helpful.</p></li></ul><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Kudos to those few who got the HOMM3 reference.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Which is another way to say: it depends.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Sure it was founded in 2015, but it really started to grow only in 2023.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Current New Orleans 2025]]></title><description><![CDATA[Conference highlights.]]></description><link>https://www.streamingdata.tech/p/current-new-orleans-2025</link><guid isPermaLink="false">https://www.streamingdata.tech/p/current-new-orleans-2025</guid><dc:creator><![CDATA[Yaroslav Tkachenko]]></dc:creator><pubDate>Mon, 03 Nov 2025 16:46:41 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/52b64b33-d2ed-46f1-8ce3-5992279f37c0_3439x2563.heic" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>Current 2025</strong> happened last week in New Orleans, USA. I&#8217;ve had a chance to attend it, and I&#8217;d love to share some of my notes with you.</p><h3>Keynotes &amp; Announcements</h3><p><strong><a href="https://www.confluent.io/product/confluent-intelligence/">Confluent Intelligence</a></strong> likely was the biggest announcement. 
It&#8217;s a combination of three products:</p><ul><li><p>Built-in ML functions for forecasting, anomaly detection, fraud detection, etc. These are implemented as Flink UDFs that call internal models. Some examples: ML_FORECAST, ML_DETECT_ANOMALIES, ML_PREDICT, etc. I believe that at least some of these functions have been supported for a while.</p></li><li><p><a href="https://www.confluent.io/product/streaming-agents/">Streaming Agents</a>. I&#8217;m not a big fan of this name. There is actually very little <em>agentic</em> functionality: if you think about it, these are Flink pipelines with UDFs that can call LLMs and MCP servers. That&#8217;s it. In my opinion, a truly agentic framework needs to support branching, loops, and, most importantly, somewhat non-deterministic execution that&#8217;s generated by the LLM output. So, workflow orchestration frameworks such as Restate, Temporal, DBOS, etc., are a better fit for this, in my opinion.</p></li><li><p><a href="https://www.confluent.io/blog/introducing-real-time-context-engine-ai/">Real-Time Context Engine</a>. A fancy name for what it seems like, an in-memory query engine (likely an Incremental View Maintenance system), which sits on top of hot (Kafka) and cold (Iceberg) data and exposes itself over MCP. This is the most interesting announcement for me, because the IVM system will <em>actually</em> be useful even after the current AI bubble bursts. Building a new IVM system is no small feat, and I&#8217;d love to learn more about its implementation!</p></li></ul><p>And I have to mention <a href="https://blog.airy.co/airy-is-joining-confluent/">Airy&#8217;s acquisition</a> (acquihire?). Airy had a lot of experience with Flink AND LLMs, so this move makes a lot of sense.</p><p><a href="https://www.confluent.io/blog/introducing-confluent-private-cloud/">Confluent Private Cloud</a> is another big announcement worth mentioning. Confluent is going to offer a flavour of its cloud product for on-prem deployments. 
But I&#8217;m confused: they already have Confluent Platform for that (which still generates a lot of revenue), so will Private Cloud replace Platform? Complement it? Evolve it? Someone, please clarify. You could also say that WarpStream is another on-prem product offered by Confluent, so will they compete? </p><p>By the way, WarpStream was mentioned for about 10 seconds in a 2-hour keynote. I understand they still operate as a separate org, but they build a lot of cool features. Please share them with the world!</p><p>Another series of announcements was about Tableflow. Support for Delta Lake and Unity Catalog is now GA. They also released upserts and DLQs (Dead Letter Queues). It&#8217;s funny to see how much Confluent invests in better supporting Databricks as a partner, while <a href="https://www.databricks.com/blog/announcing-public-preview-zerobus-ingest">Databricks is working</a> on a data ingestion solution that doesn&#8217;t require Kafka&#8230;</p><p>Most of the other announcements can be found <a href="https://www.confluent.io/blog/2025-q4-confluent-cloud-launch/">here</a>.</p><p>Other random observations:</p><ul><li><p>Jay Kreps (CEO @ Confluent) noted the importance of Flink <em>batch</em> jobs for iterating on data.</p></li><li><p>Shawn Clowes (CPO @ Confluent), about Flink: &#8220;One of the most <em>successful</em> products we&#8217;ve ever launched as a company&#8221;.</p></li><li><p>Also from Shawn: we have &#8220;the <em>largest</em> number of connectors in the entire streaming ecosystem&#8221;. Seems like it&#8217;s targeting Redpanda Connect (which claims a higher number of connectors). </p></li></ul><h3>AI</h3><p>Stateful computations such as aggregations, joins and windowing, as well as products like Streaming Agents and Real-Time Context Engine, enable Flink to do one thing really well: build context for LLMs. 
I don&#8217;t think it&#8217;s currently capable of much more than that, but building real-time, highly personalized and relevant context is already a big win. As I mentioned above, I don&#8217;t believe we should be calling it an agentic framework, but does it really matter?</p><p>&#8220;Just&#8221; being the best way to build LLM context should make data streaming extremely attractive in the current AI hype cycle. </p><p>Confluent announced Real-Time Context Engine last week, but some companies have been working on similar approaches for a while. For example, check the <a href="https://www.deltastream.io/blog/">DeltaStream blog</a>, which shares many concrete use cases for building context using Flink stateful operations and then calling LLMs.</p><p>Of course, aside from the announcements, AI was a hot topic in many private conversations I had. In person, most people are not very impressed with the current AI tooling (and neither am I), to say the least. <a href="https://www.linkedin.com/feed/update/urn:li:activity:7389635987397738496/">Stanislav Kozlovski nicely summarized</a> my concerns, and he didn&#8217;t even touch on the moral or environmental considerations. And, sometimes, I feel like <a href="https://bsky.app/profile/noethematt.bsky.social/post/3m4aqwrmqtk2r">Emma Thompson</a>. </p><p>I can&#8217;t say what&#8217;s going to happen, but I feel like we&#8217;re in some sort of bubble. If you overly rely on AI tools right now, my suggestion is to have a backup plan (for when the underlying service either completely disappears or increases its price by 10x - 100x). </p><h3>Interesting Talks</h3><p>Here are some of the interesting talks I&#8217;ve had a chance to attend.</p><h4>FlinkSketch: Democratizing the Benefits of Sketches for the Flink Community</h4><p>This was a short lightning talk about <a href="https://github.com/ProjectASAP/FlinkSketch">FlinkSketch</a>: a library of sketching algorithms for Flink. 
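</p><p>For a flavour of what these algorithms trade away, here is a minimal Count-Min sketch in plain Python. This is a generic illustration, not FlinkSketch&#8217;s actual API: it answers frequency queries in a fixed amount of memory and only ever overestimates.</p><pre><code class="language-python">import hashlib

class CountMinSketch:
    """Approximate frequency counts in O(width * depth) memory."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _indexes(self, item):
        # One independent hash per row, derived by salting with the row number.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                item.encode(), salt=str(row).encode(), digest_size=8
            ).digest()
            yield row, int.from_bytes(digest, "big") % self.width

    def add(self, item, count=1):
        for row, idx in self._indexes(item):
            self.table[row][idx] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum across rows
        # is the tightest (over-)estimate.
        return min(self.table[row][idx] for row, idx in self._indexes(item))</code></pre><p>Memory stays constant no matter how many distinct keys arrive, which is why this family of data structures fits windowed aggregations so well.</p><p>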
I think this library deserves more attention: it&#8217;s very common to use Flink for streaming analytics and generating ML features, both of which typically rely on windowed aggregations.</p><p>Sketching algorithms can significantly reduce resource usage (e.g., memory) while slightly sacrificing accuracy. This is a great compromise for many workloads.</p><h4>Change Data Capture at Scale: Insights from Slack&#8217;s Streaming Pipeline</h4><p>This was a talk about the Change Data Capture architecture for Slack&#8217;s sharded MySQL setup (Vitess). I really liked the very methodical approach to performance optimizations; it can be generally applied to most data-intensive applications.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UVUK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d13ccaa-9886-4eeb-9170-36dac041198f_3466x1981.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UVUK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d13ccaa-9886-4eeb-9170-36dac041198f_3466x1981.heic 424w, https://substackcdn.com/image/fetch/$s_!UVUK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d13ccaa-9886-4eeb-9170-36dac041198f_3466x1981.heic 848w, https://substackcdn.com/image/fetch/$s_!UVUK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d13ccaa-9886-4eeb-9170-36dac041198f_3466x1981.heic 1272w, 
https://substackcdn.com/image/fetch/$s_!UVUK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d13ccaa-9886-4eeb-9170-36dac041198f_3466x1981.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UVUK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d13ccaa-9886-4eeb-9170-36dac041198f_3466x1981.heic" width="1456" height="832" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5d13ccaa-9886-4eeb-9170-36dac041198f_3466x1981.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:832,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:380161,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.streamingdata.tech/i/177483816?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d13ccaa-9886-4eeb-9170-36dac041198f_3466x1981.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!UVUK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d13ccaa-9886-4eeb-9170-36dac041198f_3466x1981.heic 424w, https://substackcdn.com/image/fetch/$s_!UVUK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d13ccaa-9886-4eeb-9170-36dac041198f_3466x1981.heic 848w, 
https://substackcdn.com/image/fetch/$s_!UVUK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d13ccaa-9886-4eeb-9170-36dac041198f_3466x1981.heic 1272w, https://substackcdn.com/image/fetch/$s_!UVUK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d13ccaa-9886-4eeb-9170-36dac041198f_3466x1981.heic 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h4>Making agents, workflows, and event-driven apps as simple as SpringBoot, with Restate</h4><p>The latest demo of <a 
href="https://www.restate.dev/">Restate</a>, a durable execution engine. Even if you&#8217;re somewhat familiar with the product, it might still be worth watching: I was surprised to discover it now comes with a very powerful UI for introspecting any aspect of execution. </p><p>I&#8217;m not sure if it was recorded at all, but if it was, watch the recording until the end. You may hear a Grammy-winning marching band crashing the talk (only in New Orleans!) </p><h4>StreamLink: Real-Time Data Ingestion at OpenAI Scale</h4><p>Insightful talk about OpenAI&#8217;s Flink infrastructure in the context of data ingestion. OpenAI uses YAML files for topic configuration (surprise) with the ability to add any topic for data lake ingestion. I also liked the clever use of Airflow sensors to mark partitions as completed.</p><p>In addition, they covered many details about their Flink control plane called Watchdog, which relies on the Flink Kubernetes Operator. Some great learnings here about the Flink Table API challenges and issues with job restarts.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SPUM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7b700b4-9250-45bf-833e-368c86131b23_2872x1787.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SPUM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7b700b4-9250-45bf-833e-368c86131b23_2872x1787.heic 424w, https://substackcdn.com/image/fetch/$s_!SPUM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7b700b4-9250-45bf-833e-368c86131b23_2872x1787.heic 848w, 
https://substackcdn.com/image/fetch/$s_!SPUM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7b700b4-9250-45bf-833e-368c86131b23_2872x1787.heic 1272w, https://substackcdn.com/image/fetch/$s_!SPUM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7b700b4-9250-45bf-833e-368c86131b23_2872x1787.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!SPUM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7b700b4-9250-45bf-833e-368c86131b23_2872x1787.heic" width="1456" height="906" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e7b700b4-9250-45bf-833e-368c86131b23_2872x1787.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:906,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:261188,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.streamingdata.tech/i/177483816?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7b700b4-9250-45bf-833e-368c86131b23_2872x1787.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!SPUM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7b700b4-9250-45bf-833e-368c86131b23_2872x1787.heic 424w, 
https://substackcdn.com/image/fetch/$s_!SPUM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7b700b4-9250-45bf-833e-368c86131b23_2872x1787.heic 848w, https://substackcdn.com/image/fetch/$s_!SPUM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7b700b4-9250-45bf-833e-368c86131b23_2872x1787.heic 1272w, https://substackcdn.com/image/fetch/$s_!SPUM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7b700b4-9250-45bf-833e-368c86131b23_2872x1787.heic 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h4>Sizing, Benchmarking and Performance Tuning Apache Flink Clusters</h4><p>Great talk from Robert Metzger, Flink PMC. A lot of it was about establishing a systematic approach to performance engineering, including setting a baseline, running experiments, and employing first-principles thinking. He also talked about choosing between horizontal and vertical scaling, tuning RocksDB and Kafka producers, as well as tweaking Flink memory.</p><p>If you are interested in improving the performance of your Flink pipelines, you may want to scroll to the end of the newsletter &#128521;.</p><h4>Press Play on Data: Netflix&#8217;s Journey from Streams to Gaming Insights</h4><p>Great talk about Netflix&#8217;s Streaming SQL (Flink) infrastructure and a specific gaming insights use case. I like how many guardrails and validation checks it has, which are absolutely necessary when operating thousands of Flink pipelines. 
</p><p>Many gaming-related challenges reminded me of the challenges we faced back at Activision: even though the games could be very different, collecting, analyzing and storing telemetry data can be similar.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dSYm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65a42d17-15f7-4b38-a7ab-362257341e36_3630x2083.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dSYm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65a42d17-15f7-4b38-a7ab-362257341e36_3630x2083.heic 424w, https://substackcdn.com/image/fetch/$s_!dSYm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65a42d17-15f7-4b38-a7ab-362257341e36_3630x2083.heic 848w, https://substackcdn.com/image/fetch/$s_!dSYm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65a42d17-15f7-4b38-a7ab-362257341e36_3630x2083.heic 1272w, https://substackcdn.com/image/fetch/$s_!dSYm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65a42d17-15f7-4b38-a7ab-362257341e36_3630x2083.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dSYm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65a42d17-15f7-4b38-a7ab-362257341e36_3630x2083.heic" width="1456" height="835" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/65a42d17-15f7-4b38-a7ab-362257341e36_3630x2083.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:835,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:560245,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.streamingdata.tech/i/177483816?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65a42d17-15f7-4b38-a7ab-362257341e36_3630x2083.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!dSYm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65a42d17-15f7-4b38-a7ab-362257341e36_3630x2083.heic 424w, https://substackcdn.com/image/fetch/$s_!dSYm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65a42d17-15f7-4b38-a7ab-362257341e36_3630x2083.heic 848w, https://substackcdn.com/image/fetch/$s_!dSYm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65a42d17-15f7-4b38-a7ab-362257341e36_3630x2083.heic 1272w, https://substackcdn.com/image/fetch/$s_!dSYm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65a42d17-15f7-4b38-a7ab-362257341e36_3630x2083.heic 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3>Redpanda Resistance</h3><p>As you may recall, Redpanda <a href="https://www.streamingdata.tech/i/163988023/kafka-summit-current">was banned</a> from Current London earlier this year. It looks like the ban is still in effect, but Redpanda is not giving up. </p><p>In addition to buying ads in the New Orleans airport (which happened to run after the Confluent ones), they also set up a camp directly across the street from the conference venue. 
They had <a href="https://www.linkedin.com/posts/redpanda-data_current2025-streamingdata-currentnola-activity-7389339053126709248-E7rU/">puppies, ice cream and good vibes</a>.</p><p>By the way, a day before Current, Redpanda announced a new <a href="https://www.redpanda.com/blog/agentic-data-plane-adp">Agentic Data Plane</a>, as well as the <a href="https://www.oxla.com/">Oxla</a> acquisition, which will allow them to offer a SQL query engine on top of their data. I&#8217;m particularly excited about it because SQL is useful not just for agents!</p><p>You can learn more at <a href="https://www.redpanda.com/streamfest">Redpanda Streamfest</a> later this week.</p><h3>Confluent Catalysts</h3><p>I&#8217;m thrilled to be in the <a href="https://developer.confluent.io/catalysts/">Confluent Community Class of 2025 - 2026</a>! I attribute this to the newsletter you&#8217;re reading, so thank you A LOT for that!</p><div><hr></div><h3>Upstream: New Webinar Series </h3><p>I was fortunate enough to be the first guest on a new webinar series called Upstream, hosted by Jan Siekierski; you can watch the first episode <a href="https://www.youtube.com/watch?v=X6Ukpi2p4y4">here</a>.</p><div id="youtube2-X6Ukpi2p4y4" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;X6Ukpi2p4y4&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/X6Ukpi2p4y4?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><div><hr></div><h3>Data Streaming Academy</h3><p>I&#8217;ve recently announced <a href="https://streamacademy.io/">Data Streaming Academy</a>: the best place to master data streaming technologies.
Join the <a href="https://streamacademy.io/#waitlist">waitlist</a> now to be notified about the upcoming Advanced Apache Flink bootcamp. </p>]]></content:encoded></item><item><title><![CDATA[Announcing Data Streaming Academy]]></title><description><![CDATA[The best place to master data streaming technologies.]]></description><link>https://www.streamingdata.tech/p/announcing-data-streaming-academy</link><guid isPermaLink="false">https://www.streamingdata.tech/p/announcing-data-streaming-academy</guid><dc:creator><![CDATA[Yaroslav Tkachenko]]></dc:creator><pubDate>Mon, 27 Oct 2025 16:18:13 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/56e674f0-15cf-4d79-8fc8-fd95a1594c59_1426x1001.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Over the past few years, many of you have suggested the same thing: <em>&#8220;You should write a book or launch a course.&#8221;</em> It&#8217;s something I&#8217;ve always taken seriously (I even started writing a book during Covid), but the time commitment never quite aligned with having a full-time job. </p><p>This past year, however, has been one of exploration: building a developer tooling company, consulting, and doing a lot more writing. So I started thinking: maybe this is the right moment to experiment with education too.</p><p>So today I&#8217;m announcing <strong>Data Streaming Academy</strong>: the best place to master data streaming technologies. I&#8217;m going to kick off with a hands-on, deeply technical development bootcamp dedicated to Apache Flink: <strong>Advanced Apache Flink Bootcamp</strong>.</p><p>There are many resources online dedicated to Flink, but most of them are only helpful when you&#8217;re just starting. It&#8217;s really hard to find advice and best practices based on years of experience actually running Flink in production. And this is exactly what I offer.
We&#8217;ll go deep into the DataStream and Table APIs, understand the internals of connectors, and, most importantly, learn how to make Flink pipelines resilient, scalable, and observable. </p><p>So, here it is: <strong><a href="https://streamacademy.io/">StreamAcademy.io</a>.</strong> </p><p>If you&#8217;re even remotely interested in learning Flink or other data streaming technologies, join <a href="https://streamacademy.io/#waitlist">the waitlist</a> - it&#8217;s the only way to get early access (and discounted pricing). </p><p>And if you&#8217;d like to influence which modules or future courses come next, please share your input by filling out <a href="https://forms.gle/YRJ5hfqZkKzPDgWM7">this form</a>.</p>]]></content:encoded></item><item><title><![CDATA[Flink Forward 2025]]></title><description><![CDATA[Conference highlights.]]></description><link>https://www.streamingdata.tech/p/flink-forward-2025</link><guid isPermaLink="false">https://www.streamingdata.tech/p/flink-forward-2025</guid><dc:creator><![CDATA[Yaroslav Tkachenko]]></dc:creator><pubDate>Wed, 22 Oct 2025 15:51:10 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/44d349e5-2759-42c4-98c2-d00752490437_4032x3024.heic" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The <strong>Flink Forward</strong> conference happened last week in Barcelona, Spain. It&#8217;s the main Apache Flink event, organized by Ververica, and it attracts many practitioners from companies like Netflix, Apple, Shopify, LinkedIn, etc.</p><h3>Ververica Announcements</h3><p>Ververica made several announcements during the keynote:</p><ul><li><p><a href="https://fluss.apache.org/">Apache Fluss</a> is now available in the Ververica Platform. This makes a lot of sense given the significant investment in the project. </p></li><li><p><a href="https://www.ververica.com/blog/vera-x-introducing-the-first-native-vectorized-apache-flink-engine">VERA-X</a>: A native vectorized Apache Flink engine.
I share some thoughts below.</p></li><li><p>&#8220;Real-Time AI With Rag And LLM Support&#8221;: It sounds fancy, but in practice, it typically means having a few UDFs that call OpenAI APIs, which is not that exciting. Also, RAG is so last year! &#128578; </p></li></ul><p>The rest of the keynote was a bit underwhelming, but I did appreciate Ben Gamble&#8217;s whirlwind demo of the latest features, like the Delta join.</p><p>&#8220;AI co-host&#8221; was very cringe. Please don&#8217;t do it.</p><h3>VERA-X</h3><p>Here&#8217;s my reaction:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5B2n!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb41ed09-34a1-402f-8d9a-fde01881d04a_1134x852.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5B2n!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb41ed09-34a1-402f-8d9a-fde01881d04a_1134x852.png 424w, https://substackcdn.com/image/fetch/$s_!5B2n!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb41ed09-34a1-402f-8d9a-fde01881d04a_1134x852.png 848w, https://substackcdn.com/image/fetch/$s_!5B2n!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb41ed09-34a1-402f-8d9a-fde01881d04a_1134x852.png 1272w, https://substackcdn.com/image/fetch/$s_!5B2n!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb41ed09-34a1-402f-8d9a-fde01881d04a_1134x852.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!5B2n!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb41ed09-34a1-402f-8d9a-fde01881d04a_1134x852.png" width="728" height="546.9629629629629" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/db41ed09-34a1-402f-8d9a-fde01881d04a_1134x852.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:852,&quot;width&quot;:1134,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:627693,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.streamingdata.tech/i/176212846?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb41ed09-34a1-402f-8d9a-fde01881d04a_1134x852.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5B2n!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb41ed09-34a1-402f-8d9a-fde01881d04a_1134x852.png 424w, https://substackcdn.com/image/fetch/$s_!5B2n!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb41ed09-34a1-402f-8d9a-fde01881d04a_1134x852.png 848w, https://substackcdn.com/image/fetch/$s_!5B2n!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb41ed09-34a1-402f-8d9a-fde01881d04a_1134x852.png 1272w, 
https://substackcdn.com/image/fetch/$s_!5B2n!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb41ed09-34a1-402f-8d9a-fde01881d04a_1134x852.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>As I said, VERA-X is not a new product; Alibaba has been working on it for years (under the name <a href="https://www.alibabacloud.com/blog/flash-a-next-gen-vectorized-stream-processing-engine-compatible-with-apache-flink_602088">Flash</a>).</p><p>After studying the implementation details, I was surprised to realize how similar it is to Iron Vector.
E.g., having specialized Row-to-Column and Column-to-Row operations, columnar UDF support, a memory manager, etc.</p><p>I&#8217;m <em>really</em> excited about this direction. People have tried to replace Flink with new stream-processing engines, but they haven&#8217;t succeeded. I think Iron Vector or VERA-X is a healthier approach to getting a significant runtime upgrade.</p><h3>The Atmosphere on the Ground</h3><p>The event felt on the smaller side: only a handful of sponsors and no real &#8220;expo hall&#8221;. When talks started, the hallways immediately emptied, which to me indicates a large number of engineers in attendance. And whenever I talked to someone in the crowd, they almost always turned out to be an engineer or a manager closely involved in using or building on top of Flink.</p><p>I also heard many questions like &#8220;Why do you think data streaming is not getting enough adoption?&#8221;. I have some thoughts, so stay tuned. In the meantime, I&#8217;d be curious to compare this event to Current next week. </p><h3>Interesting Talks</h3><h4>The *Big State* Monster: Taming State Size in Multi-Way Joins with FLIP-516</h4><p>Great overview of a problem with large joins: they can grow in complexity and state size very quickly.
For example, multi-step joins are usually translated into a chain of binary joins, which explodes the state.</p><p>The new multi-way join operator can alleviate this problem (and multi-way joins can also be chained to support complex topologies).</p><p>This reminds me of <a href="https://www.youtube.com/watch?v=tiGxEGPyqCg">the work we did at Shopify</a> to support joins of 10+ streams.</p><h4>Apache Fluss and the Seven Deadly Sins of Streaming Analytics</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qZUU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F652bf099-2896-4d96-ab7b-3c4d58ddf198_4032x3024.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qZUU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F652bf099-2896-4d96-ab7b-3c4d58ddf198_4032x3024.heic 424w, https://substackcdn.com/image/fetch/$s_!qZUU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F652bf099-2896-4d96-ab7b-3c4d58ddf198_4032x3024.heic 848w, https://substackcdn.com/image/fetch/$s_!qZUU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F652bf099-2896-4d96-ab7b-3c4d58ddf198_4032x3024.heic 1272w, https://substackcdn.com/image/fetch/$s_!qZUU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F652bf099-2896-4d96-ab7b-3c4d58ddf198_4032x3024.heic 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!qZUU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F652bf099-2896-4d96-ab7b-3c4d58ddf198_4032x3024.heic" width="728" height="546" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/652bf099-2896-4d96-ab7b-3c4d58ddf198_4032x3024.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:1092,&quot;width&quot;:1456,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:1252714,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.streamingdata.tech/i/176212846?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F652bf099-2896-4d96-ab7b-3c4d58ddf198_4032x3024.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!qZUU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F652bf099-2896-4d96-ab7b-3c4d58ddf198_4032x3024.heic 424w, https://substackcdn.com/image/fetch/$s_!qZUU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F652bf099-2896-4d96-ab7b-3c4d58ddf198_4032x3024.heic 848w, https://substackcdn.com/image/fetch/$s_!qZUU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F652bf099-2896-4d96-ab7b-3c4d58ddf198_4032x3024.heic 1272w, 
https://substackcdn.com/image/fetch/$s_!qZUU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F652bf099-2896-4d96-ab7b-3c4d58ddf198_4032x3024.heic 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Great talk. The reasoning behind Fluss is solid - I&#8217;m sold! The adoption is lacking, and the fact that Fluss doesn&#8217;t support the Kafka protocol (and maybe never will!) is not helping. Still, I hope Fluss will get more recognition.
</p><h4>Petabytes, Pipelines &amp; PyFlink: How We Stream, Enrich &amp; Validate Billions of Events</h4><p>Good talk showing how to build data pipelines with PyFlink. Mentions advanced concepts like data enrichment, DLQs, etc.</p><h4>Redefining Flink Reliability &#8212; Blue/Green Deployments in Production</h4><p>Blue/Green deployments were one of the key features absent from the Flink Kubernetes Operator. However, basic support for Blue/Green has been added in recent versions. This is a good talk covering the reasons why you may want to use it, along with lots of gotchas and implementation details. Make sure to listen to the Q&amp;A after the talk.</p><h4>Democratizing Flink SQL at Shopify: Scaling Streaming for Every Developer</h4><p>In my opinion, my ex-coworker Ryan delivered the best talk of the conference.</p><p>Shopify has been busy building an impressive developer experience for Flink. A custom VS Code extension is at the center of it.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mu68!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F531f9d42-4b33-44a4-8ee9-7c2a22e7ade1_2204x1060.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mu68!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F531f9d42-4b33-44a4-8ee9-7c2a22e7ade1_2204x1060.png 424w, https://substackcdn.com/image/fetch/$s_!mu68!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F531f9d42-4b33-44a4-8ee9-7c2a22e7ade1_2204x1060.png 848w, 
https://substackcdn.com/image/fetch/$s_!mu68!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F531f9d42-4b33-44a4-8ee9-7c2a22e7ade1_2204x1060.png 1272w, https://substackcdn.com/image/fetch/$s_!mu68!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F531f9d42-4b33-44a4-8ee9-7c2a22e7ade1_2204x1060.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!mu68!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F531f9d42-4b33-44a4-8ee9-7c2a22e7ade1_2204x1060.png" width="1456" height="700" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/531f9d42-4b33-44a4-8ee9-7c2a22e7ade1_2204x1060.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:700,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:453047,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.streamingdata.tech/i/176212846?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F531f9d42-4b33-44a4-8ee9-7c2a22e7ade1_2204x1060.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!mu68!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F531f9d42-4b33-44a4-8ee9-7c2a22e7ade1_2204x1060.png 424w, 
https://substackcdn.com/image/fetch/$s_!mu68!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F531f9d42-4b33-44a4-8ee9-7c2a22e7ade1_2204x1060.png 848w, https://substackcdn.com/image/fetch/$s_!mu68!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F531f9d42-4b33-44a4-8ee9-7c2a22e7ade1_2204x1060.png 1272w, https://substackcdn.com/image/fetch/$s_!mu68!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F531f9d42-4b33-44a4-8ee9-7c2a22e7ade1_2204x1060.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>It offers:</p><ul><li><p>Flink Notebook experience right in your IDE</p></li><li><p>Flink Catalog integration</p></li><li><p>Access to local and remote Flink session clusters</p></li><li><p>CLI tool for creating UDFs</p></li></ul><p>It seems to be inspired by <a href="https://github.com/eBay/Flink-SQL-Extension">this project</a> from eBay. Shopify is planning to open-source its extension as well. </p><p>The talk also covered a brief history of Flink at Shopify, Stream/Batch unification efforts and their K8S setup for Flink.</p><h4>Flink SQL 2025: Powering Real-Time AI and Stream Processing Innovations</h4><p>A good overview of the latest additions to Flink SQL: things like CREATE MODEL and ML_PREDICT for working with LLMs, VARIANT type for making it easier to work with JSON, Delta and Multi-way Joins.</p><h4>Powering Stateful Joins at Scale with Flink SQL at LinkedIn</h4><p>This talk describes some of the internals of the managed streaming SQL platform at LinkedIn, focusing on practical challenges when running a large stateful Flink pipeline. </p><p>A few interesting observations: </p><ul><li><p>LinkedIn still uses Flink 1.16 (released exactly 3 years ago!)</p></li><li><p>They switched from SATA SSDs to NVMe SSDs to boost IO performance (duh). Whenever someone asks me how to speed up a stateful Flink pipeline, my top answer is always: use the fastest SSDs you can get.</p></li></ul><h4>Dynamic, Scalable, and Schema-Evolving: Introducing the Flink Dynamic Iceberg Sink</h4><p>A must-watch if you use the Apache Iceberg sink with Flink.</p><div><hr></div><p>There were many more great talks I didn&#8217;t have a chance to attend.
The recordings will be available soon, and I highly recommend checking them out.</p><div><hr></div><h3>Irontools</h3><p><a href="https://irontools.dev/product/iron-vector/">Iron Vector</a> is a native, columnar, vectorized, high-performance accelerator for Apache Flink SQL and Table API pipelines.</p><p>It&#8217;s easy to install, requires no code changes, and can increase compute efficiency by up to 2x (as of now).</p><p>Check the announcement <a href="https://irontools.dev/blog/introducing-iron-vector/">here</a>.</p><div><hr></div><h3>Events</h3><p>Find me at <a href="https://current.confluent.io/new-orleans">Current, New Orleans</a> &#127482;&#127480; next week (October 29th - October 30th).</p><p></p>]]></content:encoded></item><item><title><![CDATA[Streaming and the RAD Stack]]></title><description><![CDATA[RAD: Rust, Arrow, DataFusion]]></description><link>https://www.streamingdata.tech/p/streaming-and-the-rad-stack</link><guid isPermaLink="false">https://www.streamingdata.tech/p/streaming-and-the-rad-stack</guid><dc:creator><![CDATA[Yaroslav Tkachenko]]></dc:creator><pubDate>Mon, 22 Sep 2025 15:56:40 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/a50516aa-d86c-4b0a-a066-b7791b886c16_1260x900.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Earlier this year, I wrote a post about exploring Apache DataFusion as a foundation for a streaming framework:</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;8a673446-474e-4a9c-8dab-551f46749a7c&quot;,&quot;caption&quot;:&quot;Please pledge your support if you find this newsletter useful. I&#8217;m not planning to introduce paid-only posts anytime soon, but I&#8217;d appreciate some support from the readers. 
Thank you!&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Exploring Apache DataFusion as a Foundation for Streaming Framework&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:5669707,&quot;name&quot;:&quot;Yaroslav Tkachenko&quot;,&quot;bio&quot;:&quot;Data Streaming Advocate.&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7327acf9-537c-46af-b976-cc28d07ae7e5_627x580.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-02-10T16:57:53.055Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/22480225-6681-4cd6-aa74-153c77f571e5_840x600.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://www.streamingdata.tech/p/exploring-apache-datafusion-streaming-framework&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:155738973,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:9,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Data Streaming Journey&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!_lyL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7327acf9-537c-46af-b976-cc28d07ae7e5_627x580.jpeg&quot;,&quot;belowTheFold&quot;:false,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p>At the time, it was mostly theoretical: I had started working on a streaming framework for a client of mine, and I only had a couple of months of Rust and DataFusion experience. </p><p>Over the course of this year, the framework became feature-complete, and it&#8217;s being integrated as a core product offering. 
I hope to write more about it one day; there is a plan to open-source it eventually. </p><p>At the same time, I started to work on the new <a href="https://irontools.dev/">Irontools</a> extension called <a href="https://irontools.dev/product/iron-vector/">Iron Vector</a>. It&#8217;s a native, columnar, vectorized, high-performance accelerator for Apache Flink SQL and Table API pipelines. I&#8217;m building it with Rust, Arrow and DataFusion. Feel free to check the announcement post with more details <a href="https://irontools.dev/blog/introducing-iron-vector/">here</a>.</p><p>So, this post is my attempt to summarize my actual, hands-on learnings from building several streaming products with what I call <strong>the RAD Stack: Rust, Arrow, DataFusion.</strong></p><h3>Following Up </h3><p>First, let me follow up on a few key areas I identified in the previous post.</p><h4>Checkpointing and Fault Tolerance</h4><p>I implemented the classic Chandy&#8211;Lamport-style algorithm for checkpointing. I also added a simple, pluggable state backend (with Postgres support being the target). </p><p>The main use case for this was storing Kafka consumer group offsets, but only when all nodes in a pipeline acknowledge processing (up to a certain epoch). You get an at-least-once delivery guarantee. </p><p>The implementation was pretty straightforward. It&#8217;s important to have decent test coverage to catch edge cases.</p><p>I really liked Postgres as the main state backend: it&#8217;s perfect for storing small amounts of data (like the consumer offsets). If you need to modify the consumer position, you just modify a table row in Postgres. </p><p>If I were to look at storing larger amounts of data, I&#8217;d seriously consider using <a href="https://slatedb.io/">SlateDB</a>.</p><h4>Scaling Out Beyond a Single Node</h4><p>As you probably guessed, we ended up relying on Kafka consumer groups quite a bit, Kafka Streams style.
You can spin up many pods in parallel; each gets its own share of partitions. Adding autoscaling can be pretty straightforward. </p><p>In other scenarios, we could rely on the data source characteristics to parallelize data processing. E.g., if a data source supports range scanning, it&#8217;s possible to spin up several pods in parallel and assign a different range to each. It&#8217;s definitely not a generic solution. </p><p>Also, in many situations, having just a single pod was enough. You can go pretty far with vertical scaling.</p><p>I didn&#8217;t have a chance to explore <a href="https://datafusion.apache.org/ballista/">Ballista</a> or <a href="https://www.ray.io/">Ray</a>. </p><h4>Connectors: Building Your Own</h4><p>I built several connectors (typically implemented as <a href="https://datafusion.apache.org/library-user-guide/custom-table-providers.html">TableProviders</a>), but there are no interesting insights to share: you take an external client, wrap it into DataFusion primitives, hook it up to the checkpointing system, and it&#8217;s done. Converting to/from Arrow can be tricky, but Arrow support is getting better<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>.</p><p>I do want to share my experience using <a href="https://github.com/datafusion-contrib/datafusion-table-providers">datafusion-table-providers</a>: a <em>somewhat</em> official repository of the community &#8220;connectors&#8221; (TableProviders). </p><p>My experience aligns with the observation I&#8217;ll make below in the post: there is just not much awareness about streaming use cases at the moment. Look at the way <a href="https://github.com/datafusion-contrib/datafusion-table-providers/blob/main/core/src/postgres/write.rs">the Postgres sink</a> is implemented:</p><ul><li><p>It starts a transaction</p></li><li><p>It writes all input RecordBatch records as inserts.
<em>It keeps writing until the input stream is terminated</em>:</p><ul><li><p><code>while let Some(batch) = data.next().await</code></p></li></ul></li><li><p>Finally, it commits the transaction</p></li></ul><p>Can you see the problem? <em>The input stream never ends</em> in a streaming environment. So this logic has to be modified to introduce triggers (e.g. record-based or time-based) that periodically commit the current transaction and start a new one.</p><h3>Implementing a Plugin System</h3><p>If you build a connector (or any kind of library) using Java, you typically just need to:</p><ul><li><p>Implement a certain interface </p></li><li><p>Package it in a JAR file </p></li><li><p>Make it available in the classpath, either by using a dependency manager or simply adding the JAR file to the classpath</p></li></ul><p>This is not the case with natively compiled languages like Rust. With Rust, the path of least resistance is building static binaries. This means recompiling your application <em>every time you change any of your dependencies</em>, including external libraries. This can be very painful at scale or when dealing with external contributions. </p><p>Using dynamic rather than static linking is the typical answer to this problem. However, dynamic linking with Rust comes with its own challenges. The Rust ABI is unstable, so a minor Rust version difference (or even having different compiler flags enabled) can lead to compatibility issues. </p><p>Using a Foreign Function Interface (FFI) and crates like <a href="https://crates.io/crates/abi_stable">abi_stable</a> is the standard workaround. But it&#8217;s not for the faint of heart! There is a great post series called <a href="https://nullderef.com/series/rust-plugins/">A Plugin System in Rust</a> that walks you through the implementation end-to-end, covering many learnings along the way. However, even if you follow all best practices, you can still end up with many unsolved problems. 
For example, there is no good solution for sharing Tokio runtimes between the host and plugins. So, if your plugin needs to perform async operations (and it almost certainly does), you have a choice between creating a Tokio runtime per plugin (which becomes prohibitively expensive with every plugin you add) or exposing a small subset of functionality via FFI-safe structures (thanks to <a href="https://docs.rs/async-ffi/latest/async_ffi/">async_ffi</a>).</p><p>Whew, I know. It&#8217;s messy.</p><h3>Performance Gain Is Visible</h3><p>Performance is one of the reasons we&#8217;re doing it in the first place, and oh boy, it delivers. I ran hundreds of benchmarks for the past few months, and it&#8217;s not uncommon to see a 2x, 3x or even 5x throughput increase after rewriting something with the RAD stack. I believe Arrow is one of the reasons behind it. Columnar, vectorized execution can be much more efficient for streaming workloads. </p><p>This aligns with benchmarks by Arroyo (which claimed a 3x&#8211;5x throughput increase) and <a href="https://risingwave.com/blog/the-preview-of-stream-processing-performance-report-apache-flink-and-risingwave-comparison/">RisingWave</a> (with a 2x throughput increase as the norm for stateless workloads).</p><p>DataFusion is not <em>that</em> high in the <a href="https://benchmark.clickhouse.com/">ClickBench results</a>, but that doesn&#8217;t mean it&#8217;s not fast. It&#8217;s just that there are faster solutions for certain types of queries, but none of them comes even close when it comes to extensibility, which is covered in the next section.</p><h3>Extending DataFusion: Lessons Learned</h3><p>DataFusion was initially designed as a batch query engine. There have been many improvements over the years to make it more compatible with streaming semantics. Sometimes, the batch nature is very obvious (e.g. operators with EmissionType::Final: <em>Records are only emitted once all input has been processed</em>). 
But sometimes, it&#8217;s so subtle that it&#8217;s very hard to notice until you hit issues in production.</p><h4>Pay Attention to Your Functions</h4><p>DataFusion has a variety of standard system functions commonly found in any SQL database: comparison, string, math, etc. Most of the functions are either <strong>immutable</strong> (<em>always return the same output when given the same input</em>) or <strong>volatile</strong> (<em>may change the return value from evaluation to evaluation</em>). </p><p>There are also several <strong>stable</strong> functions. From the documentation:</p><pre><code><em>A stable function may return different values given the same input across different queries but must return the same value for a given input within a query.</em></code></pre><p>These functions are now(), current_time() and current_date(). From the batch engine perspective, it may make sense to keep the value of now() the same during the execution of a given query. However, in a streaming context, we only have one never-ending query! This means that by default, now() will always return the same value (captured at the beginning of execution), even if called a week after the streaming query was started. </p><p>Thankfully, it&#8217;s very easy to come up with custom implementations of those functions that are basically the same as the built-in ones, but marked as Volatile. Then it&#8217;s trivial to override a system function with a custom function that has the same name: DataFusion doesn&#8217;t really differentiate between system and user-defined functions. They use exactly the same API and registration mechanism. </p><h4>Stateless and Incremental Doesn&#8217;t Mean Streaming Friendly</h4><p>Projections and filters form the foundation of stateless stream processing. Surely such primitive operators support streaming execution. <a href="https://i.imgflip.com/a6iu6j.jpg">Right? Right???</a></p><p>Well, kinda. Let&#8217;s look at the filter (FilterExec operator). 
It propagates the input emission type, so if the input emits data incrementally, the filter operator does the same.</p><p>However, there is one implementation detail that can come back to haunt you. If the filter operator filters out <em>all</em> rows from the current batch, it doesn&#8217;t emit an empty batch. Instead, it just waits for the next batch that&#8217;s not empty &#129760;. This is ok for the batch engine, but a streaming system could be relying on a continuous stream of batches (even empty ones) for many reasons (e.g. observability). </p><p>And again, DataFusion makes it very easy to replace the built-in filter operator with a custom one. Just fork it, tweak the logic, and then add a PhysicalOptimizerRule that replaces any FilterExec with your custom operator. It takes a few lines of code. Seriously, is there a query engine out there that allows you to replace the operator behind such an important, fundamental feature (a WHERE clause) with this kind of simplicity?</p><h4>Not Optimizing Is Sometimes a Good Thing</h4><p>Imagine you have an Iceberg table as a data source and you want to run two queries against it. Naturally, these queries have different projections and filters. Any reasonable query engine implements optimizations like projection and predicate pushdown, which limit the amount of data returned by the source. From the query plan perspective, you query the same table twice, but you expect different outputs. </p><p>Now, take a Kafka topic as a data source. Kafka doesn&#8217;t support projection or predicate pushdown<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>. And when you issue two queries with different projections and filters, you actually DO want to query this source <em>once</em>, and apply the projections and filters <em>later</em>. I believe it&#8217;s called <em>scan sharing</em> in some literature. 
But it&#8217;s not how DataFusion (and most query engines) is designed to behave, so you end up reading the same topic multiple times, even though it could&#8217;ve been done just once. </p><p>I haven&#8217;t implemented a good workaround for this just yet, but I believe it&#8217;ll involve disabling certain optimizer rules and broadcasting the data from one operator to many. By the way, Apache Flink has scan sharing implemented: you can find the <em>reuse</em> marker in its query plans.</p><p>I could go on, but I think these findings paint a clear picture: DataFusion is extensible enough to support anything you want, but it takes some effort to get there.</p><h3>Community Alignment Is Not Quite There</h3><p>As you probably understand by now, streaming support is not a very high priority for the DataFusion project. It feels like, sometimes, it happens to support some streaming primitives by accident (and because it&#8217;s actually a very good thing to do even in many batch-oriented scenarios). </p><p>However, the community is generally interested in this:</p><ul><li><p><a href="https://github.com/apache/datafusion/issues/4285">[Proposal] Streaming execution support roadmap</a></p></li><li><p><a href="https://github.com/apache/datafusion/issues/11404">[DISCUSSION] Support for Streaming in DataFusion</a></p></li></ul><p>So, I think it&#8217;s just a matter of time and the number of contributors involved. Folks from <a href="https://www.synnada.ai/">Synnada</a> have been contributing a lot on the streaming side, and I&#8217;m very grateful for that! I hope that more contributors can start thinking about streaming scenarios; e.g. modifying that Postgres sink to support streaming execution can actually be beneficial for batch workloads too (you don&#8217;t want to keep transactions open for long).</p><h3>Conclusion</h3><p><a href="https://github.com/rewrite-bigdata-in-rust/RBIR">Rewriting Bigdata in Rust</a> is slowly happening. 
Arrow is becoming the standard for data exchange. I haven&#8217;t even mentioned <a href="https://arrow.apache.org/docs/format/Flight.html">Arrow Flight</a>, <a href="https://arrow.apache.org/adbc/current/index.html">ADBC</a> and <a href="https://substrait.io/">Substrait</a> (which Iron Vector uses). </p><p>Overall efficiency (and cost efficiency in particular) has been quite a focus in the past few years. I hope <strong>the RAD stack</strong> is here to stay: let&#8217;s try to get more from the infrastructure we have.</p><div><hr></div><h3>Irontools</h3><p><a href="https://irontools.dev/product/iron-vector/">Iron Vector</a> is a native, columnar, vectorized, high-performance accelerator for Apache Flink SQL and Table API pipelines.</p><p>It&#8217;s easy to install, requires no code changes, and can increase compute efficiency by up to 2x (as of now).</p><p>Check the announcement <a href="https://irontools.dev/blog/introducing-iron-vector/">here</a>.</p><div><hr></div><h3>Events</h3><p>Find me at the following events next month:</p><ul><li><p><a href="https://www.polyglotsoftware.com/">Polyglot Unconference</a>, Vancouver &#127464;&#127462;, October 11th</p></li><li><p><a href="https://www.flink-forward.org/barcelona-2025">Flink Forward</a>, Barcelona &#127466;&#127480;, October 15th - October 16th</p></li><li><p><a href="https://current.confluent.io/new-orleans">Current</a>, New Orleans &#127482;&#127480;, October 29th - October 30th</p></li></ul><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Also, this is one of the areas where LLMs can really shine: hand-writing boilerplate conversion logic from one format to another is no fun.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" 
target="_self">2</a><div class="footnote-content"><p>You could filter data by partition or by the Kafka message timestamp, but that&#8217;s pretty much it.</p></div></div>]]></content:encoded></item><item><title><![CDATA[No More Excuses for Stream/Table Duality]]></title><description><![CDATA[Kafka IS your LakeHouse now, accept it.]]></description><link>https://www.streamingdata.tech/p/no-more-excuses-for-streamtable-duality</link><guid isPermaLink="false">https://www.streamingdata.tech/p/no-more-excuses-for-streamtable-duality</guid><dc:creator><![CDATA[Yaroslav Tkachenko]]></dc:creator><pubDate>Mon, 18 Aug 2025 16:02:02 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/8f9d60c5-6566-4a76-bc11-10e50a9dda97_1434x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Last week, <a href="https://aiven.io/blog/iceberg-topics-for-apache-kafka-zero-etl-zero-copy">Aiven announced</a> a free, open-source implementation of Apache Iceberg support for Apache Kafka topics. This is such an important milestone that I decided to take a quick break from building <a href="https://irontools.dev/">Irontools</a> and write this short post.</p><p>Many Kafka vendors already support Iceberg sync (Confluent, Redpanda, Bufstream, Streambased), but this seems to be the first <strong>free</strong> and <strong>open-source</strong> implementation<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>. </p><p>It&#8217;s not quite production-ready at the moment (it needs to at least support basic schema evolution), but I imagine they&#8217;ll catch up quickly.</p><p>I don&#8217;t want to rephrase many posts about why Iceberg is important for Kafka. 
Feel free to check Aiven&#8217;s announcement, as well as posts from other companies:</p><ul><li><p><a href="https://www.confluent.io/blog/introducing-tableflow/">Introducing Tableflow</a> from Confluent </p></li><li><p><a href="https://www.redpanda.com/blog/apache-iceberg-topics-streaming-data">Apache Iceberg Topics: Stream directly into your data lake</a> from Redpanda</p></li></ul><h3>Single-Copy Implementation </h3><p>Perhaps the most interesting implementation detail mentioned in the <a href="https://github.com/Aiven-Open/tiered-storage-for-apache-kafka/blob/main/iceberg_whitepaper.md">whitepaper</a> is the absence of data copying. Log segments are converted to Parquet files and stored in object storage. These files are used <strong>both</strong> by the Iceberg reader and by the Kafka Tiered Storage components that serve Kafka requests. This likely means extra work on the fly, and it must affect read latency. This may sound like a bad idea, but I think it&#8217;s already proven: <a href="https://www.youtube.com/watch?v=M5UyngLQzzo">Streambased works in a similar fashion</a>.</p><h3>Streaming-First Architecture</h3><p>I wrote about Streaming-First Architecture last year:</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;f9192cea-9274-4030-a490-c9db716bcb0a&quot;,&quot;caption&quot;:&quot;Please pledge your support if you find this newsletter useful. I&#8217;m not planning to introduce paid-only posts anytime soon, but I&#8217;d appreciate some support from the readers. 
Thank you!&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Streaming-First Architecture&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:5669707,&quot;name&quot;:&quot;Yaroslav Tkachenko&quot;,&quot;bio&quot;:&quot;Data Streaming Advocate.&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7327acf9-537c-46af-b976-cc28d07ae7e5_627x580.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-09-02T15:47:29.914Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/077e4280-258f-4dca-8881-f907c768a799_420x300.jpeg&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://www.streamingdata.tech/p/streaming-first-archirecture&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:147981517,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:8,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Data Streaming Journey&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!_lyL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7327acf9-537c-46af-b976-cc28d07ae7e5_627x580.jpeg&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p>I believe it&#8217;s a very powerful way of building streaming data pipelines. However, I identified LakeHouse support as the last missing building block. But it&#8217;s no longer missing! I didn&#8217;t want to say this when Confluent or Redpanda made their announcements: in the end, those were proprietary products. 
But with Aiven&#8217;s latest release, we also have a free, open-source implementation that will surely drive the adoption. Maybe we&#8217;ll even see some unification (e.g., a standard set of Kafka metadata columns&#8230; a man can dream).</p><h3>Data Platforms in 2030</h3><p>And here&#8217;s another post I wrote a year before that:</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;3c367d0d-fc66-43ad-9c82-b4fa6a747511&quot;,&quot;caption&quot;:&quot;Spoiler alert: I&#8217;m not a psychic and don&#8217;t have a time machine, so I can&#8217;t really say what the data platforms will look like in 5-7 years. But I&#8217;ve been noticing some trends that are hard to ignore, so I wanted to make a prediction or two. Maybe there is also a bit of a wish list from me here &#128578;&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Data Platforms in 2030&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:5669707,&quot;name&quot;:&quot;Yaroslav Tkachenko&quot;,&quot;bio&quot;:&quot;Data Streaming 
Advocate.&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7327acf9-537c-46af-b976-cc28d07ae7e5_627x580.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2023-09-05T15:46:25.896Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa249188e-8320-4e3b-8eb0-b1a9c40f329d_1298x1180.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://www.streamingdata.tech/p/data-platforms-in-2030&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:136208753,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:14,&quot;comment_count&quot;:4,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Data Streaming Journey&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!_lyL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7327acf9-537c-46af-b976-cc28d07ae7e5_627x580.jpeg&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p>In that post, I described the idea of a &#8220;consolidated data engine&#8221;: tightly integrated streaming log (e.g. Kafka), a LakeHouse (e.g. Iceberg) and a fast OLAP database (e.g. ClickHouse). You can build almost any data product with that combo. </p><p>It&#8217;s 2025, and it seems like we&#8217;re already 2/3 there &#128526;. 
I&#8217;m really excited about the next 5 years!</p><div><hr></div><p><em>I&#8217;m building <a href="https://irontools.dev/">Irontools</a>: a suite of Apache Flink extensions to make your streaming pipelines faster, cheaper, and more flexible.</em> </p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>For <strong>Apache</strong> Kafka. There are Kafka-compatible products like AutoMQ that released their support earlier.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Current London 2025]]></title><description><![CDATA[Keynote]]></description><link>https://www.streamingdata.tech/p/current-london-2025</link><guid isPermaLink="false">https://www.streamingdata.tech/p/current-london-2025</guid><dc:creator><![CDATA[Yaroslav Tkachenko]]></dc:creator><pubDate>Thu, 29 May 2025 16:12:24 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/90984058-48ed-483c-9533-0c679076aaba_4032x3024.heic" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>Current London 2025</strong> happened last week in London, UK. I was fortunate enough to attend it, and I&#8217;d love to share some notes with you.</p><h3>Key Themes</h3><p>Of course, AI was everywhere: keynote, talks, and hallway chats. AI for coding, AI agents, AI for writing Flink jobs (!). It seems inevitable for any tech conference nowadays, regardless of its focus. To me, this is both exciting and scary; I&#8217;ll expand on this another day. </p><p>Iceberg, Tableflow, Delta Lake, and similar tech are still extremely relevant, but people seem to be more informed nowadays. The current version of Iceberg is not ideal for streaming, and it&#8217;s just terrible for changelog data streams. </p><p>Finally, I feel like the data streaming industry is still in a tough spot. 
The growth is slow, and the sales cycles are long. One person I spoke with said that &#8220;80% of the companies in the Expo hall will be dead in two years&#8221;. I don&#8217;t want to believe them, but it might be true. </p><h3>Keynote</h3><p>The keynote didn&#8217;t have any big, jaw-dropping announcements (I hope to see them in the next Current in New Orleans). </p><p>AI was obviously mentioned several times. Confluent&#8217;s messaging stays consistent over the years: AI needs data, specifically real-time data. But this year they introduced the idea of Flink jobs as AI agents! Coincidentally, at the same time, <a href="https://cwiki.apache.org/confluence/display/FLINK/FLIP-531%3A+Initiate+Flink+Agents+as+a+new+Sub-Project">FLIP-531: Initiate Flink Agents as a new Sub-Project</a> was introduced, and two of the three authors are Confluent employees. This tells me that Confluent is serious about the agents idea getting open-source adoption quickly.</p><p>One of the most interesting announcements for me was the snapshot queries. This is how I understood it:</p><ul><li><p>Confluent Flink SQL queries can operate in a &#8220;snapshot&#8221; / batch way: they don&#8217;t run continuously, but stop after getting initial (?) results.</p></li><li><p>If Tableflow is enabled for a source topic, the query leverages the underlying Iceberg storage first, and then switches to Kafka, if needed (?).</p></li></ul><p>This makes a lot of sense! I&#8217;ve been sharing this pattern for a while: Flink&#8217;s <a href="https://nightlies.apache.org/flink/flink-docs-release-1.20/docs/connectors/datastream/hybridsource/">HybridSource</a> makes it really straightforward to implement. 
It&#8217;s nice to see this as a fully managed product feature.</p><p>I also captured several great quotes during the keynote:</p><ul><li><p>&#8220;Apache Flink is the key to making shift left practical&#8221; (Shaun Clowes).</p></li><li><p>&#8220;With Flink, what was possible in the analytical estate, now is possible in the operational estate&#8221;  (Shaun Clowes).</p></li><li><p>&#8220;We don&#8217;t need ETL, we don&#8217;t need ELT&#8221;, in the context of Tableflow (Shaun Clowes).</p></li><li><p>&#8220;With Tableflow, your streams are tables&#8221; (Ahmed Saef Zamzam).</p></li><li><p>&#8220;Companies are becoming software&#8221; (Jay Kreps).</p></li></ul><h3>Talks</h3><p>Here are a few solid talks I had a chance to attend:</p><ul><li><p><strong>Flink Jobs as Agents &#129302; &#8211; Unlocking Agentic AI with Stream Processing</strong>. As I mentioned above, AI was everywhere. This talk showed a way to leverage Flink for building agents: using various sources and transformations to build context, performing actions using that context, and then consuming events emitted by those actions, which potentially affects the context. Steffen also mentioned using AI for creating Flink SQL pipelines (!) and Complex Event Processing (CEP). 
</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!65jG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d770596-eebc-4ab3-ae11-5303b6c68438_2500x1854.heic" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!65jG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d770596-eebc-4ab3-ae11-5303b6c68438_2500x1854.heic 424w, https://substackcdn.com/image/fetch/$s_!65jG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d770596-eebc-4ab3-ae11-5303b6c68438_2500x1854.heic 848w, https://substackcdn.com/image/fetch/$s_!65jG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d770596-eebc-4ab3-ae11-5303b6c68438_2500x1854.heic 1272w, https://substackcdn.com/image/fetch/$s_!65jG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d770596-eebc-4ab3-ae11-5303b6c68438_2500x1854.heic 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!65jG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d770596-eebc-4ab3-ae11-5303b6c68438_2500x1854.heic" width="1456" height="1080" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6d770596-eebc-4ab3-ae11-5303b6c68438_2500x1854.heic&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1080,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:758867,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/heic&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.streamingdata.tech/i/163988023?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d770596-eebc-4ab3-ae11-5303b6c68438_2500x1854.heic&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!65jG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d770596-eebc-4ab3-ae11-5303b6c68438_2500x1854.heic 424w, https://substackcdn.com/image/fetch/$s_!65jG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d770596-eebc-4ab3-ae11-5303b6c68438_2500x1854.heic 848w, https://substackcdn.com/image/fetch/$s_!65jG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d770596-eebc-4ab3-ae11-5303b6c68438_2500x1854.heic 1272w, https://substackcdn.com/image/fetch/$s_!65jG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d770596-eebc-4ab3-ae11-5303b6c68438_2500x1854.heic 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><ul><li><p><strong>Building Stream Processing Platform at OpenAI</strong>. Great coverage of OpenAI&#8217;s data platform and their challenges with PyFlink. OpenAI introduced proxies for consumers and producers, hiding cluster details<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>. This has a number of benefits, like straightforward HA, better scaling, etc. OpenAI heavily uses PyFlink; they shared their concerns about PyFlink&#8217;s efficiency and lack of some features. </p></li><li><p><strong>FlinkSQL Powered Asynchronous Data Processing in Pinterest&#8217;s Rule Engine Platform</strong>. Pinterest shared some insights about their rule engine, which helps fight spam (one of many use cases). 
It was interesting to see backfilling mentioned as a first-class citizen: this is a very mature and pragmatic decision; I&#8217;d like to see more data streaming projects acknowledging it. </p></li><li><p><strong>Unified CDC Ingestion and Processing with Apache Flink and Iceberg</strong>. How far can you go to engineer a feature that&#8217;s not supported out of the box? Medidata Solutions and Decodable gave a masterclass &#128578;. They had to engineer a sophisticated system to handle changelog data streams in Flink&#8217;s Iceberg integration, which doesn&#8217;t support them out of the box<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>. </p></li><li><p><strong>Simplifying Real-Time Vector Store Ingestion with Apache Flink</strong>. This was another masterclass, specifically on writing SQL UDFs. I loved the detailed code snippets shared by Hans-Peter: we need more hands-on talks like this one.</p></li><li><p><strong>Democratising Stream Processing: How Netflix Empowers Teams with Data Mesh and Streaming SQL</strong>. Netflix has been working on their internal data streaming platform for a while. It currently handles 14 trillion records a day (~160M/s on average), which is very impressive! Sujay talked about their latest initiative called &#8220;Data Mesh&#8221;<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>, which offers a high-level interface for defining streaming sources, sinks, and processors (using Flink SQL). I liked how they rigorously rely on schemas to prevent breaking changes. Also, Data Mesh has some really neat features: Iceberg Lookup join, query preview, revision history, and autoscaling. Unfortunately, they still haven&#8217;t figured out updates for stateful Flink SQL jobs. </p></li><li><p><strong>Flink SQL Revolutions: Breaking Out of the Matrix with PTFs</strong>. 
This talk is similar to previous talks by Timo about PTFs, but it has many neat examples! So, if you&#8217;re interested in learning about PTFs, I recommend this talk. At the end of it, I was convinced that when PTFs are mature and polished, the need to use the DataStream API will be significantly reduced.</p></li></ul><p>There were so many great talks I haven&#8217;t had a chance to watch! I&#8217;m eagerly waiting for the recordings of these:</p><ul><li><p>From Zero to Hero: petabyte-scale Tiered Storage lessons. </p></li><li><p>Building Stream Processing Platform at OpenAI.</p></li><li><p>Tableflow: Not Just Another Kafka-to-Iceberg Connector. </p></li><li><p>Queues for Kafka.</p></li><li><p>&#8230; and many more!</p></li></ul><h3>Kafka Summit &#8594; Current</h3><p>This was the first Current London. Last year, and for a few years before that, London hosted Kafka Summits. </p><p>When <a href="https://www.confluent.io/blog/introducing-current-next-generation-of-kafka-summit/">Confluent announced Current as the next generation of Kafka Summit</a>, they talked about &#8220;<strong>a place for everyone in the ecosystem to come together and share their knowledge and best practices</strong>&#8221;. Hilariously, the reality was quite the opposite. </p><p><a href="https://www.linkedin.com/posts/redpanda-data_currentlondon-current25-kafka-activity-7330614591606902785-61Lp/">Redpanda was banned from participating in the event</a>. 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bC5z!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6da4a2c7-f8e6-4ab7-ae77-cebd4c5e6671_1150x652.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bC5z!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6da4a2c7-f8e6-4ab7-ae77-cebd4c5e6671_1150x652.png 424w, https://substackcdn.com/image/fetch/$s_!bC5z!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6da4a2c7-f8e6-4ab7-ae77-cebd4c5e6671_1150x652.png 848w, https://substackcdn.com/image/fetch/$s_!bC5z!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6da4a2c7-f8e6-4ab7-ae77-cebd4c5e6671_1150x652.png 1272w, https://substackcdn.com/image/fetch/$s_!bC5z!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6da4a2c7-f8e6-4ab7-ae77-cebd4c5e6671_1150x652.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bC5z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6da4a2c7-f8e6-4ab7-ae77-cebd4c5e6671_1150x652.png" width="1150" height="652" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6da4a2c7-f8e6-4ab7-ae77-cebd4c5e6671_1150x652.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:652,&quot;width&quot;:1150,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:142325,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.streamingdata.tech/i/163988023?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6da4a2c7-f8e6-4ab7-ae77-cebd4c5e6671_1150x652.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!bC5z!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6da4a2c7-f8e6-4ab7-ae77-cebd4c5e6671_1150x652.png 424w, https://substackcdn.com/image/fetch/$s_!bC5z!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6da4a2c7-f8e6-4ab7-ae77-cebd4c5e6671_1150x652.png 848w, https://substackcdn.com/image/fetch/$s_!bC5z!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6da4a2c7-f8e6-4ab7-ae77-cebd4c5e6671_1150x652.png 1272w, https://substackcdn.com/image/fetch/$s_!bC5z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6da4a2c7-f8e6-4ab7-ae77-cebd4c5e6671_1150x652.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>I don&#8217;t know the details, but I got a confirmation from several Redpanda employees. </p><p>I guess this could&#8217;ve been one of the reasons to rebrand the <strong>community</strong> Kafka Summit conference, an event about the open-source Apache Kafka technology, into a <strong>vendor-specific</strong> event<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>. </p><p>Also, AWS didn&#8217;t have a booth at the conference despite being a gold sponsor (most of the other gold sponsors had presence). I can&#8217;t say why. 
AWS competes with Confluent through its own managed Kafka and Flink offerings.</p><p>All of this suggests that the competition in the data streaming space is only intensifying &#127871;.</p><div><hr></div><h3>Personal Announcement</h3><p>I&#8217;ve recently launched <strong><a href="https://irontools.dev/">Irontools</a></strong>: a suite of Apache Flink extensions to make your streaming pipelines faster, leaner, and more flexible. </p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>This is a well-known pattern. I believe Netflix started using it as early as 2017. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>IMO, implementing this as a feature in Iceberg would probably have been easier&#8230;</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Which has nothing to do with <a href="https://martinfowler.com/articles/data-mesh-principles.html">this Data Mesh</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>Yes, Confluent organized many Kafka Summits in the past. 
However, my point is that the <em>spirit</em> of those events was always community-first.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Announcing Irontools]]></title><description><![CDATA[Making Apache Flink more efficient and accessible]]></description><link>https://www.streamingdata.tech/p/announcing-irontools</link><guid isPermaLink="false">https://www.streamingdata.tech/p/announcing-irontools</guid><dc:creator><![CDATA[Yaroslav Tkachenko]]></dc:creator><pubDate>Wed, 07 May 2025 15:44:43 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/ce1eb654-cc5c-4231-abc0-dffd74d7877d_840x600.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I&#8217;ll be brief - I launched something! </p><p>Introducing <strong><a href="https://irontools.dev/">Irontools</a></strong>: a suite of Apache Flink extensions to make your streaming pipelines faster, leaner, and more flexible. Check out the full announcement <a href="https://irontools.dev/blog/announcement/">here</a>. </p><p>Irontools is my way of tackling some of the data streaming challenges I recently wrote about - from developer experience to runtime efficiency:</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;334b3c5b-1b45-4993-af9b-54c61e12aa6e&quot;,&quot;caption&quot;:&quot;I&#8217;ve been building data streaming systems for the past 8 years, and I feel like, as an industry, we haven&#8217;t made a huge amount of progress when it comes to data streaming and stream-processing adoption. Yes, Kafka 4.0 and Flink 2.0 are very different from the versions of Kafka and Flink from 8 years ago (in a really good way). 
Streaming databases are he&#8230;&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Modern Data Streaming Challenges: Part 1&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:5669707,&quot;name&quot;:&quot;Yaroslav Tkachenko&quot;,&quot;bio&quot;:&quot;Data Streaming Advocate.&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7327acf9-537c-46af-b976-cc28d07ae7e5_627x580.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-03-31T16:18:25.083Z&quot;,&quot;cover_image&quot;:null,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://www.streamingdata.tech/p/streaming-challenges-part-1&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:157562652,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:9,&quot;comment_count&quot;:2,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Data Streaming Journey&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7327acf9-537c-46af-b976-cc28d07ae7e5_627x580.jpeg&quot;,&quot;belowTheFold&quot;:false,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;13aee1f5-f99f-4b90-b550-b18951b7c974&quot;,&quot;caption&quot;:&quot;This is part 2 of the Modern Data Streaming Challenges series. 
Part 1 is available here.&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Modern Data Streaming Challenges: Part 2&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:5669707,&quot;name&quot;:&quot;Yaroslav Tkachenko&quot;,&quot;bio&quot;:&quot;Data Streaming Advocate.&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7327acf9-537c-46af-b976-cc28d07ae7e5_627x580.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2025-04-24T16:31:03.523Z&quot;,&quot;cover_image&quot;:null,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://www.streamingdata.tech/p/streaming-challenges-part-2&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:161636743,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:4,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Data Streaming Journey&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7327acf9-537c-46af-b976-cc28d07ae7e5_627x580.jpeg&quot;,&quot;belowTheFold&quot;:false,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div>]]></content:encoded></item><item><title><![CDATA[Modern Data Streaming Challenges: Part 2]]></title><description><![CDATA[Developer experience.]]></description><link>https://www.streamingdata.tech/p/streaming-challenges-part-2</link><guid isPermaLink="false">https://www.streamingdata.tech/p/streaming-challenges-part-2</guid><dc:creator><![CDATA[Yaroslav Tkachenko]]></dc:creator><pubDate>Thu, 24 Apr 2025 16:31:03 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!_lyL!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7327acf9-537c-46af-b976-cc28d07ae7e5_627x580.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This is part 2 of the Modern Data Streaming Challenges series. Part 1 is available <a href="https://www.streamingdata.tech/p/streaming-challenges-part-1">here</a>.</em></p><p>Today, I want to talk about developer experience.</p><h3>What is Developer Experience, Exactly? </h3><p>Ask ten different people; I bet you&#8217;d get ten different answers. In my opinion, Developer Experience (DX) includes everything that&#8217;s related to how you <em>interact</em> with a piece of technology: writing code, debugging, deploying, etc. You want to be able to iterate fast. Testing should be easy. Low coupling, high cohesion. And so on and so forth. Overall, you want the development process to be clear and efficient.</p><p>When talking about DX, some people focus on technicalities: you need to implement a CLI this way, local development should be done like this, or you need to be able to deploy that way, etc. Of course, a lot of these things are very important, but I think they&#8217;re somewhat secondary. In my opinion, everything originates from the <strong>programming language</strong> and <strong>programming model</strong>. Choosing the right language and model <em>makes it much easier</em> to have good DX.</p><h3>Programming Languages and Models</h3><p>I don&#8217;t want to give you a history lesson or theorize which language is supposedly the best for data streaming: history is history, and choosing the language is often subjective. However, what&#8217;s clear to me is that making it possible to author data streaming pipelines in <strong>more languages</strong> is going to be a huge win for the industry. 
So, whenever a company decides to adopt data streaming to support some new <s>AI</s> data use case, they don&#8217;t necessarily need to learn a new language, new tooling, or <em>new DX</em>!</p><p>I want to live in a world where the most demanding data streaming abstractions can be expressed in TypeScript. Or Ruby. Or whatever works for <em>you</em>. </p><p>Of course, you probably immediately hear an inner voice saying: but what about using the right tool for the right job? Sure, but how can we say something <em>is not right</em> in this case? What do you actually <em>need</em>?</p><p>In my opinion, you can identify two distinct levels of abstraction for any data streaming technology:</p><ul><li><p><strong>Dataflow definition</strong>: creating a graph of connected sources, transforms and sinks. This can be expressed quite easily in any high-level language, either with functional composition, OOP, or a Builder pattern. A DataFrame-like interface is quite popular in modern data processing frameworks.</p></li><li><p>Optionally, <strong>message processing</strong>: how each individual message should be transformed, enriched, etc. Marked as optional since it is not always needed (e.g. in the case of a simple count without any transformations). </p></li></ul><p>As you can see, there is nothing special about either of those. Any general high-level programming language should be capable of expressing both abstraction levels. </p><p>So, if this is so straightforward, why don&#8217;t we see more data streaming tech implemented in various languages? In my mind, it&#8217;s a combination of historical decisions (betting on Java and Python), architectural gravity (it&#8217;s hard to migrate away), and a lot of recent focus on SQL&#8230;</p><p>And just to clarify: there are definitely many new data streaming frameworks and stream-processing libraries written in Rust, Go, TypeScript, and some exotic languages. 
However, none of them is even close to getting the same level of adoption as Flink, Spark, or Kafka Streams. </p><p>In fact, you can even get by with YAML for the dataflow definition (see <a href="https://github.com/apache/flink-cdc">Flink CDC</a>, <a href="https://docs.redpanda.com/redpanda-connect/get-started/quickstarts/rpk/#run-a-pipeline">Redpanda Connect</a>). This is where the right <strong>programming model</strong> helps a lot: typically, in this case, you want a strict declarative language with no side effects. It&#8217;s funny how the slightest mismatch in the model perception makes things much worse: Spark and Flink both need to serialize method closures in order to send them over the network to workers / task managers; this breaks &#8220;this is just a declaratively defined dataflow definition&#8221; abstraction. So, naturally, it&#8217;s one of the first issues any engineer faces when they start using Spark or Flink.</p><p>Finally, I want to highlight that I&#8217;m not advocating for any specific language: I really want to avoid a situation like getting <a href="https://www.thoughtworks.com/radar/languages-and-frameworks/node-overload">Node overload</a>. The ability to choose is more important.</p><h3>Shifting Left to Make It Right</h3><p>The discussion about programming languages can&#8217;t be complete without the &#8220;shift left&#8221; trend. Historically, data teams have been treated as pretty much a separate organization: most data pipelines populated internal datasets not exposed to users, so application developers never cared about the data world. This led to database migrations and back-end changes routinely breaking data pipelines. </p><p>&#8220;Shift left&#8221; prescribes that data developers be more involved in the application domain, e.g. by partnering with application teams and crafting end-to-end data products. 
As usual, this is mostly a people problem, so easier said than done&#8230; But this is the best approach I can think of.</p><p>Perhaps surprisingly, in practice &#8220;shift left&#8221; works both ways. You also want your application developers to care about datasets and data pipelines. And this can be extremely challenging if all your data pipelines are implemented using a different language (and maybe even a completely different programming model). </p><p>The &#8220;AI&#8221; craziness that we experience right now will make this even more important: you can&#8217;t build a decent, personalized &#8220;AI&#8221; experience without having real-time data.</p><h3>About SQL</h3><p>I feel like I'm starting to sound repetitive, so bear with me. I love streaming databases and streaming SQL. It means that people without a software engineering background, but with some SQL knowledge, can build streaming workloads (at least in theory). <a href="https://materializedview.io/p/everything-to-know-incremental-view-maintenance">IVM databases have made a lot of progress</a>. <a href="https://www.timeplus.com/">TimePlus</a> is great at combining historical and real-time data. DeltaStream has announced support for <a href="https://www.deltastream.io/product/deltastream-fusion/">unified batch/streaming/real-time SQL</a>. <a href="https://www.arroyo.dev/">Arroyo got acquired</a>. </p><p>And yet, I don&#8217;t think we&#8217;ll ever be able to satisfy <em>all</em> data streaming use cases with SQL. It comes back to having the right programming model. And, sometimes, declarative dataflow is not expressive enough. Or precise enough to get the right semantics or the right performance optimizations. For example, no matter how smart and powerful your query optimizer is, sometimes you know exactly the level of parallelism a certain operator needs. Or that you really need to add an extra shuffle over there. It can also be hard to reuse, refactor and unit-test your SQL logic. 
dbt templating can only go so far.</p><p>User-defined functions (UDFs) help, but they only operate at the <em>message processing</em> level. And, unfortunately, they&#8217;re frequently a black box for the runtime.</p><p>And then, getting good DX for streaming SQL is <strong>hard</strong>! I gave <a href="https://www.confluent.io/events/current/2022/streaming-sql-for-data-engineers-the-next-big-thing/">this talk</a> a few years ago, thinking that I had figured out most of the challenges. But that was just the tip of the iceberg. There are still really hard, unsolved problems like state reuse and evolution in streaming queries.</p><p>SQL does try to evolve (e.g. polymorphic table functions look really cool), so, in a way, it feels like we&#8217;re just starting (even though the language is 40+ years old!). Some companies don&#8217;t want to wait and are extending streaming SQL with useful abstractions like changelogs right now<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>. </p><h3>Conclusion</h3><p>So, what am I trying to say? We need to build more! More declarative (SQL) and imperative approaches. More programming models and abstractions. More languages. Time will tell what works. </p><p>Here are a few things on my mind recently. </p><p><a href="https://blog.cloudflare.com/data-anywhere-events-pipelines-durable-execution-workflows/">Cloudflare Pipelines</a> made a lot of noise when it was announced, but it doesn&#8217;t look like it can support any advanced use cases (the documentation is just <em>sad</em> to look at). 
Integrating Arroyo actually feels like a step back: we want streaming JavaScript, not SQL!</p><p><a href="https://www.hytradboi.com/2025/020d7499-c6d0-471a-8346-1fa53ea17fa0-aqualang---a-streaming-dataflow-programming-language">AquaLang</a> is a great attempt to properly bridge the dataflow and message processing levels. I like this project a lot; it shows the kind of innovation we need.</p><p>Server-side <a href="https://webassembly.org/">WebAssembly</a> is finally reaching the point where it can be widely used.</p><p>And I still think about actors, especially with all the new durable execution tech:</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;bf39f27b-961a-45f7-8d9f-2542479a68dc&quot;,&quot;caption&quot;:&quot;Most of the modern data streaming frameworks focus on the idea of a dataflow (probably inspired by papers from Google). You define sources, transformations, and sinks. The system compiles them to a dataflow topology and runs it. You could even build a layer of relational algebra on top and end up with a &#8220;streaming database&#8221;. 
However, most of the streami&#8230;&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Actor-Oriented Data Streaming, Anyone?&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:5669707,&quot;name&quot;:&quot;Yaroslav Tkachenko&quot;,&quot;bio&quot;:&quot;Data Streaming Advocate.&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7327acf9-537c-46af-b976-cc28d07ae7e5_627x580.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-01-15T17:04:28.362Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/da4e8d64-b168-4760-9a97-3b221ccf75c7_1024x1024.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://www.streamingdata.tech/p/actor-oriented-data-streaming-anyone&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:140378685,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:2,&quot;comment_count&quot;:2,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Data Streaming Journey&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7327acf9-537c-46af-b976-cc28d07ae7e5_627x580.jpeg&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p><a href="https://docs.timeplus.com/proton-create-stream#changelog-stream">https://docs.timeplus.com/proton-create-stream#changelog-stream</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a 
id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p><a href="https://docs.deltastream.io/reference/sql-syntax/query/create-changelog-as">https://docs.deltastream.io/reference/sql-syntax/query/create-changelog-as</a></p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[Modern Data Streaming Challenges: Part 1]]></title><description><![CDATA[Efficiency.]]></description><link>https://www.streamingdata.tech/p/streaming-challenges-part-1</link><guid isPermaLink="false">https://www.streamingdata.tech/p/streaming-challenges-part-1</guid><dc:creator><![CDATA[Yaroslav Tkachenko]]></dc:creator><pubDate>Mon, 31 Mar 2025 16:18:25 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!_lyL!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7327acf9-537c-46af-b976-cc28d07ae7e5_627x580.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I&#8217;ve been building data streaming systems for the past 8 years, and I feel like, as an industry, we haven&#8217;t made a huge amount of progress when it comes to data streaming and stream-processing adoption. Yes, Kafka 4.0 and Flink 2.0 are very different from the versions of Kafka and Flink from 8 years ago (in a really good way). Streaming databases are here, and streaming SQL is a thing. ML / &#8220;AI&#8221; requirements made near-realtime data pipelines very important. </p><p>And yet, data streaming is struggling. I know many startups in the space have a hard time attracting customers. Confluent&#8217;s growth is steady but not that impressive. </p><p>Why? </p><p>I don&#8217;t have a simple answer. But I&#8217;m trying to come up with a few themes explaining the challenges of modern data streaming. 
I&#8217;m going to use Apache Flink for my examples since it&#8217;s the most popular stream-processing technology out there.</p><p>Today, I want to talk about efficiency. </p><h3>Large-Scale Systems Are Not Necessarily Efficient</h3><p>Many large-scale distributed systems like Hadoop, Spark, Kafka, Flink, etc. are very impressive in many dimensions, but they <strong>were built to solve the scalability problem first</strong>. They&#8217;re not necessarily efficient. You should really understand the difference. For example, just because Flink can scale to thousands of task slots doesn&#8217;t mean that each task slot will perform the most efficient computation. An efficient system is actually much more challenging to build. </p><p>Here are just a few examples from Flink&#8217;s Slack and mailing list. One person wrote:</p><blockquote><p>I wanted to discuss our current system setup and see if Apache Flink would be a good fit to replace one of our consumer groups. Right now, we have two Kafka consumer groups: the raw-consumer group, which is scalable and handles the input from vehicles, and the parsed-consumer group, which handles the post-processing tasks like decoding and filtering. The issue we&#8217;re facing is that while the raw-consumer group can scale effectively with the number of vehicles, the parsed-consumer group starts to slow down when we increase the load, especially as the tasks it performs are quite <strong>CPU-intensive</strong>.</p><p>Currently, we&#8217;re running the system with 20 vehicles, and here&#8217;s the resource breakdown: ... However, when scaling to 10,000 vehicles, we&#8217;re expecting a significant increase in resource usage. &#8230;</p><p>Given this, we&#8217;re wondering if replacing the parsed-consumer group with Apache Flink would be a good solution. <strong>Flink is known for handling heavy processing tasks like decoding and filtering efficiently</strong>, and we believe it could scale better to meet the needs of our growing vehicle fleet. 
Would Flink be able to handle this load more effectively than our current parsed-consumer group? And if we were to make this transition, would it provide the scalability we need without running into performance issues or excessive resource consumption?</p></blockquote><p>The person asking this question has a CPU-bound Kafka Consumer with straightforward <em>stateless</em> logic. But for some reason, they think that Flink could be more efficient for this workload. This doesn&#8217;t make sense - a Flink job will need to do <em>at least</em> the same amount of work by leveraging <em>the same</em> Kafka Consumer logic. In practice, it&#8217;ll do much more: checkpointing, maybe additional serialization/deserialization, shuffling, etc. </p><p>Flink excels at <em>stateful</em> computations like joins and aggregation, but if you have a simple stateless Kafka Consumer that&#8217;s CPU bound, you should focus on profiling it, not rewriting it to a Flink job. </p><p>Here&#8217;s another one:</p><blockquote><p>I have a job entirely written in Flink SQL. The first part of the program processes 10 input topics and generates one output topic with normalized messages and some filtering applied (really easy, some where by fields and substring). Nine of the topics produce between hundreds and thousands of messages per second, with an average of 4&#8211;10 partitions each. The other topic produces 150K messages per second and has 500 partitions. They are unioned to the output topic. </p><p>The average output rate needed to avoid lag after filtering messages should be around 60K messages per second. I&#8217;ve been testing different configurations of parallelism, slots and pods (everything runs on Kubernetes), but I&#8217;m far from achieving those numbers. </p><p>In the latest configuration, I used 20 pods, a parallelism of 120, with 4 slots per taskmanager. 
With this setup, I achieve approximately 20K messages per second, but I&#8217;m unable to consume the largest topic at the rate messages are being produced. Additionally, <strong>setting parallelism to 120 creates hundreds of subtasks for the smaller topics, which don&#8217;t do much but still consume minimal resources even if idle</strong>. </p><p>&#8230;</p><p><strong>When I check the use of cpu and memory to the pods and don't see any problem and they are far from the limit</strong>, each taskmanager has 4gb and 2cpus and they are never close to using the CPU.</p><p>&#8230;. How can I improve the throughput rate? Should I be concerned about the hundreds of subtasks created for the smaller topics?</p></blockquote><p>The author later mentions that they perform <em>a join</em>, which probably (at least partially) explains the performance degradation. Unfortunately, it&#8217;s very easy to write a poorly performing join in Flink, especially in Flink SQL. </p><p>However, it&#8217;s pretty interesting to highlight another inefficiency: because Flink doesn&#8217;t allow fine-grained parallelism tuning for Kafka sources in Flink SQL<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>, the author had to apply unnecessarily high levels of parallelism to <em>every</em> source, wasting a lot of resources. </p><p>So, repeat after me: scalability != efficiency. My advice here is to try building good mental models for the technologies that you use and choosing the right tool for the job.</p><h3>Efficiency == &#128176;&#128176;&#128176;</h3><p>You may say: but why do you care about efficiency so much? </p><p>Well, efficiency translates to cost savings. 
From <a href="https://engineering.fb.com/2025/01/21/production-engineering/strobelight-a-profiling-service-built-on-open-source-technology/">Meta&#8217;s blog</a>:</p><blockquote><p>So, the engineer typed an &#8220;&amp;&#8221; after the auto keyword to indicate we want a reference instead of a copy. It was a one-character commit, which, after it was shipped to production, equated to an estimated 15,000 servers in capacity savings per year!</p></blockquote><p>Not bad for a <em>slightly</em> more efficient system? </p><p>Or look at <a href="https://github.com/apache/datafusion-comet">Apache DataFusion Comet</a>. It reimplements a bunch of Spark operators in a more efficient language (Rust) and runtime (Arrow/DataFusion) with a ~2x overall speedup, which could mean a ~2x lower bill. And it&#8217;s a drop-in replacement!</p><p>Cost efficiency is especially important in the post-ZIRP era and for teams with reduced headcounts. </p><h3>But Is It Really a Big Deal?</h3><p>Cost savings are nice, sure, but are they really a deal-breaker? </p><p>OK, not necessarily, I agree. Even without the cost aspect, inefficiency is just one of the things that contribute to the overall problem. </p><p>However, inefficiencies can be multiplicative. </p><p>Having a single poorly performing join might be fine, but it&#8217;s hard to deal with eight of them. A slow Kafka Avro source could be OK to use, but if you need to regularly reprocess petabytes of data, it could even affect the way you design your overall architecture. </p><p>Another way to look at this: a vendor that sells a somewhat inefficient product needs to adjust its margins accordingly, which affects the final cost and experience. This gets passed on to you. If you do the same for your customers, the effect multiplies. </p><p>By the way, I also think it&#8217;s the reason why the Lambda architecture is still alive. 
It&#8217;s still much more efficient to process a LakeHouse table / a bunch of Parquet files than the same amount of data sitting in a Kafka topic. </p><h3>Learn From Databases</h3><p>I hope you&#8217;re convinced by now. What can we do? </p><p>I think the solutions are out there. &#8220;Just&#8221; copy the ideas from the database research in the past 5-10 years. </p><ul><li><p>Storage / compute separation is an example that everyone understands. Tiered storage in Kafka, disaggregated state store in Flink. The idea initially appeared in the database world. </p></li><li><p>Another idea is applying query engine optimizations like predicate pushdown. This makes a massive impact on query performance and efficiency. Unfortunately, standard Kafka storage is not designed to support this. That&#8217;s why I&#8217;m very bullish on LakeHouse support in Kafka like <a href="https://www.confluent.io/product/tableflow/">Confluent Tableflow</a> and <a href="https://docs.redpanda.com/current/manage/iceberg/topic-iceberg-integration/">Iceberg Topics in Redpanda</a>. Combine that with a <a href="https://nightlies.apache.org/flink/flink-docs-release-2.0/docs/connectors/datastream/hybridsource/">Hybrid Source</a> in Flink, and you get a very efficient engine. </p></li><li><p>A more interesting one is columnar data processing and <a href="https://15721.courses.cs.cmu.edu/spring2024/papers/06-vectorization/p1493-polychroniou.pdf">vectorization</a>. I keep linking to <a href="https://www.arroyo.dev/blog/arrow-migration">this blog</a> from Arroyo because it&#8217;s so good. Alibaba is working on a lot of innovations here with <a href="https://github.com/alibaba/fluss">Fluss</a> (columnar streaming storage) and <a href="https://www.alibabacloud.com/blog/flash-a-next-gen-vectorized-stream-processing-engine-compatible-with-apache-flink_602088">Flash</a> (vectorized Flink-compatible engine). 
</p></li><li><p>Finally, <a href="https://15721.courses.cs.cmu.edu/spring2024/papers/07-compilation/p539-neumann.pdf">code specialization / compilation</a>. Frequently used to optimize queries, but why not apply it more widely? We deal with a lot of schemas, so let&#8217;s leverage that!</p></li></ul><p>Another thing worth mentioning is <a href="https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=298781093">Polymorphic Table Functions</a> (PTFs) in Flink. A declarative language like SQL can make writing some simple things very hard (or inefficient). PTFs allow you to stay in the SQL ecosystem but create new, highly customizable operators.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Actually, it looks like the support for specifying individual Kafka source parallelism in Flink SQL <a href="https://github.com/apache/flink-connector-kafka/commit/a52f15a958e8f1a040e9df899f47552181c0398c">was merged</a> a month ago! 
Still not released, though.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Exploring Apache DataFusion as a Foundation for Streaming Framework]]></title><description><![CDATA[I&#8217;ve been working on a new project that heavily uses Apache DataFusion in a streaming fashion, and I&#8217;d love to share some discoveries.]]></description><link>https://www.streamingdata.tech/p/exploring-apache-datafusion-streaming-framework</link><guid isPermaLink="false">https://www.streamingdata.tech/p/exploring-apache-datafusion-streaming-framework</guid><dc:creator><![CDATA[Yaroslav Tkachenko]]></dc:creator><pubDate>Mon, 10 Feb 2025 16:57:53 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/22480225-6681-4cd6-aa74-153c77f571e5_840x600.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Please <a href="https://streamingdata.substack.com/subscribe">pledge your support</a> if you find this newsletter useful. I&#8217;m not planning to introduce paid-only posts anytime soon, but I&#8217;d appreciate some support from the readers. Thank you!</em></p><div><hr></div><h3>Rust, Arrow, and DataFusion</h3><p>Over the past few years, &#8220;<a href="https://github.com/rewrite-bigdata-in-rust/RBIR">Rewrite Bigdata in Rust</a>&#8221; has become a real movement. The idea is that Rust can bring modern systems-level performance to data processing without the usual trade-offs. One of the key enablers in this space is <strong>Apache Arrow</strong>, an in-memory columnar format that has rapidly become the go-to for high-performance analytics.</p><p><strong>Apache DataFusion</strong> is a query engine written in Rust that heavily leverages Arrow. It&#8217;s primarily used for building databases and query engines, but there&#8217;s also plenty of excitement in the community around real-time data processing. DataFusion was designed to be very extensible. 
You can add your own connectors, formats, and operators.</p><p>I&#8217;ve been working on a new stream-processing framework on top of DataFusion for the past few months<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>. If you may remember, I shared some thoughts on what a new stream-processing framework may look like:</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;da6f6314-e166-4525-88df-20276c152719&quot;,&quot;caption&quot;:&quot;It would probably be implemented in Rust &#128578; And it would probably leverage the Apache Arrow DataFusion ecosystem (but it looks like the streaming support is not very mature).&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;If I Were to Create a New Stream Processing Framework Today...&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:5669707,&quot;name&quot;:&quot;Yaroslav Tkachenko&quot;,&quot;bio&quot;:&quot;Data Streaming Advocate.&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7327acf9-537c-46af-b976-cc28d07ae7e5_627x580.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-03-11T16:28:24.065Z&quot;,&quot;cover_image&quot;:null,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://www.streamingdata.tech/p/new-stream-processing-framework-today&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:142502326,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:8,&quot;comment_count&quot;:4,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Data Streaming 
Journey&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7327acf9-537c-46af-b976-cc28d07ae7e5_627x580.jpeg&quot;,&quot;belowTheFold&quot;:false,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p>This post looks at DataFusion and tries to answer the following question: Can it serve as a solid foundation for a streaming framework? Or, put differently, how close does it come to offering what systems like Apache Flink provide out of the box?</p><h3>What Does a Streaming Framework Need?</h3><p>A reliable and performant stream-processing framework does more than just shuffle Kafka records around in real time. It typically addresses several key areas:</p><p>First, it needs an internal <strong>data format</strong> that can efficiently store different data types. Historically, streaming frameworks used row-based formats, but lately, I have seen more adoption of columnar formats.  </p><p>Second, it provides an <strong>execution runtime</strong> that can handle unbounded data. This means the runtime must keep up with a continuous flow of records. It usually operates in a distributed fashion, so you can scale horizontally across many nodes. The runtime also tends to include backpressure mechanisms so that if downstream operators get overwhelmed, upstream producers can slow down or queue data more gracefully.</p><p>Third, <strong>connectors</strong> are essential to bring data in from external sources - like Kafka, Pulsar, Redpanda, or various databases over CDC - and send processed results to sinks such as data lakes and OLAP databases. Connectors also need to handle the offset or checkpoint mechanism so that if a job crashes, it doesn&#8217;t start from the very beginning unless explicitly told to do so.</p><p>Another area is <strong>fault tolerance</strong>. 
Continuous data processing systems are expected to run for days, weeks, or even months without interruption. Yet all things fail, patches and updates need to happen, etc. A typical approach is <a href="https://en.wikipedia.org/wiki/Chandy%E2%80%93Lamport_algorithm">Chandy&#8211;Lamport checkpointing</a>, which involves periodically saving the progress of the streaming pipeline so it can be reconstructed after a restart. </p><p>Finally, <strong>state management</strong> and <strong>time-based semantics</strong> are required to support stateful transformations like joins and aggregations. I previously wrote about it here:</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;3a91aeaa-2bba-46ef-af3e-b3f278675bb6&quot;,&quot;caption&quot;:&quot;Stateful stream processing is a complicated topic. Windowing. Streaming joins. Late-arriving data and watermarks. Sessionization. Streaming analytics&#8230; It&#8217;s hard to say where to start.&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;State and Timers&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:5669707,&quot;name&quot;:&quot;Yaroslav Tkachenko&quot;,&quot;bio&quot;:&quot;Data Streaming 
Advocate.&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7327acf9-537c-46af-b976-cc28d07ae7e5_627x580.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2023-09-18T15:37:58.718Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F761d50c8-d9ea-4381-ad91-a7b7d62fee4f_1750x496.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://www.streamingdata.tech/p/state-and-timers&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:136475557,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:2,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Data Streaming Journey&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7327acf9-537c-46af-b976-cc28d07ae7e5_627x580.jpeg&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><h3>Prior Art</h3><p>I know about a few projects that build stream-processing frameworks on top of DataFusion:</p><ul><li><p><a href="https://www.arroyo.dev/">Arroyo</a>. It is one of the earliest projects in the space (btw, Arroyo&#8217;s blog is top-notch, and I highly recommend following it). However, as far as I know, it mostly uses DataFusion for SQL parsing and generating logical plans. It does use Arrow as a data format. The execution runtime is custom-built, as well as connectors. </p></li><li><p><a href="https://www.synnada.ai/">Synnada</a>. I don&#8217;t think any of their work on top of DataFusion is open-sourced. 
They do give a lot of great talks, especially in the context of building a <em>unified</em> engine, not just a stream-processing one.</p></li><li><p><a href="https://www.denormalized.io/">Denormalized</a>. The latest addition. Denormalized tries to be a &#8220;DuckDB for streaming&#8221; by focusing on single-node execution. </p></li></ul><h3>Data Format and Execution Runtime</h3><p>Apache DataFusion operates by processing Arrow batches of records (<code>RecordBatch</code>) through a pipeline of operators.</p><p>Arrow is a <em>columnar</em> data format. This means that your intermediate data is stored as a set of arrays. Each <code>RecordBatch</code> holds up to 8192 records by default. Some purists may ask: isn&#8217;t that an indicator of a micro-batch system? Well, I don&#8217;t think so. Even in the case of Apache Flink, data is always batched at many levels. Processing semantics are more important in this case. </p><p>Does batching increase latency? Well, it may, but there are ways to work around it. Micah Wylde, CEO and Founder of Arroyo, explained it really well <a href="https://www.arroyo.dev/blog/why-arrow-and-datafusion#streaming-on-columns">here</a>.</p><p>DataFusion is designed as a pull-based engine. Conceptually, this means that each operator runs a tight loop that pulls data from the upstream sources. In practice, DataFusion uses <a href="https://tokio.rs/tokio/tutorial/streams">Tokio Streams</a>. I want to highlight two observations:</p><ul><li><p>Tokio Stream (kinda like an iterator of Futures) is the primary abstraction, even when it comes to bounded sources (e.g. reading a bunch of Parquet files).</p></li><li><p>Pull-based execution doesn&#8217;t offer much control over backpressure. This makes it very different from Apache Flink, which can offer reliable backpressure, fine-grained flow control and adaptive buffers between operators. 
These things are not as important in the context of a query engine (whose goal is to read a bunch of files as fast as possible), but they do matter a lot for a streaming engine. </p></li></ul><p>Anyway, having the Stream abstraction as a default way of dealing with data seems like a huge help when building a stream-processing engine! </p><p>Does it mean that DataFusion&#8217;s runtime can be easily used as a streaming runtime out of the box? Almost, but not quite. Each operator in DataFusion defines its execution mode: <strong>Bounded</strong>, <strong>Unbounded</strong> or <strong>PipelineBreaking</strong>. The first two should be pretty straightforward to understand: use Unbounded operators (and build your custom ones as Unbounded) and avoid Bounded operators in the streaming context. As far as I can see, Bounded is mostly used by the source operators that scan files (also things like the LIMIT clause, EXPLAIN command, etc.)</p><p>The PipelineBreaking operators are more complicated. Here&#8217;s the description of this execution mode:</p><blockquote><pre><code><em>Some of the operator's input stream(s) are unbounded, but the operator
cannot generate streaming results from these streaming inputs.

In this case, the execution mode will be pipeline breaking, e.g. the
operator requires unbounded memory to generate results. This
information is used by the planner when performing sanity checks
on plans processings unbounded data sources.</em></code></pre></blockquote><p>As you probably guessed, this means that some types of joins and aggregations are PipelineBreaking operators, which means that they can&#8217;t be used in the streaming context!</p><p>The only workaround in this case is implementing your own operator. This is what Denormalized has done to support windowed aggregations. </p><p>DataFusion operators also support partitioning. By default, since DataFusion is a single-node engine, it uses the number of CPU cores as a number of partitions. It&#8217;s possible to define the number of partitions and the partitioning strategy in your custom operators.</p><h3>Connectors: Building Your Own</h3><p>For a streaming framework to be truly useful, it has to integrate with real data sources. DataFusion doesn&#8217;t provide an extensive set of built-in connectors at the moment, mostly because it was never designed specifically for data integration use cases. Out of the box, you can read different types of files (locally or from the object storage). There is no standard Kafka connector.</p><p><a href="https://github.com/datafusion-contrib/datafusion-table-providers">DataFusion Table Providers</a> is an excellent source of community-provided connectors. However, some of those connectors were not designed to be executed in the streaming environments, so some tweaks may be needed.</p><p>You can create a custom connector by implementing a <code>TableProvider</code> trait. Check <a href="https://datafusion.apache.org/library-user-guide/custom-table-providers.html">this</a> handy guide. I&#8217;ve had a chance to build both source and sink connectors, and I&#8217;d say that the experience is quite nice<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>. 
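</p><p>To make the offset-handling point concrete, here is a minimal sketch in plain std Rust. To be clear, this is <em>not</em> DataFusion&#8217;s actual <code>TableProvider</code> API - the type and method names are made up for illustration. It shows the core contract a source connector has to honour: hand out batches, and track a committed offset so a restarted job resumes instead of re-reading everything.</p>

```rust
// Illustrative sketch only - not DataFusion's actual `TableProvider` API.
// A source hands out batches of records and tracks a committed offset,
// so a restarted job resumes from the last commit instead of re-reading
// the stream from the beginning.
struct LogSource {
    records: Vec<String>, // stands in for a Kafka topic partition
    committed_offset: usize,
}

impl LogSource {
    fn new(records: Vec<String>) -> Self {
        Self { records, committed_offset: 0 }
    }

    /// Pull the next batch of up to `batch_size` records,
    /// starting from the committed offset.
    fn poll_batch(&self, batch_size: usize) -> (Vec<&str>, usize) {
        let end = (self.committed_offset + batch_size).min(self.records.len());
        let batch = self.records[self.committed_offset..end]
            .iter()
            .map(|s| s.as_str())
            .collect();
        (batch, end)
    }

    /// Commit only after the downstream has durably processed the batch:
    /// committing too early loses data on a crash, too late duplicates it.
    fn commit(&mut self, offset: usize) {
        self.committed_offset = offset;
    }
}

fn main() {
    let mut source = LogSource::new((0..10).map(|i| format!("record-{i}")).collect());
    let (batch, next) = source.poll_batch(4);
    assert_eq!(batch, ["record-0", "record-1", "record-2", "record-3"]);
    source.commit(next);
    // After a simulated restart, consumption resumes at offset 4, not 0.
    let (batch, _) = source.poll_batch(4);
    assert_eq!(batch[0], "record-4");
}
```

<p>Deciding <em>when</em> to call <code>commit</code> relative to the sink write is precisely where failure-recovery semantics (at-least-once vs. exactly-once) come from. 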
</p><p>Overall, it doesn&#8217;t seem too hard to write a Rust connector for common platforms like Kafka, but it becomes trickier when you factor in the need for failure recovery. </p><h3>Checkpointing and Fault Tolerance</h3><p>Checkpointing is one of the biggest differences between a &#8220;batch query engine&#8221; and a genuine streaming framework. Systems like Apache Flink store a snapshot of the entire pipeline state at regular intervals so they can resume from the last good snapshot if something goes wrong. DataFusion just doesn't have anything like that. </p><p><strong>I&#8217;d say that correctly implementing checkpointing and failure recovery would probably be the hardest part of implementing an MVP of a streaming engine on top of DataFusion.</strong> </p><p>I&#8217;d encourage you to check the checkpointing implementations in Arroyo and Denormalized.</p><h3>Scaling Out Beyond a Single Node</h3><p>DataFusion runs on a single node by default. While that might be sufficient for some workloads, many streaming use cases can easily exceed the capacity of one machine. Two main paths come to mind: </p><ul><li><p>You can run multiple instances of DataFusion in a Kafka Streams&#8211;style model, where a consumer group is used to coordinate partition assignment across different instances. Of course, this means you can only support Kafka-compatible sources.</p></li><li><p>You could look at distributed frameworks like <a href="https://datafusion.apache.org/">Ballista</a> (part of the DataFusion ecosystem) or general-purpose engines like <a href="https://www.ray.io/">Ray</a> for parallel execution across many nodes. Both Ballista and Ray have been getting a lot of attention recently from the folks wanting to run DataFusion at scale. Anyway, this option means your system now needs to support shuffles, which brings a lot of complexity. 
I haven&#8217;t had a chance to explore these tools yet, but the first thing that comes to mind when evaluating these options is support for streaming execution. Just having a shuffle operator is not enough, you need a <em>streaming</em> shuffle.</p></li></ul><h3>No Typical Stateful Streaming Features</h3><p>It looks like DataFusion doesn&#8217;t have built-in operators that provide arbitrary state access (something similar to ValueState in Flink). Creating such functionality feels straightforward - &#8220;just&#8221; add a wrapper for RocksDB or <a href="https://slatedb.io/">SlateDB</a>. However, again, I think a lot of complexity around checkpointing and failure recovery will need to be addressed. </p><p>Many streaming scenarios revolve around time-based aggregations and handling out-of-order events with watermarks. DataFusion currently doesn&#8217;t support these concepts at all (there is no need for this in batch execution). <a href="https://docs.rs/tokio/latest/tokio/time/index.html">Tokio timers</a> would probably be a way to implement windowing. </p><h3>Conclusion</h3><p>Building on top of DataFusion is not for the faint of heart. There is very little documentation, and most of the time, you end up reading the source to understand what&#8217;s going on. Thankfully, the source code is very friendly! </p><p>If you&#8217;re serious about building a production-grade stream-processing framework, you&#8217;ll need to invest a lot: </p><ul><li><p>make sure the operators you want to support can be executed in the streaming environment. </p></li><li><p>have some form of failure recovery.</p></li><li><p>build source and sink connectors. </p></li><li><p>find a way to scale out (either via something like Kafka consumer groups or by adopting Ballista/Ray).</p></li><li><p>(if you need stateful streaming): introduce key-value store and windowing abstractions. </p></li></ul><p>Sounds like a lot of work! &#128578; However, building an engine from scratch is way more work. 
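</p><p>As one illustration of the state-access gap mentioned above, here is a hedged, std-only sketch of what a ValueState-style keyed store could look like. A plain <code>HashMap</code> stands in for RocksDB/SlateDB, and the hard part - snapshotting the state during checkpoints and restoring it on recovery - is deliberately left out.</p>

```rust
use std::collections::HashMap;

// Sketch of a ValueState-style keyed store. The in-memory HashMap stands
// in for RocksDB/SlateDB; a real engine would also have to snapshot this
// state during checkpoints and restore it on recovery, which is where
// most of the complexity lives.
struct KeyedValueState<V> {
    state: HashMap<String, V>,
}

impl<V: Clone> KeyedValueState<V> {
    fn new() -> Self {
        Self { state: HashMap::new() }
    }

    /// Read the current value for a key, if any (like Flink's ValueState#value).
    fn value(&self, key: &str) -> Option<V> {
        self.state.get(key).cloned()
    }

    /// Overwrite the value for a key (like Flink's ValueState#update).
    fn update(&mut self, key: &str, value: V) {
        self.state.insert(key.to_string(), value);
    }
}

fn main() {
    // Counting events per key: the "hello world" of stateful streaming.
    let mut counts: KeyedValueState<u64> = KeyedValueState::new();
    for key in ["alice", "bob", "alice"] {
        let next = counts.value(key).unwrap_or(0) + 1;
        counts.update(key, next);
    }
    assert_eq!(counts.value("alice"), Some(2));
    assert_eq!(counts.value("bob"), Some(1));
}
```

<p>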
DataFusion is an excellent foundation, and it&#8217;s highly customizable - you can change almost any aspect of the system if you don&#8217;t like it. <a href="https://datafusion.apache.org/user-guide/introduction.html#known-users">The list of known users</a> is very impressive!</p><p>Also, the &#8220;Rewrite Big Data in Rust&#8221; movement has momentum, and plenty of folks are intrigued by the idea of a next-generation streaming engine that combines safety, speed, and a modern language ecosystem. Watch this space. </p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>It&#8217;s part of a consulting project I&#8217;m working on for a client. There is no appetite for open-sourcing it anytime soon.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Unless you need to deal with Avro. Unfortunately, its support is still not great.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Fluss: First Impression]]></title><description><![CDATA[Table is a new stream.]]></description><link>https://www.streamingdata.tech/p/fluss-first-impression</link><guid isPermaLink="false">https://www.streamingdata.tech/p/fluss-first-impression</guid><dc:creator><![CDATA[Yaroslav Tkachenko]]></dc:creator><pubDate>Thu, 05 Dec 2024 21:32:30 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!d-E7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c06f48d-295f-45eb-bc3c-4dfcd7324e64_6714x4904.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Please <a href="https://streamingdata.substack.com/subscribe">pledge your support</a> if you find this newsletter useful. 
I&#8217;m not planning to introduce paid-only posts anytime soon, but I&#8217;d appreciate some support from the readers. Thank you!</em></p><div><hr></div><h3>Intro</h3><p><a href="https://github.com/alibaba/fluss">Fluss</a> is a new streaming platform from Alibaba that was open-sourced last week. Check <a href="https://www.ververica.com/blog/fluss-is-now-open-source">the announcement post</a>.</p><p>It&#8217;s quite similar to other streaming platforms like Apache Kafka, Apache Pulsar, Redpanda, etc., but also very different in some aspects (and has some truly unique features). It&#8217;s designed to be tightly integrated with LakeHouses like Apache Iceberg and Apache Paimon.</p><p>Currently, Fluss doesn&#8217;t implement Kafka protocol, but it&#8217;s on the roadmap. It makes it difficult to evaluate properly - right now, you have to use Apache Flink and/or Apache Paimon to interact with it. However, it&#8217;s modelled similarly: <em>tables</em> instead of topics, <em>buckets</em> instead of partitions, etc.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!d-E7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c06f48d-295f-45eb-bc3c-4dfcd7324e64_6714x4904.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!d-E7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c06f48d-295f-45eb-bc3c-4dfcd7324e64_6714x4904.png 424w, https://substackcdn.com/image/fetch/$s_!d-E7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c06f48d-295f-45eb-bc3c-4dfcd7324e64_6714x4904.png 848w, 
https://substackcdn.com/image/fetch/$s_!d-E7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c06f48d-295f-45eb-bc3c-4dfcd7324e64_6714x4904.png 1272w, https://substackcdn.com/image/fetch/$s_!d-E7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c06f48d-295f-45eb-bc3c-4dfcd7324e64_6714x4904.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!d-E7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c06f48d-295f-45eb-bc3c-4dfcd7324e64_6714x4904.png" width="1456" height="1063" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7c06f48d-295f-45eb-bc3c-4dfcd7324e64_6714x4904.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1063,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1377946,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!d-E7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c06f48d-295f-45eb-bc3c-4dfcd7324e64_6714x4904.png 424w, https://substackcdn.com/image/fetch/$s_!d-E7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c06f48d-295f-45eb-bc3c-4dfcd7324e64_6714x4904.png 848w, 
https://substackcdn.com/image/fetch/$s_!d-E7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c06f48d-295f-45eb-bc3c-4dfcd7324e64_6714x4904.png 1272w, https://substackcdn.com/image/fetch/$s_!d-E7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c06f48d-295f-45eb-bc3c-4dfcd7324e64_6714x4904.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption"><a href="https://alibaba.github.io/fluss-docs/docs/concepts/architecture/">Fluss 
Architecture</a></figcaption></figure></div><h3>Table as a primary concept</h3><p>Something that jumps out immediately when you check the documentation is that Fluss organizes data streams as <strong>schematized tables</strong>, not topics. It&#8217;s quite common to use Avro or Protobuf with a schema registry when using Kafka. However, it&#8217;s never required: to Kafka, every record value is just an array of bytes, which makes it possible to deal with semistructured or unstructured data. </p><p>Fluss <strong>demands</strong> a schema to be defined before you can write anything. I think it&#8217;ll make it much harder to use for semistructured data or data streams with schemas that need to evolve a lot. </p><p>Putting tables at the front also indicates that Fluss tries to be more of a database than a streaming platform. It makes a lot of sense given the advertised use cases: fast ingestion for real-time data and union read with LakeHouses (more on this below) for real-time analytics.</p><h3>PrimaryKey Table</h3><p>Fluss supports regular append-only <a href="https://alibaba.github.io/fluss-docs/docs/table-design/table-types/log-table/">Log Tables</a>, as well as <a href="https://alibaba.github.io/fluss-docs/docs/table-design/table-types/pk-table/">PrimaryKey Tables</a>. PrimaryKey Tables seem to be the most impressive feature of Fluss at first glance. Lookup queries are Kafka&#8217;s bane: it&#8217;s very common to have a need to find a very specific subset of messages in a topic, and the only way to do it in almost any Kafka-compatible platform is to scan the whole topic from the beginning&#8230; which can take <em>hours</em>. Tiered storage can improve this, but not significantly. </p><p>PrimaryKey Tables are backed by RocksDB to support updates, deletes and <strong>efficient lookup queries</strong>! So, it&#8217;s possible to execute something like &#8220;SELECT * FROM users_table WHERE user_id = 123&#8221; in Flink SQL and get the results back reasonably fast. 
</p><p>These tables also support changelog semantics and partial updates.</p><p>There is no free lunch, and I think that the performance of these tables will be an order of magnitude lower compared to Log Tables.  </p><h3>More unification</h3><p>I wrote this post last year:</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;85f8f382-2e0e-490d-bb20-9761f976aeea&quot;,&quot;caption&quot;:&quot;Spoiler alert: I&#8217;m not a psychic and don&#8217;t have a time machine, so I can&#8217;t really say what the data platforms will look like in 5-7 years. But I&#8217;ve been noticing some trends that are hard to ignore, so I wanted to make a prediction or two. Maybe there is also a bit of a wish list from me here &#128578;&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Data Platforms in 2030&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:5669707,&quot;name&quot;:&quot;Yaroslav Tkachenko&quot;,&quot;bio&quot;:&quot;Data Streaming Advocate.&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7327acf9-537c-46af-b976-cc28d07ae7e5_627x580.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2023-09-05T15:46:25.896Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa249188e-8320-4e3b-8eb0-b1a9c40f329d_1298x1180.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://www.streamingdata.tech/p/data-platforms-in-2030&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:136208753,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:11,&quot;comment_count&quot;:4,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Data Streaming 
Journey&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7327acf9-537c-46af-b976-cc28d07ae7e5_627x580.jpeg&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p>I predicted data platform unification, where a streaming log (e.g. Kafka), a LakeHouse (e.g. Iceberg) and an OLAP database (e.g. ClickHouse) merge into a much more unified &#8220;consolidated data engine&#8221; that&#8217;s easier to use and manage, rather than remaining three separate systems. I stand corrected: we may see this much earlier &#128578;. Several vendors have announced plans to integrate Kafka with the Iceberg API (Redpanda already has <a href="https://docs.redpanda.com/current/manage/topic-iceberg-integration/">Iceberg support in beta</a>), which also applies to Fluss. It <a href="https://alibaba.github.io/fluss-docs/docs/streaming-lakehouse/overview/">tightly integrates with LakeHouses</a> and highlights the union read use case (also known as <a href="https://nightlies.apache.org/flink/flink-docs-master/docs/connectors/datastream/hybridsource/">Hybrid Source</a> in Apache Flink). </p><p>However, one can argue that Fluss goes beyond just these two dimensions with the PrimaryKey Table abstraction. The underlying RocksDB database can be used to serve not just efficient lookup queries but aggregations as well! There is <a href="https://alibaba.github.io/fluss-docs/docs/engine-flink/reads/#aggregations">a small example</a> in the docs. RocksDB was also used by the Rockset database as the underlying storage for their <a href="https://rockset.com/blog/how-we-use-rocksdb-at-rockset/">analytical storage engine</a>, so it&#8217;s clearly possible to pull off, at least at a certain scale.</p><h3>Delta Join</h3><p>This is something that I really wanted to see in Flink several years ago. It may finally happen! 
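As a preview of the idea explained just below, here is a toy Python sketch (my own illustration, not the actual FLIP-486 design): each side persists its events to an indexed store standing in for a PrimaryKey Table and probes the other side with a keyed lookup, instead of holding unbounded join state.

```python
# Toy "delta join" sketch (my illustration, not the FLIP-486 design):
# instead of accumulating unbounded state for both streams, each event is
# persisted to an indexed store (standing in for a Fluss PrimaryKey Table),
# and the join probes the opposite store with a keyed lookup.

left_store = {}   # indexed storage for the left stream
right_store = {}  # indexed storage for the right stream
joined = []

def on_left(event):
    left_store[event["id"]] = event       # persist instead of keeping state
    match = right_store.get(event["id"])  # keyed lookup on the other side
    if match is not None:
        joined.append({**match, **event})

def on_right(event):
    right_store[event["id"]] = event
    match = left_store.get(event["id"])
    if match is not None:
        joined.append({**match, **event})

on_left({"id": 1, "order": "A-100"})
on_right({"id": 1, "user": "alice"})  # finds the earlier left-side event
assert joined == [{"id": 1, "order": "A-100", "user": "alice"}]
```
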
</p><p>The idea here is very simple: </p><ul><li><p>A typical &#8220;windowless&#8221; stream-stream join needs to accumulate state for both sides forever. This becomes very challenging at scale. <a href="https://sap1ens.com/blog/2020/12/12/streaming-systems-and-global-state/">I wrote</a> about this before.</p></li><li><p>The streaming platform provides a form of tiered storage with effectively infinite retention.</p></li><li><p>So, instead of accumulating data in state, we can perform a lookup when needed and query the tiered storage instead. Some form of batching will likely be required.</p></li><li><p>In a typical tiered storage implementation that&#8217;s available in Apache Kafka, Confluent or Redpanda, these lookups are not as efficient. But in Fluss, PrimaryKey Tables make it possible. I imagine Fluss contributors may also decide to add secondary indexes to make lookups even faster.</p></li></ul><p>Feel free to check <a href="https://cwiki.apache.org/confluence/display/FLINK/FLIP-486%3A+Introduce+A+New+DeltaJoin">the official proposal</a> for the Delta Join in Flink.</p><h3>Implementation</h3><p>Fluss is implemented in Java and currently requires Zookeeper for coordination. Tables are partitioned and replicated similarly to Kafka topics. Data is stored on local disks. Overall, this feels like a reliable but somewhat dated design (what is this, 2019? where is Rust?! &#128540;). The roadmap already mentions planned changes like Zookeeper removal and a zero-disk architecture though.</p><p>When reading through some parts of the source code, I couldn&#8217;t get rid of the impression that I was looking at an amalgamation of Flink (Fluss borrowed its type system), Paimon (some design decisions) and Kafka (same abstractions, configuration, etc.). This is not surprising at all, given the team working on Fluss (Flink and Paimon contributors). </p><p>Using Arrow as the primary data exchange protocol is what makes Fluss very interesting. 
Producers accumulate batches of Arrow vectors and send them to the server using Arrow IPC. </p><p>I invite you to check <a href="https://github.com/alibaba/fluss/blob/a1280c6888c20d4318ea3bee4784dd0ee321c6c4/fluss-client/src/main/java/com/alibaba/fluss/client/write/ArrowLogWriteBatch.java">ArrowLogWriteBatch.java</a> and <a href="https://github.com/alibaba/fluss/blob/a1280c6888c20d4318ea3bee4784dd0ee321c6c4/fluss-common/src/main/java/com/alibaba/fluss/record/MemoryLogRecordsArrowBuilder.java">MemoryLogRecordsArrowBuilder.java</a> to understand how Arrow batches are written.</p><p>If you still question whether vectorized columnar format is a good choice for a streaming system, I recommend you to read <a href="https://www.arroyo.dev/blog/why-arrow-and-datafusion">this blog post</a> from Arroyo. Some highlights:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6Iql!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F457776e0-2031-4f1a-b550-7a5db4fd710f_1606x586.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6Iql!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F457776e0-2031-4f1a-b550-7a5db4fd710f_1606x586.png 424w, https://substackcdn.com/image/fetch/$s_!6Iql!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F457776e0-2031-4f1a-b550-7a5db4fd710f_1606x586.png 848w, https://substackcdn.com/image/fetch/$s_!6Iql!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F457776e0-2031-4f1a-b550-7a5db4fd710f_1606x586.png 1272w, 
https://substackcdn.com/image/fetch/$s_!6Iql!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F457776e0-2031-4f1a-b550-7a5db4fd710f_1606x586.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6Iql!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F457776e0-2031-4f1a-b550-7a5db4fd710f_1606x586.png" width="1456" height="531" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/457776e0-2031-4f1a-b550-7a5db4fd710f_1606x586.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:531,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:165387,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6Iql!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F457776e0-2031-4f1a-b550-7a5db4fd710f_1606x586.png 424w, https://substackcdn.com/image/fetch/$s_!6Iql!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F457776e0-2031-4f1a-b550-7a5db4fd710f_1606x586.png 848w, https://substackcdn.com/image/fetch/$s_!6Iql!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F457776e0-2031-4f1a-b550-7a5db4fd710f_1606x586.png 1272w, 
https://substackcdn.com/image/fetch/$s_!6Iql!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F457776e0-2031-4f1a-b550-7a5db4fd710f_1606x586.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><a href="https://www.arroyo.dev/blog/why-arrow-and-datafusion">We built a new SQL Engine on Arrow and DataFusion</a></figcaption></figure></div><p>Fluss protocol is implemented using Protobuf, and it&#8217;s quite easy to grasp: check the full spec <a 
href="https://github.com/alibaba/fluss/blob/a1280c6888c20d4318ea3bee4784dd0ee321c6c4/fluss-rpc/src/main/proto/FlussApi.proto">here</a>. It&#8217;s less than 1K LOC at the moment (I&#8217;m curious what it&#8217;ll look like after adding support for consumer groups&#8230;). </p><h3>Conclusion </h3><p>Fluss is a truly unique system that tries to model real-time data streams as schematized tables. It&#8217;s still very early, but I&#8217;d keep an eye on it: PrimaryKey Tables and Delta Joins can become killer features. </p><p>And Apache Flink users should be really excited about the roadmap:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dB3f!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1fce15df-c9ba-4000-8a11-437aa879c0ea_1736x528.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dB3f!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1fce15df-c9ba-4000-8a11-437aa879c0ea_1736x528.png 424w, https://substackcdn.com/image/fetch/$s_!dB3f!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1fce15df-c9ba-4000-8a11-437aa879c0ea_1736x528.png 848w, https://substackcdn.com/image/fetch/$s_!dB3f!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1fce15df-c9ba-4000-8a11-437aa879c0ea_1736x528.png 1272w, https://substackcdn.com/image/fetch/$s_!dB3f!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1fce15df-c9ba-4000-8a11-437aa879c0ea_1736x528.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!dB3f!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1fce15df-c9ba-4000-8a11-437aa879c0ea_1736x528.png" width="1456" height="443" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1fce15df-c9ba-4000-8a11-437aa879c0ea_1736x528.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:443,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:154102,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!dB3f!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1fce15df-c9ba-4000-8a11-437aa879c0ea_1736x528.png 424w, https://substackcdn.com/image/fetch/$s_!dB3f!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1fce15df-c9ba-4000-8a11-437aa879c0ea_1736x528.png 848w, https://substackcdn.com/image/fetch/$s_!dB3f!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1fce15df-c9ba-4000-8a11-437aa879c0ea_1736x528.png 1272w, https://substackcdn.com/image/fetch/$s_!dB3f!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1fce15df-c9ba-4000-8a11-437aa879c0ea_1736x528.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft 
pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><a href="https://alibaba.github.io/fluss-docs/roadmap/">Fluss Roadmap</a></figcaption></figure></div><p>Things like predicate pushdown and cost-based optimizer can really make a 10x - 100x difference. </p><div><hr></div><h2>Events</h2><p>I&#8217;ll be speaking at these events next week:</p><ul><li><p><a href="https://www.redpanda.com/streamfest">Redpanda Streamfest</a>. 
Come join us to hear my predictions about the future of data streaming!</p></li><li><p><a href="https://www.linkedin.com/events/7265063843909767168/comments/">Apache Flink 2.0: What&#8217;s to come and the impact on stream processing</a> webinar from DeltaStream.</p></li></ul>]]></content:encoded></item><item><title><![CDATA[Streaming Data Best Practices]]></title><description><![CDATA[Newsletter highlights.]]></description><link>https://www.streamingdata.tech/p/streaming-data-best-practices</link><guid isPermaLink="false">https://www.streamingdata.tech/p/streaming-data-best-practices</guid><dc:creator><![CDATA[Yaroslav Tkachenko]]></dc:creator><pubDate>Fri, 29 Nov 2024 05:04:56 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/e21630f5-b6e9-4c24-8ae5-92ee98a6a8f7_1429x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This post is a periodically updated collection of publications from this newsletter highlighting the best practices for building data streaming pipelines and platforms.</p><ul><li><p>Start with <a href="https://streamingdata.substack.com/p/streaming-first-archirecture">Streaming-First Architecture</a>. Even if you&#8217;re not going to follow it, it offers a lot of food for thought.</p></li><li><p>Design your messages with <a href="https://streamingdata.substack.com/p/approaches-for-defining-message-metadata">Approaches for Defining Message Metadata</a>.</p></li><li><p>Decide on using <a href="https://streamingdata.substack.com/p/changelog-vs-append-only-data-streams">Changelog vs Append-Only Data Streams</a>.</p></li><li><p>Understand how to ingest operational data with <a href="https://streamingdata.substack.com/p/change-data-capture-is-still-an-anti">Change Data Capture Is Still an Anti-pattern. 
And You Still Should Use It</a>.</p></li><li><p>Understand how to ingest event data with <a href="https://streamingdata.substack.com/p/building-event-ingestion-http-api">Building Event Ingestion HTTP API for Kafka</a>.</p></li><li><p>Decide what data streams to materialize after reading <a href="https://streamingdata.substack.com/p/considerations-for-data-stream-materialization">Considerations for Data Stream Materialization</a>.</p></li><li><p>Determine if exactly-once delivery is a must with <a href="https://streamingdata.substack.com/p/do-you-really-need-exactly-once-delivery">Do you really need exactly-once delivery?</a></p></li><li><p>Think about stateful stream processing after reading <a href="https://streamingdata.substack.com/p/state-and-timers">State and Timers</a>.</p></li><li><p>Consider how to deploy your stateful applications with <a href="https://streamingdata.substack.com/p/blackhole-sink-pattern-for-blue-green">Blackhole Sink Pattern for Blue-Green Deployments</a>.</p></li></ul>]]></content:encoded></item><item><title><![CDATA[Do you even need Kafka?]]></title><description><![CDATA[Most likely.]]></description><link>https://www.streamingdata.tech/p/do-you-even-need-kafka</link><guid isPermaLink="false">https://www.streamingdata.tech/p/do-you-even-need-kafka</guid><dc:creator><![CDATA[Yaroslav Tkachenko]]></dc:creator><pubDate>Mon, 28 Oct 2024 15:52:56 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!_lyL!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7327acf9-537c-46af-b976-cc28d07ae7e5_627x580.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Please <a href="https://streamingdata.substack.com/subscribe">pledge your support</a> if you find this newsletter useful. I&#8217;m not planning to introduce paid-only posts anytime soon, but I&#8217;d appreciate some support from the readers. 
Thank you!</em></p><div><hr></div><p>Recently, I&#8217;ve heard arguments like &#8220;Our solution doesn&#8217;t require Kafka&#8221; or &#8220;You can do data streaming without Kafka&#8221; from several vendors. I wanted to zoom in and understand if a streaming platform like Kafka is still needed nowadays (spoiler alert: it&#8217;s not going anywhere). </p><h3>Which Kafka?</h3><p>But first of all, ask people what they mean when they say &#8220;Kafka&#8221;: they could mean either Apache Kafka or another tool that supports the Kafka API. </p><p>Historically, Apache Kafka was perceived as a fairly complicated system that was hard to deploy and maintain. Using Kafka in a small or medium-sized company was often considered overengineering (&#8220;of course there is a simpler way to do X!&#8221;). I guess it might still be the case, but things have changed drastically. </p><p>Modern Apache Kafka doesn&#8217;t need Zookeeper, <a href="https://www.redpanda.com/">Redpanda</a> doesn&#8217;t need the JVM, and <a href="https://www.warpstream.com/">WarpStream</a> doesn&#8217;t even need any disks or state to manage. There are also <a href="https://pulsar.apache.org/">Apache Pulsar</a>, <a href="https://www.automq.com/">AutoMQ</a>, <a href="https://buf.build/">Buf</a> and probably other solutions that support the Kafka API. Many have fully managed offerings. </p><p>So, let&#8217;s forget about the &#8220;Kafka is really hard to run&#8221; argument for a second.</p><h3>Streaming databases</h3><p>Coming back to the original statements, I frequently hear them in the context of streaming databases. For example, when using <a href="https://materialize.com/">Materialize</a>, you can get <a href="https://materialize.com/blog/zero-staleness-faster-primary/">really strong consistency AND latency guarantees</a> when connecting directly to Postgres. If all you need is to offload a complicated query or report from your read replica, then the Postgres &amp; Materialize combo is sufficient. 
I do feel like it can be a powerful way to build user-facing data products: ingest data from an operational database, transform it into the right shape, and serve API requests using the underlying storage. <a href="https://www.confluent.io/events/current/2022/oh-that-microservice-should-have-been-a-sql-query/">OH: That microservice should have been a SQL query</a> is still my favourite talk on this topic.</p><p>Or perhaps you just want to move your operational data from MySQL to Snowflake. It seems like Change Data Capture tools (e.g. Debezium) made operational databases really popular as a data source. Even five years ago, application or clickstream events were a much more favoured data source. But nowadays, thanks to reliable CDC tools, many companies want to see their operational data in their data lake or data warehouse first.</p><p>Also, some application events could be translated into CDC streams thanks to the <a href="https://microservices.io/patterns/data/transactional-outbox.html">Outbox pattern</a>.   </p><p>So, to summarize, streaming databases can act as a bridge between operational data (thanks to CDC capabilities), which could be messy, require normalization, etc., and a primary data lake / data warehouse. </p><h3>The best data hub</h3><p>But despite the increased popularity of operational data, a typical enterprise data platform consists of many more data sources: not just application or clickstream events but also third-party data, SaaS vendor data, legacy application data, and many more. </p><p>When I worked at Activision, we ingested game data from game consoles (like PlayStation), mobile devices, and our partners. The game telemetry we received could contain anything: a match summary, a marketplace purchase, a GPU temperature reading or even binary crash dump data. 
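As a quick illustration of the Outbox pattern mentioned above, here is a minimal Python/SQLite sketch (the table and column names are made up): the business write and the outgoing event are committed in a single transaction, and a relay (e.g. a CDC connector tailing the outbox table) later publishes the events to the streaming platform.

```python
import json
import sqlite3

# Minimal transactional-outbox sketch (table/column names are made up).
# The business row and the event row are written in ONE transaction, so a
# relay tailing the outbox table can publish events to the streaming
# platform without ever missing or duplicating a state change.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,
                         topic TEXT, payload TEXT);
""")

def place_order(order_id, total):
    with db:  # one atomic transaction for both writes
        db.execute("INSERT INTO orders VALUES (?, ?)", (order_id, total))
        db.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("orders", json.dumps({"order_id": order_id, "total": total})),
        )

place_order(1, 99.5)

# The relay's job: read outbox rows in order and hand them to a producer.
rows = db.execute("SELECT topic, payload FROM outbox ORDER BY id").fetchall()
assert rows == [("orders", '{"order_id": 1, "total": 99.5}')]
```
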
</p><p><strong>A messaging or streaming platform is still the best way to do data integration.</strong> <a href="https://www.enterpriseintegrationpatterns.com/">Enterprise Integration Patterns</a> is 20 years old, but it&#8217;s incredibly relevant. I doubt things will drastically change in the next 20 years. Yes, perhaps we&#8217;ll see data lakehouses play a larger role, but they won&#8217;t be able to replace real-time use cases (without evolving a lot).</p><p>Also, a streaming platform like Kafka is still one of the best ways to <em>quickly</em> and <em>reliably</em> ingest large volumes of <em>structured</em> or <em>unstructured</em> data. </p><p>Finally, a streaming platform like Kafka is great at dataset reuse. Consumers are very cheap. The same topic could be used for writing data to the data lake, powering an OLAP database, hydrating a cache, populating a search index, and emitting user notifications.</p><p>To be honest, I feel a bit silly repeating these features (like I&#8217;m a salesperson trying to sell Kafka). The industry should&#8217;ve learned the power of reusable data streams by now, but I guess it&#8217;s good to be reminded of it once in a while. </p><h3>Verdict</h3><p>If you <em>just</em> need to move your relational database data into a data warehouse, you don&#8217;t necessarily need to use a streaming platform like Kafka. However, any decently sized enterprise data platform would benefit from one. </p><div><hr></div><h4>Events</h4><p>I&#8217;ll be speaking at Redpanda&#8217;s Streamfest online conference in early December. 
You can register <a href="https://www.redpanda.com/streamfest#register">here</a>.</p>]]></content:encoded></item><item><title><![CDATA[Current 2024]]></title><description><![CDATA[Announcements:]]></description><link>https://www.streamingdata.tech/p/current-2024</link><guid isPermaLink="false">https://www.streamingdata.tech/p/current-2024</guid><dc:creator><![CDATA[Yaroslav Tkachenko]]></dc:creator><pubDate>Mon, 23 Sep 2024 16:04:11 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/83931058-8932-4879-b82c-2588ca304621_1512x1080.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Please <a href="https://streamingdata.substack.com/subscribe">pledge your support</a> if you find this newsletter useful. I&#8217;m not planning to introduce paid-only posts anytime soon, but I&#8217;d appreciate some support from the readers. Thank you!</em></p><div><hr></div><p><strong>Current</strong> is one of the main events in the data streaming space. <strong>Current 2024</strong> happened last week in Austin, Texas. </p><h3>Keynotes</h3><p>The keynotes were somewhat underwhelming. There were no huge announcements or new product launches. Even Jay&#8217;s &#8220;one more thing&#8221; announcement regarding the WarpStream acquisition already wasn&#8217;t big news at the time.</p><p>It was great to hear that Confluent&#8217;s Tableflow will be in the open preview very shortly. 
There were also many Flink-related updates from them:</p><ul><li><p>Python and Java <strong>Table API</strong> support is coming.</p></li><li><p>Flink private networking and external schema registry support.</p></li><li><p>Flink available in the Confluent <em>Platform.</em></p></li></ul><p>I think it&#8217;s clear Flink users require much more than a Flink SQL editor.</p><p>But probably the main theme of the first-day keynote was &#8220;AI&#8221;<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>.  I feel like <a href="https://x.com/TechCrunch/status/1790504691945898300">every vendor must mention &#8220;AI&#8221;</a> at any serious tech event nowadays: you want to show your customers and your investors that you can either leverage or assist with building the new wave of &#8220;AI&#8221; products. </p><p>However, in Confluent&#8217;s case (or any streaming data infra company, for that matter), I feel like the value is real: many user-facing products need near real-time data, which is hard to implement without streaming. I&#8217;m not convinced that a Flink UDF calling an OpenAI endpoint is all we need (see more on this below).</p><p>The second-day keynote was focused on developers. Kafka Docker images, DLQ for Kafka Streams, PATCH endpoints for Kafka Connect, and, of course, the <a href="https://www.confluent.io/product/vs-code/">VS Code extension</a> were all very welcome. </p><p>When Tim Berglund appeared on stage, I felt like the world became a little bit brighter. I missed him<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>! Btw, Tim <a href="https://www.buzzsprout.com/186154/episodes/2555848-streaming-call-of-duty-at-activision-with-apache-kafka-ft-yaroslav-tkachenko">interviewed</a> me on the Streaming Audio podcast a few years ago. 
</p><h4>&#8220;Shifting left to make it right&#8221;</h4><p>&#8220;Shifting left&#8221; was mentioned during the keynotes five times (I counted). I also heard it in the hallways a lot. In case you don&#8217;t know, in the data platform context, shifting left means working more closely with operational / application development teams. For example, it means shared ownership over data products or data pipelines, with the goal of stopping data artifacts from being treated as second-class citizens. <a href="https://martinfowler.com/articles/data-mesh-principles.html">Data Mesh architecture</a> is one of the ways to implement this principle. </p><p>It&#8217;s quite refreshing to hear this not just from consultants or vendors but from large enterprises as well. I suspect that execs have finally started to understand the importance of high-quality data. If you want to build <em>actually useful</em> user-facing &#8220;AI&#8221; products, you can&#8217;t do it without clean and fresh datasets. And yet, most enterprises still struggle with basic BI projects&#8230;  </p><p>I really hope to see some change in the industry. I spent some time in my career as a data engineer building data pipelines, and sometimes, working around application teams was a ridiculous process costing the company tens of thousands of dollars. For example, one time at Shopify, <a href="https://www.youtube.com/watch?v=tiGxEGPyqCg&amp;ab_channel=FlinkForward">we implemented a streaming join between nine (!) tables</a> because the application team couldn&#8217;t emit a single domain event containing all the relevant information<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>.  
</p><h3>WarpStream Acquisition </h3><p>The acquisition <a href="https://www.warpstream.com/blog/warpstream-is-dead-long-live-warpstream">was announced</a> a week before, but I&#8217;d love to share a few things in this post since we heard its CEO, Richie Artoul, during the first day keynote. </p><p>First of all, as you may know, I&#8217;m a big fan of WarpStream.</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;e15e2980-c08c-44f9-8cd8-eb9b6075b6b2&quot;,&quot;caption&quot;:&quot;Please pledge your support if you find this newsletter useful. I&#8217;m not planning to introduce paid-only posts anytime soon, but I&#8217;d appreciate some support from the readers. Thank you!&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Streaming Platforms in the Cloud Era&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:5669707,&quot;name&quot;:&quot;Yaroslav Tkachenko&quot;,&quot;bio&quot;:&quot;Data Streaming Advocate.&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7327acf9-537c-46af-b976-cc28d07ae7e5_627x580.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-07-15T15:47:35.550Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1f612531-8119-4b22-8cde-45654d65495a_400x320.jpeg&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://streamingdata.substack.com/p/streaming-platforms-in-the-cloud-era&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:144958968,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:10,&quot;comment_count&quot;:0,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;Data Streaming 
Journey&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7327acf9-537c-46af-b976-cc28d07ae7e5_627x580.jpeg&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p>I think their architecture is <em>truly</em> novel and <em>truly</em> cloud-native, and it supports a few very interesting features (check the post above). </p><p>Somehow, they were insanely productive and, with just ten-ish people, built a great product that was able to compete with major vendors. I was so excited about their future!</p><p>However, they chose an exit. Confluent has made a great move, eliminating a competitor and getting a stellar team. </p><p>I know that the WarpStream team still has a lot of autonomy; Confluent also chose to keep the brand (it&#8217;s now &#8220;WarpStream by Confluent&#8221;), so I&#8217;m hopeful they&#8217;ll be able to innovate at the same pace or even faster (with more resources now). </p><p>Some people pointed out that WarpStream uses a different stack (Golang instead of JVM), but, in my opinion, nowadays, large companies are totally fine at integrating at the Kubernetes level. It&#8217;s not that important what&#8217;s running in a given container. </p><p>I&#8217;m slightly more concerned about additional services WarpStream is building, like the Schema Registry and the Iceberg support. They&#8217;re essentially duplicating Confluent&#8217;s existing (or incoming) products. I&#8217;d be very curious to see how they resolve this. For example, I was looking forward to using fully managed WarpStream&#8217;s Schema Registry; running the open-source version of the Confluent Schema Registry is not fun. 
</p><h3>Redpanda Announcements</h3><p>Similarly, Redpanda didn&#8217;t announce anything at the event, but they made a few major announcements a week earlier (trying to steal the thunder), so I think it&#8217;s important to cover their updates as well. </p><p>First of all, they announced <a href="https://www.redpanda.com/blog/redpanda-one-multi-modal-streaming-data-engine">Redpanda One</a>: a single, multi-modal engine. It includes <a href="https://www.redpanda.com/blog/cloud-topics-streaming-data-object-storage">Cloud Topics</a>, which is Redpanda&#8217;s answer to WarpStream. </p><p>I must say, it&#8217;s quite impressive. The ability to choose the underlying topic storage (high-latency object storage vs regular on-disk vs ultra low-latency) on a per-topic basis is very powerful! It does mean a more complicated design, though.</p><p>The <a href="https://www.redpanda.com/blog/apache-iceberg-topics-streaming-data">Iceberg support</a> is finally coming to Redpanda. They were actually <em>the first vendor</em> to announce this capability (more than a year ago). I&#8217;ve seen the demo, and it looks slick!  </p><p>Redpanda has also made &#8220;AI&#8221; capabilities its focus lately. But instead of just making it easier to call the OpenAI API, <a href="https://www.redpanda.com/blog/ai-connectors-gpu-runtime-support">they actually integrated Redpanda Connect with LLMs</a>. This fits their <a href="https://ai.redpanda.com/">Sovereign AI</a> messaging quite well. </p><p>Overall, it feels like Redpanda is able to keep up with Confluent and, in some cases, innovate and advance further. I really hope that they&#8217;ll succeed, because we need more competition in this space.   </p><h3>Talks</h3><p>Here&#8217;s a selection of great talks I had a chance to attend: </p><ul><li><p><strong>Consistency and Streaming: Why You Should Care About It</strong>. 
If you&#8217;ve been following Materialize, you won&#8217;t learn anything new, but if the topic of data consistency in the streaming context is new to you, this is a great explainer. My advice to Materialize is to cover the use case of using payload-level timestamp fields as virtual timestamps when consuming Kafka topics - this is what 80% of people care about.  </p></li><li><p><strong>Scaling Data Ingestion: Overcoming Challenges with Cell Architecture</strong>. A pragmatic way to scale stream-processing via object storage. Showing a solid example of the cell architecture in the wild. And some real-world comparisons between Iceberg, Paimon and raw files.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kacy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f1f388b-372d-4840-adbc-02ed907c3c65_1758x1017.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kacy!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f1f388b-372d-4840-adbc-02ed907c3c65_1758x1017.jpeg 424w, https://substackcdn.com/image/fetch/$s_!kacy!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f1f388b-372d-4840-adbc-02ed907c3c65_1758x1017.jpeg 848w, https://substackcdn.com/image/fetch/$s_!kacy!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f1f388b-372d-4840-adbc-02ed907c3c65_1758x1017.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!kacy!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f1f388b-372d-4840-adbc-02ed907c3c65_1758x1017.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kacy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f1f388b-372d-4840-adbc-02ed907c3c65_1758x1017.jpeg" width="1758" height="1017" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0f1f388b-372d-4840-adbc-02ed907c3c65_1758x1017.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1017,&quot;width&quot;:1758,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:295050,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!kacy!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f1f388b-372d-4840-adbc-02ed907c3c65_1758x1017.jpeg 424w, https://substackcdn.com/image/fetch/$s_!kacy!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f1f388b-372d-4840-adbc-02ed907c3c65_1758x1017.jpeg 848w, https://substackcdn.com/image/fetch/$s_!kacy!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f1f388b-372d-4840-adbc-02ed907c3c65_1758x1017.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!kacy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f1f388b-372d-4840-adbc-02ed907c3c65_1758x1017.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">You&#8217;ll be surprised to learn who won in the end&#8230;</figcaption></figure></div></li><li><p><strong>Enabling Flink's Cloud-Native Future: Introducing Disaggregated State in Flink 2.0</strong>. Flink 2.0 will be getting a state backend that natively supports object storage. This comes with a lot of challenges. 
It seems like Async APIs will help reduce the impact of increased latency. The current implementation performs at 40% of the baseline level (disk-based), which is considered pretty good. We&#8217;ll see what the final number will be.</p></li><li><p><strong>Events at the (API) Horizon: How to synthesize domain events &amp; changes from your HTTP/REST APIs.</strong> Very interesting talk comparing eventing and REST APIs. A demo of <a href="https://eventception.io/">Eventception</a>, which is worth checking out if you use a service proxy like Envoy or Kong and care about events.</p></li><li><p><strong>Flinking Enrichment: Shouldn't This Be Easier?</strong> Legendary David Anderson talking about one of the most challenging problems in streaming: data enrichment. Lots of focus on SQL and join semantics, with many great learnings. It&#8217;d be great to see more low-level examples in the future. </p></li><li><p><strong>Bridging the Kafka/Iceberg Divide.</strong> How do you represent a Kafka topic in the Avro format as an Iceberg table? What if it has incompatible versions? Folks from Confluent sharing their learnings from building Tableflow. </p></li><li><p><strong>Building a Scalable Flink Platform: A Tale of 15,000 Jobs at Netflix.</strong> Nice talk from Netflix showing the evolution of their managed Flink platform that works at scale. 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TOwL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F543270bd-af5b-4d92-98fa-ba93203d47c6_1908x1151.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TOwL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F543270bd-af5b-4d92-98fa-ba93203d47c6_1908x1151.jpeg 424w, https://substackcdn.com/image/fetch/$s_!TOwL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F543270bd-af5b-4d92-98fa-ba93203d47c6_1908x1151.jpeg 848w, https://substackcdn.com/image/fetch/$s_!TOwL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F543270bd-af5b-4d92-98fa-ba93203d47c6_1908x1151.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!TOwL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F543270bd-af5b-4d92-98fa-ba93203d47c6_1908x1151.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TOwL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F543270bd-af5b-4d92-98fa-ba93203d47c6_1908x1151.jpeg" width="1456" height="878" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/543270bd-af5b-4d92-98fa-ba93203d47c6_1908x1151.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:878,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:478400,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!TOwL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F543270bd-af5b-4d92-98fa-ba93203d47c6_1908x1151.jpeg 424w, https://substackcdn.com/image/fetch/$s_!TOwL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F543270bd-af5b-4d92-98fa-ba93203d47c6_1908x1151.jpeg 848w, https://substackcdn.com/image/fetch/$s_!TOwL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F543270bd-af5b-4d92-98fa-ba93203d47c6_1908x1151.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!TOwL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F543270bd-af5b-4d92-98fa-ba93203d47c6_1908x1151.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Flink&#8217;s Control Plane at Netflix</figcaption></figure></div></li></ul><p>Also in my watchlist: </p><ul><li><p><strong>Speed Wins: The Basics of How to Push More Queries Through Each CPU Core.</strong></p></li><li><p><strong>The SQL Ecosystem: Powering the Instant World with 40-Year-Old Legacy? </strong></p></li><li><p><strong>Streamlining Entry into Streaming Analytics with JupyterHub and Apache Flink.</strong></p></li><li><p><strong>Towards a Self-Tuning Flink Runtime: A Year in Flink&#8217;s Scheduling.</strong></p></li><li><p><strong>Change Data Capture &amp; Kafka: How Slack Transitioned to CDC with Debezium &amp; Kafka Connect.</strong></p></li></ul><p>Finally, please check the <strong>Streamlining History: ClickHouse &amp; Flink's Fast-Track for Data Backfills</strong> talk from my co-worker Rafael Aguiar! 
In a world where Iceberg is supported natively by Kafka brokers, we&#8217;ll be using something like Flink&#8217;s Hybrid Source to merge the two. Rafael explains how to do exactly that, but with ClickHouse.</p><h3>Cool Vendors</h3><p>I had a chance to talk to a few really cool companies I hadn&#8217;t heard of before:</p><ul><li><p><a href="https://www.dbos.dev/">DBOS</a>: a new durable execution engine. It reminds me of <a href="https://restate.dev/">Restate</a>. They made some interesting design choices, like relying on Postgres for journaling or heavily using annotations/decorators in their SDKs, but they&#8217;re quite comparable in terms of functionality. Overall, the product feels somewhat simpler, probably in a good way.</p></li><li><p><a href="https://www.thatdot.com/">thatDot</a>, the company behind <a href="https://github.com/thatdot/quine">Quine</a>:</p><blockquote><p>Quine is a streaming graph interpreter; a server-side program that consumes data, builds it into a stateful graph structure, and runs live computation on that graph to answer questions or compute results. Those results stream out in real-time.</p></blockquote><p>Let me expand on this.</p></li></ul><p>I hadn&#8217;t heard of this approach to stateful stream-processing before. They seem to leverage some properties you can get by representing relations as graphs. </p><p>They do make some very strong claims. They say that they can support &#8220;infinite joins&#8221; because they&#8217;re not limited by time windowing &#8220;like every other event processor&#8221;, which sounds a bit naive.</p><p>First of all, you actually <strong>don&#8217;t need windowing</strong> to implement stateful stream-processing with Apache Flink, for example. Windowing is very helpful for limiting the amount of state to maintain, but you might be okay with the additional cost and complexity. 
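To make the state implications concrete, here is a minimal, pure-Python sketch of what an unwindowed stream-stream inner join has to keep around. The event shapes and keys are invented for illustration; a real Flink regular join does the equivalent in managed state.

```python
from collections import defaultdict

# A toy, non-windowed stream-stream inner join: both sides are buffered
# per key indefinitely, mirroring how an unwindowed join keeps both
# inputs in state. Keys and event payloads here are made up.
class UnwindowedJoin:
    def __init__(self):
        self.left = defaultdict(list)   # key -> all left events ever seen
        self.right = defaultdict(list)  # key -> all right events ever seen

    def on_left(self, key, event):
        self.left[key].append(event)
        # A new left event joins with *all* right events seen for this key.
        return [(event, r) for r in self.right[key]]

    def on_right(self, key, event):
        self.right[key].append(event)
        return [(l, event) for l in self.left[key]]

join = UnwindowedJoin()
assert join.on_left("order-1", {"amount": 10}) == []
assert join.on_right("order-1", {"customer": "alice"}) == [
    ({"amount": 10}, {"customer": "alice"})
]
# Note: the buffers never shrink. Without windowing (or a state TTL),
# state grows with the input, which is exactly the cost discussed above.
```
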
In the case of that large streaming pipeline with many joins we implemented at Shopify (which I described above), we didn&#8217;t use windowing for most of the joins. We were able to ingest <strong>all of Shopify&#8217;s sales/order data</strong> (via CDC) and keep it in state (which was ~13 TB at the time). The savepointing was somewhat challenging, but it was still doable. </p><p>Even though graphs can help you represent the data better, you&#8217;re still limited by basic resources like memory. And when you start spilling data to disk, you might not be faster than Flink with RocksDB. </p><p>Anyway, I&#8217;d love for someone to do a proper benchmark comparing Quine and a non-windowed Flink pipeline. Their approach looks very, very interesting.</p><h3>&#8220;That Is Not a Real SQL&#8221;</h3><p>I overheard one of the developer advocates from a well-known company say something like this: &#8220;Well, Flink SQL is not a real SQL, you know, like in databases&#8221;. This was so frustrating to hear that I decided to write my response here. And by the way, I&#8217;m not even a Flink SQL fan (it has its issues).</p><p>I acknowledge that modern databases can do amazing things. For example, DuckDB and Umbra/CedarDB really challenge what we think is possible. But there is nothing magical about them. In the end, every database just queries files in a <em>for</em> loop (more or less &#128578;). Pretty much every database does things like SQL-to-AST conversion, building logical and physical plans, and optimizing them. </p><p>Flink SQL is exactly the same. It relies on Apache Calcite quite a bit, which is also used by many other data frameworks and query engines. Flink SQL goes through the same stages: SQL-to-AST conversion, logical and physical planning, and optimization. It just happens to usually work with batches of data received from data streams like Kafka topics, not files. </p><p>The world of databases and streaming systems is converging. 
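The &#8220;files in a <em>for</em> loop&#8221; point can be made painfully literal. This is a toy sketch, not how Calcite or any real planner works: the plan representation and operator names are invented, but the shape (a declarative plan, then a plain loop over rows) is the shared skeleton.

```python
# A toy query pipeline: a declarative "logical plan" (filter + project),
# executed as a plain for loop over rows. All names here are invented
# for illustration; real engines add parsing, optimization, and more.
rows = [
    {"id": 1, "country": "CA", "amount": 20},
    {"id": 2, "country": "US", "amount": 35},
    {"id": 3, "country": "CA", "amount": 50},
]

# Roughly: SELECT amount FROM rows WHERE country = 'CA'
logical_plan = [("filter", lambda r: r["country"] == "CA"),
                ("project", lambda r: r["amount"])]

def execute(plan, source):
    # "Physical execution": in the end, just a loop over the data,
    # whether the source is files on disk or batches from a Kafka topic.
    for row in source:
        out, keep = row, True
        for op, fn in plan:
            if op == "filter" and not fn(out):
                keep = False
                break
            if op == "project":
                out = fn(out)
        if keep:
            yield out

assert list(execute(logical_plan, rows)) == [20, 50]
```
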
Perhaps it&#8217;s advanced more than you realize. </p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>I reject using &#8220;AI&#8221; without the quotes in the context of modern &#8220;AI&#8221; systems like ChatGPT and LLMs. Humanity hasn&#8217;t invented actual Artificial Intelligence yet. What we have is Machine Learning. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Tim left Confluent to join StarTree as a VP of DevRel. He returned to Confluent a few months ago in the same capacity. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>To be honest, this wasn&#8217;t completely that team&#8217;s fault. At the time, Shopify didn&#8217;t have a reliable way of sending domain events (e.g. with the Outbox pattern), and we couldn&#8217;t tolerate any data loss. 
</p></div></div>]]></content:encoded></item><item><title><![CDATA[Streaming-First Architecture]]></title><description><![CDATA[Will It Stream?]]></description><link>https://www.streamingdata.tech/p/streaming-first-archirecture</link><guid isPermaLink="false">https://www.streamingdata.tech/p/streaming-first-archirecture</guid><dc:creator><![CDATA[Yaroslav Tkachenko]]></dc:creator><pubDate>Mon, 02 Sep 2024 15:47:29 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/077e4280-258f-4dca-8881-f907c768a799_420x300.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Please <a href="https://streamingdata.substack.com/subscribe">pledge your support</a> if you find this newsletter useful. I&#8217;m not planning to introduce paid-only posts anytime soon, but I&#8217;d appreciate some support from the readers. Thank you!</em></p><div><hr></div><p>I started using the &#8220;<strong>Streaming-First Architecture</strong>&#8221; term a while ago, but recently, many people started asking what I mean specifically. I&#8217;d love to explain! </p><p>There are several building blocks, and most of these shouldn&#8217;t be surprising:</p><ul><li><p>Kafka-compatible streaming platform (e.g. Apache Kafka, Confluent, Redpanda, WarpStream).</p></li><li><p>Stream-processing frameworks (e.g. Apache Flink, Kafka Streams, Materialize).</p></li><li><p>Connectors (e.g. Kafka Connect, Flink CDC).</p></li><li><p>Various databases and datastores that work well with real-time data (e.g. OLAP engines, LakeHouses).</p></li></ul><p>However, the key principle is treating streaming data as&nbsp;<strong>a source of truth</strong>. Historically, streaming has been used for data <em>transit</em>: ingesting and delivering data to a data lake/warehouse or a database. Topics in a streaming platform like Kafka could have a few hours or a few days of retention. 
When that time is over, the data gets deleted (and if you couldn&#8217;t persist it for any reason, then it&#8217;s&#8230; gone). </p><p>In the Streaming-First Architecture, we treat streaming data as <em>a source of truth</em>, which practically means: </p><ul><li><p>Modelling topics as changelogs (you can read more about this <a href="https://streamingdata.substack.com/p/changelog-vs-append-only-data-streams">here</a>). Ideally, enabling compaction. </p><ul><li><p>This allows us to treat Kafka a bit more like a database: we can use messages not just as inserts, but also updates and deletes. It makes it possible to correct / backfill data by just emitting new messages.</p></li></ul></li><li><p>Setting <em>infinite retention</em> for the topics. This usually requires using some form of Tiered Storage. </p></li></ul><p>All of these together allow us to keep the data in the streaming platform <em>forever</em>. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!awDL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8037e1cb-57cb-491e-8bc7-e16fd10f3d40_735x500.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!awDL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8037e1cb-57cb-491e-8bc7-e16fd10f3d40_735x500.jpeg 424w, https://substackcdn.com/image/fetch/$s_!awDL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8037e1cb-57cb-491e-8bc7-e16fd10f3d40_735x500.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!awDL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8037e1cb-57cb-491e-8bc7-e16fd10f3d40_735x500.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!awDL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8037e1cb-57cb-491e-8bc7-e16fd10f3d40_735x500.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!awDL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8037e1cb-57cb-491e-8bc7-e16fd10f3d40_735x500.jpeg" width="728" height="495.23809523809524" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8037e1cb-57cb-491e-8bc7-e16fd10f3d40_735x500.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:500,&quot;width&quot;:735,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:72523,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!awDL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8037e1cb-57cb-491e-8bc7-e16fd10f3d40_735x500.jpeg 424w, https://substackcdn.com/image/fetch/$s_!awDL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8037e1cb-57cb-491e-8bc7-e16fd10f3d40_735x500.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!awDL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8037e1cb-57cb-491e-8bc7-e16fd10f3d40_735x500.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!awDL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8037e1cb-57cb-491e-8bc7-e16fd10f3d40_735x500.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3>What are the benefits? 
</h3><p>I don&#8217;t think I need to tell you about the benefits of streaming data (otherwise, this newsletter has failed &#128512;). </p><p>However, it&#8217;s not very common to keep data in the streaming platform forever, so let me expand here. It was prohibitively expensive before the introduction of Tiered Storage. And it&#8217;s not just the cost. The operational complexity of archiving data from Kafka to long-term storage like HDFS, then object storage, and now LakeHouses is still fairly high. Sure, it might be fine for a handful of topics, but doing it at scale with schema evolution and versioning support is still non-trivial. And when using historical data, you either need to support a <em>hybrid source pattern</em> (reading historical data from the archive first, then switching to the &#8220;edge&#8221; in Kafka) or <em>re-hydration</em> of topics from historical data (which surely messes up the ordering).  </p><p>In the Streaming-First Architecture, you just set the consumer to read from the <em>earliest</em> offset and&#8230; that&#8217;s it. It may seem like a small detail, but it makes a huge difference when it comes to operational complexity. Kafka API also allows you to initialize consumers with a given timestamp, which is very handy in this case.</p><p>This approach also allows you to keep a lot of heavy transformation logic in the stream-processing layer. You can push the most demanding joins and other stateful computations to your database of choice, but many stateless computations, as well as aggregations, are straightforward to implement in the streaming world.</p><h3>Something is missing</h3><p>When you only have Kafka topics and consumers / stream-processors as building blocks, you quickly realize that you can&#8217;t perform many computations on the historical data very efficiently. Kafka doesn&#8217;t support any kind of lookup queries. So, very often, you&#8217;ll have to scan through the whole topic to find a handful of records. 
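That replay pattern (consuming a changelog topic from the earliest offset to rebuild a table) is simple enough to sketch in a few lines. This is a minimal simulation with invented records: each message is an upsert by key, and a None payload acts as a tombstone (delete), mirroring compacted-topic semantics.

```python
# Rebuilding a table by replaying a changelog topic from the earliest
# offset: messages are upserts by key, None payloads are tombstones
# (deletes). The keys and records below are invented for illustration.
changelog = [
    ("user-1", {"name": "Ann", "plan": "free"}),
    ("user-2", {"name": "Bob", "plan": "pro"}),
    ("user-1", {"name": "Ann", "plan": "pro"}),  # update for user-1
    ("user-2", None),                            # delete for user-2
]

def materialize(messages):
    table = {}
    for key, value in messages:  # i.e. consume from the earliest offset
        if value is None:
            table.pop(key, None)  # tombstone: remove the key
        else:
            table[key] = value    # insert or update
    return table

assert materialize(changelog) == {"user-1": {"name": "Ann", "plan": "pro"}}
```

With a real consumer, the equivalent is setting `auto.offset.reset=earliest` (or using `offsetsForTimes` to start from a timestamp) and applying the same upsert/tombstone logic as you read.
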
The throughput is limited by the number of partitions. This may seem like a terrible and wasteful approach. However, it can work sometimes: consumers in Kafka are fairly cheap; you just need to get the networking right. Waiting extra time for backfills may also be justified. </p><p>But just like with Tiered Storage, we can expect to see a massive improvement in this area in the next ~6-9 months. Redpanda announced plans to introduce Iceberg support for Tiered Storage a while ago, Confluent is working on <a href="https://www.confluent.io/blog/introducing-tableflow/">Tableflow</a>, and WarpStream seems to be working on a query engine as well. </p><p>A query engine on top of the streaming platform is a game changer. We don&#8217;t need to archive our streaming data to a LakeHouse anymore; the streaming platform <em>has become</em> a LakeHouse! </p><p>This solves both lookup queries (as long as the query engines support predicate pushdown) and backfill throughput. </p><p>This may also solve the painful eventual consistency problem that you face when heavily relying on streaming. <a href="https://streamingdata.substack.com/p/do-you-really-need-exactly-once-delivery">Exactly-once is hard</a>. I expect most query engines built on top of Kafka storage will support some form of snapshot isolation.  </p><p>Of course, some of this is still speculation. But the direction is very clear!</p><h3>But does it scale? </h3><p>We&#8217;ll see. Personally, I&#8217;ve worked with <em>hundreds of terabytes</em> stored in the streaming platform without any issues. </p><p>But a lot depends on the implementation of Tiered Storage. I&#8217;ve seen some implementations that don&#8217;t perform well with high concurrency; they only expect the historical data to be accessed infrequently. </p><p>I have a lot of faith in the new wave of projects that heavily rely on object storage. 
<a href="https://streamingdata.substack.com/p/streaming-platforms-in-the-cloud-era">Here</a> I wrote about WarpStream, which I see as a clear leader in this area right now.</p><h3>Isn&#8217;t it just Kappa Architecture? </h3><p>In my opinion, the original Kappa Architecture is more of a vision. It doesn&#8217;t answer many hard questions that immediately come up when you try to implement it. It prescribes almost nothing beyond using Kafka. </p><p>Using a different name makes it clear: I&#8217;m proposing something quite specific and opinionated. </p>]]></content:encoded></item><item><title><![CDATA[Streaming Platforms in the Cloud Era]]></title><description><![CDATA[An ode to WarpStream.]]></description><link>https://www.streamingdata.tech/p/streaming-platforms-in-the-cloud-era</link><guid isPermaLink="false">https://www.streamingdata.tech/p/streaming-platforms-in-the-cloud-era</guid><dc:creator><![CDATA[Yaroslav Tkachenko]]></dc:creator><pubDate>Mon, 15 Jul 2024 15:47:35 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/1f612531-8119-4b22-8cde-45654d65495a_400x320.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Please <a href="https://streamingdata.substack.com/subscribe">pledge your support</a> if you find this newsletter useful. I&#8217;m not planning to introduce paid-only posts anytime soon, but I&#8217;d appreciate some support from the readers. Thank you!</em></p><div><hr></div><p>WarpStream made a lot of noise with <a href="https://www.warpstream.com/blog/kafka-is-dead-long-live-kafka">its announcement</a> in July 2023. A Kafka-compatible streaming platform based on object storage with no local disks! Some called it brilliant; some called it terrible. It made a huge impact nonetheless. </p><p>In July 2024, I&#8217;m confident in saying that this is precisely where the industry is going. 
And instead of theorizing, I want to share <strong>my personal experience of running WarpStream in production at scale</strong>.</p><h3>Recap</h3><p>If you don&#8217;t breathe streaming platforms like some of us, let me summarize the architecture behind WarpStream. It&#8217;s quite simple and elegant:</p><ul><li><p>Separate data in the data plane (e.g. payloads) from the metadata in the control plane (e.g. topic/partition information). </p></li><li><p>Use object storage as your primary (and only) data storage. Use compaction. </p></li><li><p>Make the data plane (&#8220;agents&#8221;) completely stateless. </p></li><li><p>Implement smart routing that can leverage cloud economics (e.g. traffic within the availability zone is free). </p></li></ul><p>Finally, sprinkle everything with a bunch of intelligent optimizations like <a href="https://www.warpstream.com/blog/minimizing-s3-api-costs-with-distributed-mmap">this one</a>. As a result, you get a streaming platform with higher latency but superior in every other aspect. </p><h3>Not Just WarpStream</h3><p>Many people saw the clear benefits of a cloud-native architecture that WarpStream demonstrated. In the last 12 months:</p><ul><li><p>Confluent has announced <a href="https://www.confluent.io/blog/introducing-confluent-cloud-freight-clusters/">Freight Clusters</a>. </p></li><li><p>StreamNative has announced <a href="https://streamnative.io/blog/ursa-reimagine-apache-kafka-for-the-cost-conscious-data-streaming">Ursa</a> for Apache Pulsar.</p><ul><li><p>Perhaps it was <a href="https://twitter.com/richardartoul/status/1790453437861159280">too inspired</a> by WarpStream&#8230;</p></li></ul></li><li><p>AutoMQ, a cloud-native fork of Kafka, has made <a href="https://www.automq.com/blog/innovation-in-shared-storage-makes-kafka-great-again">a similar announcement</a> as well. 
</p></li></ul><p>And a few new projects came out of stealth:</p><ul><li><p><a href="https://tektitedb.com/">Tektite</a>, not just a streaming platform but a stream-processing engine as well.</p></li><li><p><a href="https://s2.dev/">S2</a>, which tries to unify the ideas of object storage and data streaming even more. </p></li><li><p><a href="https://buf.build/blog/bufstream-kafka-lower-cost">Bufstream</a>, which provides tighter integration with the Schema Registry and Protocol Buffers. </p></li></ul><p>It&#8217;s notable that some systems decided not to get rid of the disks completely. E.g., AutoMQ explained that low latency is still extremely important for streaming workloads, so you should have an option of using an EBS disk as a Write-Ahead Log (more on this below).</p><p>I&#8217;m sure many other vendors are currently working on similar projects. But WarpStream was the first to use this architecture, so everyone will benefit from learning from its experience. </p><p>So here are some learnings. </p><h3>Simplicity</h3><p>I used to run and be on-call for Apache Kafka clusters. Large ones. In the cloud and on-prem. </p><p>I used Confluent Cloud and AWS MSK. I was a very early adopter of Redpanda Cloud.</p><p>I even used something called CloudKarafka (before the rename). </p><p>And after 7+ years of running and interacting with all these platforms, I can genuinely say that <em>WarpStream is the easiest platform to run</em>. No, I mean it. I even insisted on using the BYOC offering despite having a small team.</p><p>You just need to ensure that your WarpStream agents don&#8217;t go above 90% CPU. Ideally, with some autoscaling in place. That&#8217;s it. It&#8217;s nearly indestructible otherwise. You can casually <strong>20x</strong> the traffic, observe the auto-scaling kick in, and watch the traffic spike get absorbed! 
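</p><p>The architecture recapped earlier is what makes this possible, and it can be made concrete with a toy produce path (all class and method names here are illustrative, not WarpStream&#8217;s actual API): an agent holds no state of its own; it writes a batch file to object storage first and only then commits the batch&#8217;s location to the control-plane metadata store, which is why any agent can serve any request.</p>

```python
import uuid

class ObjectStore:
    """Stands in for S3: the only place where record data lives."""
    def __init__(self):
        self.files = {}

    def put(self, key, data):
        self.files[key] = data

class ControlPlane:
    """Stands in for the metadata store: an ordered log of committed batches."""
    def __init__(self):
        self.log = []  # (topic, partition, file_key, num_records)

    def commit(self, topic, partition, file_key, num_records):
        self.log.append((topic, partition, file_key, num_records))

class Agent:
    """Completely stateless: holds only references to the shared services."""
    def __init__(self, store, control_plane):
        self.store = store
        self.cp = control_plane

    def produce(self, topic, partition, records):
        key = f"batch-{uuid.uuid4()}"
        self.store.put(key, records)                         # data first...
        self.cp.commit(topic, partition, key, len(records))  # ...metadata second

store, cp = ObjectStore(), ControlPlane()
Agent(store, cp).produce("clicks", 0, ["a", "b"])
Agent(store, cp).produce("clicks", 0, ["c"])  # a brand-new agent handles the next batch
print(len(store.files), len(cp.log))          # two batch files, two metadata commits
```

<p>Because an agent disappears without losing anything, killing one and starting another is a non-event, and that is what makes autoscaling on CPU alone safe.</p><p>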
</p><p>You also don&#8217;t need to think about data skew and data rebalancing: everything is written to object storage and compacted right away.</p><p><strong>The data/metadata separation and stateless agents make it that simple to run.</strong></p><p>Software engineers outside of the data infra community have been criticizing Kafka for its complexity. But what if operating Kafka could be no different from operating a web service?</p><p>WarpStream talks a lot about the absence of inter-AZ traffic and the cost implications. In my opinion, the perceived simplicity of their architecture (on the user side) is more important. </p><h3>Agent Roles</h3><p>Even though the WarpStream architecture looks simple, it&#8217;s flexible enough to support different types of topologies, something that's impossible with a regular set of Apache Kafka brokers. </p><p>For example, in addition to your regular agents, you can deploy an extra set of agents (an agent group) that only has the <strong>proxy-consume</strong> role, so you can have a dedicated <em>read-only</em> Kafka endpoint. Or the opposite: agents with the <strong>proxy-produce</strong> role could be used to only support data ingestion. Or you can offload compaction and other background jobs to a dedicated set of agents with the <strong>jobs</strong> role. 
</p><p>This may eliminate the need to have different types of clusters optimized for different workloads, something I had to do a few years ago with Apache Kafka:</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!VCTd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9dd6fd0d-9496-4d4b-b0dd-4c24efb60225_1456x818.webp" width="1456" height="818" alt="" loading="lazy"><figcaption class="image-caption">We used to have different types of Apache Kafka clusters to support different workloads</figcaption></figure></div><p>To clarify, this separation of roles is not <em>really</em> needed to implement a WarpStream-like architecture. Other vendors may choose not to do it. But having it gives a lot of control to the users and can eliminate even more operational complexity. </p><h3>Cost</h3><p>Yes, it seems to be cheaper than alternatives. </p><p>Autoscaling helps a lot to avoid provisioning additional infrastructure to handle traffic spikes. </p><p>S3 is extremely cheap as primary storage. </p><p>WarpStream doesn&#8217;t care how many partitions, active TCP connections or compacted topics you have (yep, all these are <a href="https://docs.confluent.io/cloud/current/clusters/cluster-types.html#enterprise-limits-per-cluster">used by some vendors</a>). The pricing model just tracks the throughput and the number of object storage requests. 
</p><p>And then there is the no inter-AZ traffic guarantee. This requires some setup on the client side, but it&#8217;s a one-time, straightforward config change. And here&#8217;s how you can achieve paying $0 for inter-AZ networking <em>end-to-end</em>: </p><ul><li><p>Set up WarpStream BYOC and place at least one agent in each zone. </p></li><li><p>Configure <a href="https://docs.warpstream.com/warpstream/configuration/configure-clients-to-eliminate-az-networking-costs">the required client changes for your application</a>.</p></li><li><p>Ensure that each instance of a given stream-processing application (e.g. Apache Flink or Kafka Streams) is placed in the same AZ.</p></li></ul><p>As a result, <em>all</em> traffic between your streaming platform (e.g. WarpStream) and stream-processing applications is constrained to a single AZ! Of course, this may not be what you want from the reliability perspective, but if you can tolerate some downtime while reassigning workloads between AZs, this can be a very attractive option. That said, WarpStream is designed to fall back to other AZs in this scenario; you just need to ensure your stream-processing application is moved.</p><p>Despite <a href="https://azure.microsoft.com/en-ca/updates/update-on-interavailability-zone-data-transfer-pricing/">the Azure announcement</a>, I don&#8217;t believe AWS or GCP will stop charging for inter-AZ traffic soon. 
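</p><p>To give a flavour of the client-side change: one documented mechanism is embedding the client&#8217;s availability zone in the Kafka <code>client.id</code> so that agents can route the connection zone-locally. A minimal sketch follows; treat the exact <code>warpstream_az=</code> convention as an assumption to verify against the current WarpStream docs, and the helper function name as mine:</p>

```python
def consumer_config(bootstrap, group_id, availability_zone):
    """Build a Kafka consumer config whose client.id advertises the client's AZ,
    so a zone-aware proxy layer can route it to an agent in the same zone.
    The warpstream_az= key/value convention is an assumption to verify."""
    return {
        "bootstrap.servers": bootstrap,
        "group.id": group_id,
        "client.id": f"warpstream_az={availability_zone}",
    }

cfg = consumer_config("agents.internal:9092", "backfill", "us-east-1a")
print(cfg["client.id"])  # warpstream_az=us-east-1a
```

<p>The availability zone itself is usually fetched once at startup from the cloud provider&#8217;s instance metadata endpoint, which is why this stays a one-time config change rather than ongoing work.</p><p>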
</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!Qzr6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1b598da-d659-4f80-bbb8-db24e2c0b454_1142x406.png" width="1142" height="406" alt="" loading="lazy"><figcaption class="image-caption">My new favourite game: balancing traffic across AZs</figcaption></figure></div><h3>&#8220;Free&#8221; Partitions </h3><p>Well, not completely free, but close to that. </p><p>Due to the way WarpStream handles batching (any agent can batch data from any topic/partition together) AND the fact that the metadata is separated, having an <em>actively written</em> partition incurs very low overhead. And having an <em>inactive</em> partition is almost free: you just need to store the metadata (and return it in the metadata requests). </p><p>This means that you can generally create more topics and not think too much about hitting a partition limit. You&#8217;ll be fine with <em>tens of thousands</em> of partitions with a few agents, as long as you don&#8217;t write to all of them at once. In my opinion, that&#8217;s why WarpStream doesn&#8217;t use partition count for billing or throttling. 
</p><p>This might seem like a small detail, but it brings WarpStream closer to the vision of the streaming-first data architecture I&#8217;ve been trying to popularize in the last few years. A streaming platform can be a better source of truth for your data infra. It needs a few things, and the ability to have more topics and more partitions without massive clusters is one of those (I mention another at the end). </p><h3>What&#8217;s the Catch? </h3><p>You need to <a href="https://docs.warpstream.com/warpstream/configuration/configure-clients-to-eliminate-az-networking-costs">configure</a> and <a href="https://docs.warpstream.com/warpstream/configuration/configure-kafka-client/tuning-for-performance">tune</a> your clients a certain way. This is a one-time investment, though.</p><p>Transactions are WIP (but <a href="https://streamingdata.substack.com/p/do-you-really-need-exactly-once-delivery">you may be fine without them</a>).</p><p>And then there is latency. By default, you get around <strong>a second</strong> of p99 end-to-end latency. You can easily shave off a few hundred milliseconds with a couple of simple config changes (like agent batching). It&#8217;s still going to be higher than what you get with Apache Kafka or Redpanda.</p><p>However, in my experience, this is perfectly fine for many data pipelines. Populating data lakes and/or data warehouses - totally fine. Change data capture for replicating data to a search index, cache or K/V database - fine. Ingesting observability data - probably fine. A lot of ML use cases like feature extraction are also fine. </p><p>Of course, there are some domains where you do need lower latency. Fraud detection, trading, online gaming, etc. Kafka-like platforms are also frequently used for messaging between microservices. And if you have several &#8220;hops&#8221; before triggering a user-facing action, for example, latency will add up quickly. 
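</p><p>The &#8220;latency adds up&#8221; point is easy to see with a quick simulation: if each hop contributes a few hundred milliseconds, a three-hop chain lands well above a second end-to-end. The per-hop distribution below is made up purely for illustration:</p>

```python
import random

random.seed(42)

def hop_latency_ms():
    # Made-up per-hop distribution: ~400ms typical, with some spread.
    return max(0.0, random.gauss(400, 100))

def end_to_end_ms(hops):
    # Hops are sequential, so their latencies sum.
    return sum(hop_latency_ms() for _ in range(hops))

samples = sorted(end_to_end_ms(3) for _ in range(10_000))
p50 = samples[len(samples) // 2]
p99 = samples[int(len(samples) * 0.99)]
print(f"3-hop p50 ~ {p50:.0f}ms, p99 ~ {p99:.0f}ms")
```

<p>Note that percentiles don&#8217;t simply add across hops, which is exactly why you have to simulate (or, better, measure) the composed pipeline rather than multiply a single-hop p99 by the number of hops.</p><p>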
</p><p>That&#8217;s why several vendors have mentioned that you can&#8217;t just rely on object storage for low-latency use cases. AutoMQ, for example, mentions using EBS disks as a low-latency Write-Ahead Log.</p><p>In fact, I believe that the WarpStream architecture already supports this use case. <a href="https://docs.warpstream.com/warpstream/configuration/low-latency-clusters">WarpStream can use the S3 Express One Zone</a> bucket for landing data with low latency in one place and then use a compaction mechanism for moving data to the standard bucket for permanent storage. This brings the p99 end-to-end latency down to <strong>~150ms</strong>. Even AutoMQ mentions S3 Express One Zone as an alternative to EBS.</p><p>I don&#8217;t think it&#8217;ll require much work (Richie, Ryan, forgive me for saying this &#128578;) to generalize this approach, so a regular disk or a NoSQL database could be used instead for landing data. This could probably bring latency down to tens of milliseconds, which makes it comparable to a typical streaming platform. Of course, it increases cost. And this is exactly the trade-off you need to balance: cost vs latency. </p><h3>Final Thoughts</h3><p>WarpStream is not perfect. There are some rough edges, and some APIs are not yet implemented. The Schema Registry support is still WIP. </p><p>But the fundamentals are solid. I&#8217;m so excited about the future and what a WarpStream-like architecture allows us to do. </p><p>WarpStream really moves us closer to what I call streaming-first data architecture. The only missing piece right now is querying the data with something other than the Kafka protocol: a REST API, Apache Iceberg integration, a Trino connector or something similar. WarpStream already handles the storage, metadata and compaction needed for building the query engine.</p>]]></content:encoded></item></channel></rss>