Last week, Aiven announced a free, open-source implementation of Apache Iceberg support for Apache Kafka topics. This is such an important milestone that I decided to take a quick break from building Irontools and write this short post.
Many Kafka vendors already support Iceberg sync (Confluent, Redpanda, Bufstream, Streambased), but this seems to be the first free and open-source implementation¹.
It’s not quite production-ready at the moment (it needs to at least support basic schema evolution), but I imagine they’ll catch up quickly.
I don’t want to rehash the many existing posts about why Iceberg is important for Kafka. Feel free to check Aiven’s announcement, as well as posts from other companies:
Introducing Tableflow from Confluent
Apache Iceberg Topics: Stream directly into your data lake from Redpanda
Single-Copy Implementation
Perhaps the most interesting implementation detail mentioned in the whitepaper is the absence of data copying. Log segments are converted to Parquet files and stored in object storage. These files are used both by the Iceberg reader and by the Kafka Tiered Storage components that serve Kafka requests. This likely means extra work on the fly, and it must affect read latency. It may sound like a bad idea, but I think it’s already proven: Streambased works in a similar fashion.
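To make the trade-off concrete, here’s a minimal sketch of the single-copy idea. This is not Aiven’s actual code: the column names (offset, key, value, timestamp) are my assumption, and pyarrow stands in for the real machinery.

```python
# A minimal sketch: one Parquet object backs both Iceberg table scans
# and Kafka fetch requests. Column names are assumed, not Aiven's layout.
import pyarrow.parquet as pq


def serve_iceberg_scan(path):
    # An Iceberg reader consumes the Parquet file as-is: a columnar table.
    return pq.read_table(path)


def serve_kafka_fetch(path, start_offset):
    # A Kafka fetch has to reassemble row-oriented records on the fly;
    # this is the extra work (and read latency) the single-copy design pays.
    table = pq.read_table(path)
    return [
        {"offset": row["offset"], "key": row["key"],
         "value": row["value"], "timestamp": row["timestamp"]}
        for row in table.to_pylist()
        if row["offset"] >= start_offset
    ]
```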
Streaming-First Architecture
I wrote about Streaming-First Architecture last year:
I believe it’s a very powerful way of building streaming data pipelines. However, I identified LakeHouse support as the last missing building block. It’s no longer missing! I didn’t want to say this when Confluent or Redpanda made their announcements: in the end, those were proprietary products. But with Aiven’s latest release, we also have a free, open-source implementation that will surely drive adoption. Maybe we’ll even see some unification (e.g., a standard set of Kafka metadata columns… a man can dream).
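For illustration only, here’s a purely hypothetical version of what such a standard set of Kafka metadata columns could look like; none of these names come from any actual spec:

```python
# Purely hypothetical: one possible "standard" set of Kafka metadata columns
# that an Iceberg sync could expose alongside the record payload.
KAFKA_METADATA_COLUMNS = {
    "_kafka_topic":     "string",               # source topic
    "_kafka_partition": "int",                  # source partition
    "_kafka_offset":    "long",                 # offset within the partition
    "_kafka_timestamp": "timestamp",            # record timestamp
    "_kafka_headers":   "map<string, binary>",  # record headers, if any
}
```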
Data Platforms in 2030
And here’s another post I wrote a year before that:
In that post, I described the idea of a “consolidated data engine”: a tightly integrated streaming log (e.g., Kafka), a LakeHouse (e.g., Iceberg), and a fast OLAP database (e.g., ClickHouse). You can build almost any data product with that combo.
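As a toy example of the read side of that combo, this is roughly what querying the lakehouse half from ClickHouse could look like; a sketch assuming the clickhouse-connect client and ClickHouse’s iceberg() table function, with placeholder host, bucket, and credentials:

```python
# Sketch of the "consolidated data engine" read path: Kafka feeds the
# Iceberg table, and ClickHouse queries it for fast OLAP.
# Host, bucket, and credentials below are placeholders.
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost")

# Scan the Iceberg table that a Kafka topic is synced into.
result = client.query(
    "SELECT count() FROM iceberg("
    "'s3://my-lake/warehouse/events', 'ACCESS_KEY', 'SECRET_KEY')"
)
print(result.result_rows)
```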
It’s 2025, and it seems like we’re already 2/3 there 😎. I’m really excited about the next 5 years!
I’m building Irontools: a suite of Apache Flink extensions to make your streaming pipelines faster, cheaper, and more flexible.
¹ For Apache Kafka. There are Kafka-compatible products like AutoMQ that released their support earlier.
Great piece, and thanks for the Streambased shout-out. It’s true we follow the same principles as Iceberg Topics, but we invert the flow. Instead of serving Kafka clients from Iceberg data as Iceberg Topics does, we serve Iceberg from Kafka topics (Iceberg processors reading data directly from Kafka).
The nice thing about this approach is that it doesn’t depend on tiered storage, so it works with any Kafka-compatible service. What’s more, this means we don’t have to wait for data to be tiered before it is available as Iceberg: everything from the beginning of the topic to the latest offset is accessible.
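For intuition, a rough sketch of that inverted flow (placeholder topic, catalog, and table names, assuming confluent-kafka and pyiceberg; not our actual implementation):

```python
# Rough sketch: an "Iceberg processor" consumes a Kafka topic directly and
# appends to an Iceberg table; no tiered storage involved. All names and
# libraries here are placeholders.
import pyarrow as pa
from confluent_kafka import Consumer
from pyiceberg.catalog import load_catalog

consumer = Consumer({"bootstrap.servers": "localhost:9092",
                     "group.id": "iceberg-processor",
                     "auto.offset.reset": "earliest"})
consumer.subscribe(["events"])

# Assumes a catalog named "default" and a table whose schema matches
# the records built below.
table = load_catalog("default").load_table("lake.events")

batch = []
while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    batch.append({"offset": msg.offset(),
                  "key": msg.key(),
                  "value": msg.value()})
    if len(batch) >= 1000:  # flush in batches to produce larger files
        table.append(pa.Table.from_pylist(batch))
        batch.clear()
```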
> ... but this seems to be the first free and open-source implementation
This is not accurate. AutoMQ’s Table Topic capability was officially released in 2024 and open-sourced on May 19th. You can find the release information here: https://github.com/AutoMQ/automq/releases/tag/1.5.0
The difference from Aiven’s solution is that AutoMQ adopts a copy-based strategy. This is mainly based on the following considerations:
1. Materializing Topic data into Iceberg requires writing logs, converting them to table format, and persisting the result. The storage/network costs are already incurred; keeping the logs adds little extra cost.
2. Table Topic lets users set shorter log TTLs, balancing cold read performance and cost—flexibility users value.
3. Iceberg is a table; Topics are append-only. Can “zero copy” handle PK tables (e.g., CDC)? When data lands in Iceberg, CRUD operations must be applied (see the sketch after this list).
4. If “zero copy” supports PK tables, does Kafka lose its “source of truth” status? Can the original stream still be reconstructed?
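To illustrate point 3, here is a toy example (with made-up record shapes) of why a primary-keyed CDC stream can’t simply be served as an append-only table:

```python
# Toy example: an append-only CDC log vs. the materialized table state.
# Iceberg must reflect the upserts and deletes, not the raw log.
cdc_stream = [
    {"op": "c", "id": 1, "name": "alice"},   # create
    {"op": "u", "id": 1, "name": "alicia"},  # update: overwrites row 1
    {"op": "c", "id": 2, "name": "bob"},     # create
    {"op": "d", "id": 2},                    # delete: removes row 2
]

state = {}  # materialized table state, keyed by primary key
for rec in cdc_stream:
    if rec["op"] in ("c", "u"):
        state[rec["id"]] = {"id": rec["id"], "name": rec["name"]}
    else:  # "d"
        state.pop(rec["id"], None)

print(state)            # {1: {'id': 1, 'name': 'alicia'}}: one row survives
print(len(cdc_stream))  # 4 log records: the log and the table diverge
```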
We also have a blog post that explains how we achieve this feature: https://www.automq.com/blog/automq-kafka-to-iceberg-table-topic