7 Comments
Tom Scott:

Great piece, and thanks for the Streambased shout-out. It's true that we follow the same principles as Iceberg Topics, but we invert the flow: instead of serving Kafka clients from Iceberg data, as Iceberg Topics does, we serve Iceberg from Kafka topics (Iceberg processors read data directly from Kafka).

The nice thing about this approach is that it is not dependent on tiered storage, so it works with any Kafka-compatible service. What's more, this means we don't have to wait for data to be tiered before it is available as Iceberg: everything from the beginning of the topic to the latest offset is accessible.

Kaiming Wan:

> ... but this seems to be the first free and open-source implementation

This is not accurate. AutoMQ's Table Topic capability was officially released in 2024 and open-sourced on May 19th. You can find the release information here: https://github.com/AutoMQ/automq/releases/tag/1.5.0

The difference from Aiven's solution is that AutoMQ adopts a copy-based strategy, mainly for the following reasons.

1. Materializing Topic data into Iceberg requires writing logs, converting to table format, and persisting. Storage/network costs are already incurred; keeping logs adds little extra cost.

2. Table Topic lets users set shorter log TTLs, balancing cold-read performance and cost, a flexibility users value.

3. Iceberg is a table; Topics are append-only. Can “zero copy” handle primary-key (PK) tables (e.g., from CDC)? When data lands in Iceberg, the corresponding insert/update/delete operations must be applied.

4. If “zero copy” supports PK tables, does Kafka lose its “source of truth” status? Can the original stream still be reconstructed?
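The PK-table concern in points 3 and 4 can be illustrated with a small sketch (hypothetical events and function names, not AutoMQ code): materializing CDC events into a keyed table applies upserts and deletes, so the final table no longer contains the original event stream.

```python
# Hedged sketch: why a primary-key table differs from an append-only log.
# Events are (operation, key, value) tuples, as a CDC feed might emit them.

def apply_cdc(events):
    """Materialize insert/update/delete events into a PK-keyed table."""
    table = {}
    for op, key, value in events:
        if op in ("insert", "update"):
            table[key] = value      # upsert: later events overwrite earlier ones
        elif op == "delete":
            table.pop(key, None)    # a delete removes the row entirely
    return table

events = [
    ("insert", 1, "alice"),
    ("insert", 2, "bob"),
    ("update", 1, "alicia"),
    ("delete", 2, None),
]

print(apply_cdc(events))  # {1: 'alicia'}
```

The log held four events, but the materialized table ends with a single row: the original stream cannot be reconstructed from the table alone, which is why applying CRUD semantics in Iceberg raises the source-of-truth question above.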

We also have a blog to explain how we achieve this feature: https://www.automq.com/blog/automq-kafka-to-iceberg-table-topic

Yaroslav Tkachenko:

I updated the post.

Yaroslav Tkachenko:

Appreciate your feedback! What I meant was "free and open-source, usable with any Apache Kafka cluster out there". AutoMQ is a Kafka-compatible product.

Stanislav Kozlovski:

It shouldn’t affect end-to-end latency, because Kafka tiers segments to object storage asynchronously and only once a certain time or size threshold is surpassed (e.g., every 12 hours).

Whatever data is kept locally should be completely unaffected, because it’s a different path.

The cold data being read from S3 via the Kafka API, though, should certainly see some increased CPU use and perhaps latency. That usually isn’t referred to as end-to-end latency, though.
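The hot/cold split described here is controlled by topic-level retention settings in Kafka's tiered storage (KIP-405, Kafka 3.6+). A minimal config sketch, with illustrative values rather than recommendations:

```
# Topic-level tiered storage configs (KIP-405)
remote.storage.enable=true     # tier closed segments to object storage asynchronously
local.retention.ms=43200000    # keep ~12 hours on local disk (hot read path)
retention.ms=604800000         # total retention (7 days); older data served from remote
segment.bytes=1073741824       # only closed segments tier, so segment size bounds tiering lag
```

With settings like these, reads within the local retention window never touch object storage, which is why the hot path is unaffected while cold reads go through S3.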

Yaroslav Tkachenko:

Yeah, I meant the read path.

Stanislav Kozlovski:

The cold read path, right? Since there's a hot read path too. IMO it's worth being explicit about these things.
