Great piece and thanks for the Streambased shout-out. It's true we follow the same principles as Iceberg Topics, but we invert the flow: instead of serving Kafka clients from Iceberg data, as Iceberg Topics does, we serve Iceberg from Kafka topics (Iceberg processors read data directly from Kafka).
The nice thing about this approach is that it is not dependent on tiered storage, so it works with any Kafka-compatible service. It also means we don't have to wait for data to be tiered before it is available as Iceberg: everything from the beginning of the topic to the latest offset is accessible.
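To illustrate why this works against any Kafka-compatible service: the full offset range is reachable through the ordinary consumer protocol, tiered or not. This is not Streambased's actual code, just a minimal sketch of that read pattern in plain Java; the broker address and topic name are placeholders.

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class FullTopicScan {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            // Assign the partition directly; no consumer group is needed for a scan.
            TopicPartition tp = new TopicPartition("orders", 0); // placeholder topic
            consumer.assign(List.of(tp));

            // The whole range, earliest offset to latest, is served by the regular
            // Fetch API, so no tiered storage is required on the broker side.
            consumer.seekToBeginning(List.of(tp));
            long end = consumer.endOffsets(List.of(tp)).get(tp);

            long position = consumer.position(tp);
            while (position < end) {
                ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<byte[], byte[]> r : records) {
                    // An Iceberg processor would convert each record to table rows here.
                    position = r.offset() + 1;
                }
            }
        }
    }
}
```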
> ... but this seems to be the first free and open-source implementation
This is not accurate. AutoMQ's Table Topic capability was released in 2024 and officially open-sourced on May 19th. You can find the release information here: https://github.com/AutoMQ/automq/releases/tag/1.5.0
The difference from Aiven's solution is that AutoMQ adopts a copy-based strategy, mainly for the following reasons:
1. Materializing topic data into Iceberg already requires writing the logs, converting them to table format, and persisting the result. The storage and network costs are incurred either way, so keeping the logs adds little extra cost.
2. Table Topic lets users set shorter log TTLs, balancing cold-read performance against cost; that flexibility is something users value (see the sketch after this list).
3. Iceberg is a table; topics are append-only logs. Can "zero copy" handle primary-key tables (e.g., CDC)? When that data lands in Iceberg, the CRUD operations must be applied.
4. If "zero copy" does support PK tables, does Kafka lose its "source of truth" status? Can the original stream still be reconstructed?
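To make point 2 concrete, here's a sketch using the stock Kafka AdminClient, not AutoMQ's own API; the broker address, topic name, and retention value are all illustrative:

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

import java.util.List;
import java.util.Map;
import java.util.Properties;

public class ShortenLogTtl {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address

        try (Admin admin = Admin.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "orders"); // placeholder topic
            // Keep only 24h of raw log; older reads would be served from the Iceberg copy.
            AlterConfigOp setRetention = new AlterConfigOp(
                    new ConfigEntry("retention.ms", String.valueOf(24L * 60 * 60 * 1000)),
                    AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(Map.of(topic, List.of(setRetention))).all().get();
        }
    }
}
```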
We also have a blog post explaining how we implemented this feature: https://www.automq.com/blog/automq-kafka-to-iceberg-table-topic
I updated the post.
Appreciate your feedback! What I meant was "free and open-source, and usable by any Apache Kafka cluster out there". AutoMQ is a Kafka-compatible product.
It shouldn't affect end-to-end latency, because Kafka tiers segments to object storage asynchronously, and only once a certain time or size threshold is surpassed (e.g. every 12 hours).
Whatever data is kept locally should be completely unaffected, because it's a different path.
Cold data read from S3 via the Kafka API, though, will certainly see some increased CPU use and perhaps latency. That usually isn't referred to as end-to-end latency, though.
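For reference, the time and size thresholds above map to topic-level settings under Kafka's KIP-405 tiered storage. A minimal sketch, assuming tiered storage is already enabled on the brokers; the topic name and values are illustrative:

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.List;
import java.util.Map;
import java.util.Properties;

public class TieredTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address

        // Segments become eligible to tier only after they close, so the segment
        // roll interval bounds how long data waits before tiering can happen.
        NewTopic topic = new NewTopic("orders", 3, (short) 3) // placeholder topic
                .configs(Map.of(
                        "remote.storage.enable", "true",    // KIP-405 tiered storage
                        "segment.ms", "43200000",           // roll segments every 12h (the threshold mentioned above)
                        "local.retention.ms", "86400000")); // keep 24h on local disk
        try (Admin admin = Admin.create(props)) {
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```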
Yeah, I meant the read path.
The cold read path, right? There's a hot read path too. IMO it's worth being explicit about these things.