Please pledge your support if you find this newsletter useful. I’m not planning to introduce paid-only posts anytime soon, but I’d appreciate some support from the readers. Thank you!
Note: I started writing this post before the Redpanda announcement, so I’ll keep using “Benthos”. Nowadays, it could mean Benthos, Redpanda Connect or Bento (read below).
Intro
Benthos is a lightweight stream-processing framework for building data pipelines. It contains many different types of input and output connectors and provides transformation capabilities via a custom language called Bloblang.
I haven’t used it extensively, but I can try comparing it with Apache Flink. Benthos seems much more lightweight without strong support for stateful transformations like windowing, joins or aggregations. It’s definitely possible to implement similar functionality using built-in key-value stores (called Caches), perhaps with something like Redis. In fact, there is a guide showing how to perform a streaming join. It’ll likely not be as performant and scalable as Flink with RocksDB. It feels like Benthos is sitting somewhere between stateless and stateful systems, which is quite rare1.
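As a rough illustration of the cache-based approach, here's a hedged sketch of an enrichment lookup backed by Redis. The resource label, key field and the exact field layout are my own assumptions, so double-check against the Benthos cache/branch processor docs before using it:

```yaml
# Hypothetical enrichment: look up a user record in Redis for every message.
cache_resources:
  - label: users
    redis:
      url: tcp://localhost:6379

pipeline:
  processors:
    - branch:
        processors:
          - cache:
              resource: users
              operator: get
              key: ${! json("user_id") }
        result_map: root.user = this
```

A streaming join would layer `set` and `get` operations on top of the same cache resources, which works, but you're now hand-rolling state management that Flink gives you out of the box.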
Anyway, Benthos doesn’t need to be Flink. Benthos is really good at data integration use cases and simple stateless transformations like projections and filters. Here’s an example from the README:
```yaml
input:
  gcp_pubsub:
    project: foo
    subscription: bar

pipeline:
  processors:
    - mapping: |
        root.message = this
        root.meta.link_count = this.links.length()
        root.user.age = this.user.age.number()

output:
  redis_streams:
    url: tcp://TODO:6379
    stream: baz
    max_in_flight: 20
```
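To make the mapping concrete, here's my own example payload (not from the README). Given an input like:

```json
{"links": ["https://a.dev", "https://b.dev"], "user": {"age": "30"}}
```

the pipeline would emit the whole original document under `message`, a computed `meta.link_count` of 2, and `user.age` coerced from the string `"30"` to the number 30.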
YAML declarative configuration, lightweight Golang implementation and custom language for transformations heavily remind me of Vector from DataDog. However, Vector is really focused on observability data.
WarpStream + Benthos
In the last few months, WarpStream made a few announcements related to Benthos.
First, they embedded Benthos into their agents, then (on May 14th) introduced managed support for Benthos pipelines in the BYOC product.
I assume user growth is very important for WarpStream at this stage, and most potential users already run some form of Kafka platform, so supporting easy data mirroring matters a lot. They did it quite elegantly by introducing a special `kafka_franz_warpstream` component that can be used as both a source and a sink. It handles all the required client tuning and auth, so connecting to WarpStream is a no-brainer. Of course, Benthos can also be used for various data integration use cases, but I assume data mirroring was one of the biggest motivators behind its introduction.
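A minimal mirroring pipeline might look roughly like this. The broker addresses are placeholders, and the fields beyond the component name are my guesses based on the regular `kafka_franz` input/output, so treat it as a sketch rather than a working config:

```yaml
# Hypothetical sketch: mirror a topic from an existing Kafka cluster into WarpStream.
input:
  kafka_franz:
    seed_brokers: [ "localhost:9092" ]   # placeholder: your existing Kafka cluster
    topics: [ "events" ]
    consumer_group: warpstream-mirror

output:
  kafka_franz_warpstream:
    seed_brokers: [ "your-cluster.warpstream.example:9092" ]  # placeholder
    topic: events
```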
Redpanda Connect
On May 30th, Redpanda announced the acquisition of Benthos2 and introduced the new Redpanda Connect product, which is based on Benthos.
This is not the first data integration product from Redpanda: the managed Kafka Connect offering was available for a while. I’ve had a chance to use it, and my impression was really positive. The product somewhat lacked observability (I’d appreciate more metrics/logs/events), but it was solid otherwise.
However, Kafka Connect can be painful to manage (see below). It’s also written in Java, which is different from the rest of the Redpanda stack (which is mostly C++ and Golang).
Benthos comes with 200+ connectors, which is comparable to what Confluent Hub has. In my opinion, this acquisition makes a lot of sense, and Redpanda Connect is a really strong addition to the Redpanda family.
`rpk` (Redpanda's CLI) already supports Redpanda Connect, and the experience is quite good. It basically replaces `benthos` with `rpk connect`.
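In practice the switch is mostly mechanical. Assuming a pipeline defined in a `connect.yaml` file (the file name is my example, and the exact subcommand names may differ, so verify against the `rpk` docs):

```shell
# Previously: benthos lint connect.yaml && benthos -c connect.yaml
rpk connect lint connect.yaml
rpk connect run connect.yaml
```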
I’m still unsure about the pricing for running the Redpanda Connect pipelines, especially in the cloud environment. I’d love to compare it to the Confluent one.
The Drama
Well, you probably feel the tension already 🙂. The latest announcements from WarpStream and Redpanda happened just two weeks apart.
And here’s what exactly happened during and after the Redpanda announcement:
Alex Gallego (CEO of Redpanda) shared the announcement with a few important caveats:
- Two connectors (Snowflake and Splunk) changed their license; using them now requires a paid Redpanda subscription.
- Benthos is split into two repositories:
  - `redpanda-data/benthos`, which contains the Benthos engine and stays MIT-licensed.
  - `redpanda-data/connect`, which contains the connectors, including the paid ones.
Also, https://www.benthos.dev (the original Benthos website) stopped working for a while, and then it started redirecting to the Redpanda Connect section of Redpanda docs.
Another puzzling bit:
> If you are using benthos, it must be now called "Redpanda Connect" to remain trademark compliant.
🤔
The next day, Richie Artoul (CEO of WarpStream) announced Bento, a fork of Benthos. I’m not going to share much here (please check the post!), but the main motivation was the level of hostility during the acquisition: renaming and rebranding everything (the GitHub repository, the website, Discord and Slack channels) under Redpanda within a few hours. And, of course, making two connectors proprietary sets a precedent: if it happened once, it could happen again.
The day after, Jay Kreps (CEO of Confluent) also shared his opinion on this situation. The key insight for me was:
> 11/ The recipe for commercial success in the cloud today is a thriving open source project that defines a great platform layer with open governance combined with and a deeply differentiated cloud product that delivers that platform in a substantially better way.
And yes, Confluent’s Kafka implementation (Kora) is a proprietary closed-source project, but it’s based on the open Kafka protocol (which is a standard nowadays).
I should say that I’ve known Alex and Richie for a while, and I greatly respect their work. I don’t think anyone is completely right or wrong here. I can’t say that the acquisition was executed perfectly, and even Richie admitted they didn’t want to create the fork. However, I’d like to underline Jay’s comment: everyone would benefit from a single project/standard that can be used by several companies as a foundation for their offerings.
Maybe Redpanda could’ve minimized the number of changes and made them less disruptive. Maybe WarpStream could’ve tried harder to collaborate. Maybe Benthos (the engine) could’ve been donated to the CNCF, Apache or some other foundation. Now, in my opinion, the next best opportunity is standardizing the Benthos format: Redpanda Connect and Bento can be different implementations of the standard, but they could still support the same way of defining inputs, outputs, etc. Maybe even share the same connectors. I’m sure that’s still the case right now, since the fork has only just happened, but the delta between the projects will start growing.
Kafka Connect
Kafka Connect is a popular data integration framework that’s used by many Kafka users. Debezium, the de-facto CDC standard, was built as a Kafka Connect connector (though you can definitely run it without Kafka Connect or even Kafka). The S3 Sink connector was the go-to option for ingesting data into data lakes for a while.
Kafka Connect has many good parts:
The connector configuration is simple to define (just a JSON object).
It’s easy to create a new connector - implement the interface in Java and package it in a JAR file.
It’s really popular and widely used.
The good parts make connectors really portable. This allows many vendors to use the same standard: Confluent, Redpanda and Aiven provide managed Kafka Connect offerings. This is a big win for customers.
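For reference, a connector definition really is just a JSON object posted to the Connect REST API. Here's a hypothetical S3 sink definition; the connector class and field names follow the common Confluent S3 sink conventions, but verify them against the connector's documentation:

```json
{
  "name": "s3-sink",
  "config": {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "topics": "events",
    "s3.bucket.name": "my-data-lake",
    "s3.region": "us-east-1",
    "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
    "flush.size": "1000",
    "tasks.max": "2"
  }
}
```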
However, Kafka Connect has a major architectural flaw that makes it hard to run at scale: it tries to be an orchestrator, but it does a poor job at that.
Gunnar Morling wrote a fantastic post on this topic that deeply resonates with me. Especially this part:
> A second big source of issues is Connect’s model for the deployment of connectors, which in a way resembles the approach taken by Java application servers in the past: multiple, independent connectors are deployed and executed in shared JVM processes. This results in a lack of isolation between connectors, which can have far-reaching consequences in production scenarios:
I used to run Kafka Connect at scale. Trying to deploy different types of connectors (e.g. S3, Elasticsearch, Cassandra) on the same JVM was always a struggle. We ended up using the cluster-per-connector model, which was OK but not great. Imagine creating a dedicated K8s cluster for each application you need to deploy.
You could also argue that Kafka Connect lacks file-centric DevEx (popularized by Kubernetes), and working with JSON APIs can be awkward in the infrastructure-as-code environments. kcctl is trying to address this.
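The kcctl workflow looks roughly like this (the context name and file are my own examples; check the kcctl README for exact flags):

```shell
# Point kcctl at a Connect cluster, then manage connectors from files.
kcctl config set-context local --cluster=http://localhost:8083
kcctl apply -f s3-sink.json
kcctl get connectors
```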
The Future
So, can (and should) Benthos replace Kafka Connect?
Operational simplicity is a key distinction when comparing them. Benthos can be easily deployed in K8s without thinking too much about anything else (like clusters, workers, and tasks). The file-centric approach makes it easier to integrate with the existing infrastructure-as-code projects. I assume it’s also more memory-efficient.
Another thing is the ability to perform more advanced transformations. When building data integration pipelines, we rarely send data in exactly the same shape. Kafka Connect has SMTs (Single Message Transforms), which are really simple stateless transformations that can filter, project or route messages. For anything complicated you have to use a stream-processing framework like Kafka Streams and materialize the intermediate representation in a Kafka topic. This is great for vendors selling Kafka, but can get quite expensive.
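As an example of what an SMT can (and can't) do, here's the built-in `TimestampRouter` transform routing records to a date-suffixed topic. This fragment is illustrative; it would sit inside a connector's `config` object:

```json
{
  "transforms": "route",
  "transforms.route.type": "org.apache.kafka.connect.transforms.TimestampRouter",
  "transforms.route.topic.format": "${topic}-${timestamp}",
  "transforms.route.timestamp.format": "yyyyMMdd"
}
```

Anything stateful, or anything touching more than one record at a time, is out of scope for SMTs, which is exactly where the Kafka Streams detour (and its extra topics) comes in.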
Benthos also has an absolutely killer feature in my mind: the `http_server` source. Yes, you can expose Benthos over HTTP(S), and it can act as your data ingestion endpoint. As I previously described here, it’s no small feat. And `http_server` comes with fairly advanced features like rate limiting and WebSocket support (!).
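A minimal sketch of that pattern follows. The paths, topic name and broker address are placeholders, and the rate-limit wiring is my reading of the documented shape, so double-check it:

```yaml
# Hypothetical HTTP ingestion endpoint feeding a Kafka topic.
input:
  http_server:
    path: /ingest
    ws_path: /ingest/ws       # WebSocket ingestion on the same server
    rate_limit: ingest_limit

rate_limit_resources:
  - label: ingest_limit
    local:
      count: 500
      interval: 1s

output:
  kafka_franz:
    seed_brokers: [ "localhost:9092" ]  # placeholder
    topic: ingested-events
```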
Finally, Kafka Connect always requires, well, Kafka as a source or sink. “Kafka as a data hub” or “Kafka as the nervous system of your enterprise” is something you can always see at a Kafka Summit keynote. In reality, those point-to-point integrations sometimes do make sense (mostly from an economic standpoint). Also, not everyone uses Kafka or Kafka-like streaming platforms; Benthos, on the other hand, happily speaks GCP Pub/Sub, NSQ, Redis Streams and much more.
But first, Benthos should become a standard. The recent acquisition, repository split, and Bento fork somewhat fragmented the ecosystem. I really hope that all parties can standardize on a single engine or single format at a minimum.
PS: btw, can we still call it Benthos?
1. See this great post from Micah Wylde on this topic.
2. Can you really acquire an open-source project? Not in a traditional “acquiring a company” sense, I believe.