Introducing Streamling: Performant and Extensible Data Streaming Runtime
Transactional engine built with Rust, Arrow and DataFusion.
Streamling is a performant and extensible data streaming runtime built with the RAD stack (Rust, Arrow, DataFusion).
I shared some progress before (post one, post two), but it’s finally open-sourced! I’ve been working on Streamling together with a great team at Goldsky. Goldsky uses Streamling as the engine for its flagship product, Turbo Pipelines. It’s been running in production for months, powering hundreds of pipelines.
Jeff Ling, Goldsky’s CTO, wrote an announcement on the Goldsky blog covering Streamling’s vision, adoption, and next steps. In this post, I’d like to focus on some interesting implementation details and Streamling positioning (when should you use Streamling, over, say, Apache Flink?)
sources:
raw.transactions:
type: kafka
topic: raw.event.transaction
transforms:
large_transactions:
type: sql
primary_key: id
sql: |
SELECT *
FROM raw.transactions
WHERE amount > 1000
sinks:
pg.large_transactions:
from: large_transactions
type: postgres
schema: public
table: large_transactions
primary_key: idThis is a sample Streamling pipeline that consumes data from a Kafka topic and writes it to a Postgres table, while also executing a SQL transformation to filter data.
Streamling has several built-in connectors and transforms (more can be added via a flexible plugin system; I’ll expand on that below). Kafka, ClickHouse, Postgres, HTTP enrichment - all covered.
Yet Another Runtime?
Is there really a need for yet another data streaming runtime? Let’s first explore existing options on the market:
Classic “Big Data” engines: Apache Flink, Apache Spark, Kafka Streams. Fairly complex distributed systems, primarily designed to support streaming analytics workloads.
A new category of Streaming Databases: Materialize, Feldera, RisingWave, Arroyo. SQL-only interface, but typically stronger consistency guarantees.
Data integration tools: Kafka Connect, Redpanda Connect, Vector.
Most of these tools are targeted at data engineers building analytical data pipelines.
What makes Streamling different is its focus on transactional workloads. It doesn’t support joins or aggregations (the bread and butter of streaming analytics). Instead, it offers a streaming engine that can be used by application developers, back-end engineers, and folks building data pipelines that power user-facing products. Here are a few concrete examples:
As an application developer, you rarely want to write complex SQL to describe business logic. That’s why Streamling offers first-class support for TypeScript transforms (thanks to WebAssembly).
As an application developer, a lot of your data sources and data destinations are HTTP API endpoints. Not Kafka clusters or data lakes. Streamling supports using HTTP endpoints for transformations (e.g., for enrichment) and for sinks (e.g., as webhooks).
As an application developer, you want to write processed data to a database you’re familiar with, not another Kafka topic or an S3 bucket. That’s why Streamling has great support for Postgres.
Thanks to these features, Streamling really shines when it comes to connecting different application datastores and processing/routing data powering application backends.
Streamling can also be used for the classic realtime ETL and data integration pipelines. Thanks to DataFusion, you have access to a wide range of functions. Why use it over something like Apache Flink or similar tools in this case?
It’s much simpler to understand. No distributed shuffles, single binary to run.
It’s much more cost-efficient - it’s designed to work at a small scale very efficiently, but it can be scaled out too.
YAML-driven declarative configuration is a really good fit for agentic tools. We had a lot of success with generating pipeline configurations with LLMs. Streamling comes with a no-op “--validate” mode, which can be used by the agentic harness.
And because it’s so extensible, you can bring your domain knowledge into the engine. For example, Goldsky uses it to process blockchain datasets. But you can also implement a few plugins to support your custom datasets. Or ingest IoT telemetry. Or process financial transactions.
Architecture
Streamling follows a few key design decisions:
[Mostly] Single-node execution. Streamling is built on top of Apache DataFusion, an extensible query engine. Even though the engine is single-node, you can still easily parallelize work by leveraging the mechanics provided by data sources. For example, when using Kafka as a source, multiple Streamling instances use the same Kafka Consumer group, which naturally distributes source partitions across instances1. Single-node execution simplifies debugging and makes local development very straightforward.
[Mostly] Stateless processing. Streamling provides a State Backend that can be used by any component in the system. It’s currently designed to store small amounts of metadata: Kafka source stores its current consumer group offsets after receiving an acknowledged checkpoint. SQLite and Postgres are supported State Backends, but more implementations can be added in the future.
Serious delivery guarantees. Streamling leverages the Chandy–Lamport style of checkpoints to ensure at-least-once delivery. There are also various deduplication mechanisms, including UPSERT semantics in most sinks. Combining at-least-once delivery and deduplication mechanisms ensures effectively-once delivery in many scenarios.
Extreme efficiency. Thanks to Rust and Arrow, Streamling can process tens of thousands of messages per second on a 0.5 CPU core! And you can go as low as 0.25 CPU cores for some workloads.
Delightful developer experience. Almost everything can be extended as a plugin. Built-in validation mode. Live inspection of in-flight data. Print and blackhole (no-op) sinks for debugging / testing. Instant startup. OpenTelemetry integration.
There are also a few important implementation details to highlight:
Streamling relies on DataFusion as much as possible. Native connectors are implemented as Table Providers; SQL transform is “just” running DataFusion SQL, but with custom operators and UDFs. If you really zoom out, the runtime can be simplified as passing Apache Arrow batches using Tokio Streams. This forms a simple, but very efficient pull-based data processing engine.
Apache Arrow is used as the main in-memory data format and as a type system. Even though it’s column-oriented, it’s slowly becoming a standard not just for analytical systems, but for data processing in general. Especially when some kind of interop is needed, e.g. between Python and JVM: Arrow data is stored in memory in a language-independent format. Streamling leverages native Arrow support as much as possible. For example, ClickHouse integration relies on native Arrow support in ClickHouse, which dramatically reduces serialization/deserialization cost.
Using columnar data layout for streaming workloads is still not very common. But it makes a lot of sense in practice. Micah Wylde described it well here.
But Stateless?
You could argue that having a mostly stateless streaming system must be very limiting. Maybe you don’t need windowed aggregations or joins for analytical use cases, but these are still useful building blocks. For example, a join is a typical way to perform lookups. You may have a small “dimensional” dataset that you need to match against your data stream. This is common outside of analytics.
In fact, we’ve seen this use case over and over again, especially in the context of data filtering. Imagine you have a stream of user activity data, but you need to filter a small subset of it based on a list of user IDs, product IDs, or other dimensions. Can you build it without stateful joins?
We approached this pragmatically and developed a feature called dynamic tables. Not only can you perform filtering without storing any state in a streaming system, but you can also update the lookup data dynamically, without restarting or redeploying your pipelines. The implementation is quite straightforward: we use Postgres (again) to store the lookup data, so you can easily modify it programmatically or with your favourite Postgres client. At runtime, Streamling queries the underlying Postgres database, combines the data and applies filtering. Since Streamling operates on batches of data, we can perform lookups quite efficiently, amortizing the cost of an external call across thousands of rows in a single batch. We’re able to achieve decent throughput without introducing any additional in-memory cache, although it remains an option. You can also implement Dynamic Table support for Redis, Valkey, DynamoDB or any other fast key-value database, if Postgres starts to become a bottleneck.
Plugins
Streamling is designed to be easily extendable with plugins. As you probably expect, you can implement your own source, transformation, or sink as a plugin.
These plugin types offer two different levels of abstraction. At the higher level, as a plugin developer, you just need to implement the following interface (an example for sinks):
#[async_trait]
pub trait SinkPlugin: SupportsGracefulShutdown + Send + Sync {
async fn initialize(&self) -> Result<(), PluginError>;
fn labels(&self) -> Vec<PluginLabel>
async fn process_batch(&self, data: RecordBatch) -> Result<(), PluginError>;
async fn process_checkpoint_marker(&self, epoch: CheckpointEpoch) -> Result<(), PluginError>;
async fn process_checkpoint_finalizer(&self, epoch: CheckpointEpoch)
-> Result<(), PluginError>;
}process_batch is where you’d implement the sink logic: simply react to the incoming RecordBatch payloads. And if you want to participate in the checkpointing mechanism (which is required for at-least-once delivery), implement your checkpointing acknowledgment logic in process_checkpoint_marker (“prepare” phase) and finalize processing in process_checkpoint_finalizer (“commit” phase).
You can also implement custom UDFs (for SQL transforms) and even custom topology preprocessor plugins: these allow you to programmatically modify the defined topology before starting.
#[async_trait]
pub trait PreprocessorPlugin: Send + Sync {
async fn preprocess_topology(&self, config: String) -> Result<String, PluginError>;
}For example, it can be used to define custom logical data sources and turn them into physical sources (e.g. Kafka topics) at runtime. The data sources can be populated from your data catalog.
As you can see, all plugins have a very simple interface: the engine handles message passing, backpressure and checkpointing. When you create a plugin, you also get access to the input schema, the State Backend, and the metrics recorder. The important part is that as a plugin developer, you don’t need to think about the typical complexity that comes with stream processing: Streamling abstracts most of it from you. You can simply reason in terms of batches of Arrow data - how to generate, transform and use one.
The interface’s simplicity is intentional: as long as the system you want to interact with has a Rust SDK (or an API that can be called from Rust), implementing a plugin is very straightforward.
Streamling also provides macros to register plugins in the easiest way possible:
register_plugin_sink!("sqs", SqsSink);
// ...
register_plugin_preprocessor!("dataset_preprocessor", DatasetPreprocessor);
// ...
register_plugin_udf!(BytesToHexStringFunc);A plugin can be built as a regular Rust/Cargo project and dynamically linked at runtime by adding it to the STREAMLING__PLUGIN__PATH environment variable.
Compare it with the way your favourite data processing tool prescribes building connectors.
Conclusion
Streamling is a new performant and extensible data streaming runtime for transactional workloads. It makes it easy to build data pipelines for everyone, not just data engineers.
When we designed Streamling, we wanted to make pragmatic choices and minimize complexity. Is a distributed shuffle needed for the workloads we have? Not really. Can we avoid using a heavyweight state backend with something like RocksDB? Absolutely. Is it possible to use a lightweight scripting language for describing transformations instead of SQL? Definitely.
Streamling has been in development for the past few years, but it feels like we’re just getting started!
I encourage you to try out examples and consider contributing. I already have some ideas in mind that I want to prototype next - like SlateDB State Backend support or integrating datafusion-distributed.
Kafka Streams style.

