Why Apache Flink Is Not Going Anywhere
Not anytime soon.
Astrologers proclaim the month of criticizing Apache Flink[1], I thought to myself recently. In the last few months, I have read many posts on social media, company blogs, and newsletters that attack Flink from different angles. So, today, I’d like to reply with this statement: Flink is not going anywhere.
What Critics Say
Flink Is Too Complex
“Flink is a complex distributed system, it’s a mistake to start with it”, “Watermarking and windowing are so complex”, “Stateful Flink pipelines are so hard to manage, RocksDB is a PITA and requires tons of tuning”, “Schema and state evolution are impossible to get right”, “There are very few real low-latency use cases”, etc.
I could go on, but you get the gist.
First of all, complexity is not something you can measure. You can’t say that tool A is 37% more complex than tool B. Something that is complex for one individual/team/organization is not necessarily complex for another, and vice versa. People often cite Postgres as an example of a simple tool that does the job. However, I believe most of them don’t mean you should actually set up and manage a Postgres cluster yourself. Because managing Postgres, especially after reaching a certain scale, can be quite painful. Even managed solutions like AWS RDS don’t fully shield you from that.
Second, you need to differentiate between essential and accidental complexity. Accidental complexity comes from lack of experience, over-engineering, frequent pivots, and tight deadlines. But essential complexity exists no matter what we do (that’s why it’s called essential). For example, everyone agrees that Kubernetes is a very complex technology. But I’d argue that most of the complexity is essential: reliable infrastructure and application management are just really hard by definition.
I believe the same applies to Flink. Why? Supporting a variety of data streaming use cases (from data integration to streaming ETL to real-time analytics to Complex Event Processing) is quite hard. It’s also one of the most popular Apache projects, with thousands of contributors who run Flink in many different scenarios. When a project stays that popular for that long, it inevitably reduces its accidental complexity over time; otherwise, it’d be impossible to keep adding new features.
But, most importantly, you need to see what Flink is being compared with. When many critics say that Flink is complex, they compare it with a proprietary, managed, paid service they offer. It’s easy to make that statement because you have:
On one side, an open-source technology that you need to deploy and manage yourself.
On the other side, a small surface area of a managed service.
And when it comes to operating Flink, you can’t avoid infrastructure management, SRE practices, observability, and CI/CD tooling. But the claims that it’s really hard to manage are, at best, outdated. You can choose from many vendors (all major clouds, Confluent, Ververica). Also, the Flink Kubernetes Operator makes it very easy to deploy and manage Flink in a Kubernetes environment.
There are also claims that Flink requires a “cluster setup” and that it’s “centralized”, and that you therefore need a specialized team to manage it. I guess this stems from the fact that Flink has a JobManager component which communicates with many TaskManagers (or “workers”). But with modern infrastructure management tools like the Flink Kubernetes Operator, all of this is mostly abstracted away from you, and managing Flink becomes almost as easy as managing a pool of stateless web applications. I said “almost” because every Flink pipeline is stateful in some way (e.g., storing Kafka consumer group offsets), so it requires an extra step to take or recover from a savepoint, but that can be easily automated.
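To make this concrete, here is a minimal sketch of a FlinkDeployment resource as managed by the Flink Kubernetes Operator. The name, image tag, jar path, and resource sizes are illustrative placeholders, not a production recommendation:

```yaml
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: my-pipeline                # hypothetical name
spec:
  image: flink:1.20                # illustrative image tag
  flinkVersion: v1_20
  serviceAccount: flink
  jobManager:
    resource:
      memory: "2048m"
      cpu: 1
  taskManager:
    resource:
      memory: "2048m"
      cpu: 1
  job:
    jarURI: local:///opt/flink/usrlib/my-pipeline.jar  # hypothetical jar
    parallelism: 2
    upgradeMode: savepoint   # operator takes a savepoint before upgrading
    state: running
```

With `upgradeMode: savepoint`, the take-a-savepoint-then-restore dance mentioned above is handled by the operator whenever you change the spec.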
Another claim I’ve seen: Flink is complex because it guarantees exactly-once delivery. However, this is a connector-level property: some connectors offer at-least-once, some exactly-once, and some both. Exactly-once can complicate your end-to-end system significantly (e.g. the Kafka sink uses Kafka transactions), but you can almost always fall back to at-least-once.
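In Flink SQL, for example, the guarantee is just a connector option on the sink table. The table, topic, and broker names below are hypothetical:

```sql
-- Hypothetical Kafka sink; the delivery guarantee is a per-connector option.
CREATE TABLE orders_sink (
  order_id STRING,
  amount   DECIMAL(10, 2)
) WITH (
  'connector' = 'kafka',
  'topic' = 'orders-out',
  'properties.bootstrap.servers' = 'broker:9092',
  'format' = 'json',
  -- 'at-least-once' (the default) avoids Kafka transactions entirely;
  -- switching to 'exactly-once' enables transactional writes instead.
  'sink.delivery-guarantee' = 'at-least-once'
);
```

Flipping that one option is the whole decision; you don’t pay for exactly-once unless you ask for it.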
Finally, I’d like to add this: of course, you shouldn’t use Flink for simple use cases that don’t require it. And of course, you don’t need to use all the features that Flink provides. Also, Flink is a great, versatile tool; however, sometimes, it’s more efficient to use it with something else. For example, arguably, many aggregation-related use cases can be better supported by a modern OLAP database like ClickHouse. But in some cases, it makes total sense to use windowed aggregations in Flink: ML feature engineering and sessionization come to mind. As usual, the answer to “should I use this or that” is not binary; it requires planning and research[2].
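The sessionization case, for instance, is a windowed aggregation that Flink SQL expresses directly. Assuming a hypothetical `clicks` table with an event-time attribute `ts`, a classic session window with a 15-minute gap looks like this:

```sql
-- Group events into per-user sessions separated by 15 minutes of inactivity.
SELECT
  user_id,
  SESSION_START(ts, INTERVAL '15' MINUTE) AS session_start,
  SESSION_END(ts, INTERVAL '15' MINUTE)   AS session_end,
  COUNT(*)                                AS events_in_session
FROM clicks
GROUP BY
  user_id,
  SESSION(ts, INTERVAL '15' MINUTE);
```

Sessions close based on gaps in event time, which is exactly the kind of logic an OLAP database struggles to maintain incrementally.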
Nobody Needs Realtime
Data streaming is usually associated with low latency, as opposed to batch processing. “Low latency” can mean a lot of different things in many contexts, but let’s assume that in the case of Flink, we’re talking about sub-100ms processing latency. Many critics say that very few use cases actually demand latency like this; therefore, Flink is not needed.
And this is a completely wrong take.
Data streaming is not about latency. Relatively low latency is definitely helpful, but in most scenarios it can be 100 milliseconds, 1 second, or even 10 seconds.
The important parts are data streaming semantics and continuous processing. The ability to process data streams with transformations that leverage continuous processing is extremely powerful. For example, I can design my data streams as changelogs and enjoy end-to-end upsert semantics that automatically correct data. I wrote about it here:
Modern data streaming semantics have evolved from decades-old messaging semantics. Enterprise Integration Patterns is a book released 20+ years ago, and I still recommend it today. This book explains that messaging is the best approach for integrating different systems, something that data systems still struggle with today. It also describes patterns that became the foundation for many of the tools we use today, including Kafka and Flink.
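The changelog design mentioned above can be sketched in Flink SQL with the `upsert-kafka` connector. The table, topic, and broker names are hypothetical:

```sql
-- Changelog sink: rows sharing a primary key upsert each other, so a late
-- correction automatically overwrites the earlier result downstream.
CREATE TABLE user_totals (
  user_id STRING,
  total   DECIMAL(10, 2),
  PRIMARY KEY (user_id) NOT ENFORCED
) WITH (
  'connector' = 'upsert-kafka',
  'topic' = 'user-totals',
  'properties.bootstrap.servers' = 'broker:9092',
  'key.format' = 'json',
  'value.format' = 'json'
);

-- The continuous aggregation emits updates, not append-only rows.
INSERT INTO user_totals
SELECT user_id, SUM(amount) FROM payments GROUP BY user_id;
```

Nothing here is about shaving milliseconds; the value is that downstream consumers always converge to the corrected totals.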
So, yes, we rarely need 50ms processing latency. But we need reliable ways to process data streams, and stream-processing tools like Flink are, arguably, the best answer we have.
Look at the Confluent Earnings Report!
For some reason, people think that how much Confluent makes by selling Flink is a good indicator of… I don’t really know what. Flink adoption? Popularity? Growth? Not sure. But it’s widely used as an argument to support these claims.
Context: Confluent reported its Flink ARR (Annual Recurring Revenue) for Q2 of 2025 to be $10M, stating 3x growth from the previous quarter. Q3 2025 report shows another 70% growth. So, the full picture so far, as far as I understand it: $3M (Q1) → $10M (Q2) → $17M (Q3).
Is it a low number? I don’t know. What are we comparing it with?
I’m confident that most of the data infrastructure companies launched in the past few years that compete with Flink aren’t even remotely close to that ARR number. But sure, public company vs startups, not a fair comparison.
Anyway, whatever number Confluent reports is pretty much irrelevant to Apache Flink, the technology. Here’s why:
Confluent Cloud doesn’t support the DataStream API, which is still the most popular and widely used way to author Flink pipelines.
Confluent Cloud doesn’t support any Flink connectors except the Kafka ones (or, to be correct, Confluent Kafka ones: you can’t use Confluent Flink without Confluent Kafka). This, again, eliminates a huge number of reasons to use Flink in the first place (e.g. compared to Kafka Streams). The ability to connect to arbitrary data stores, and not just Kafka, is really powerful.
Their SQL / Table API Cloud offering seemed to target the wrong audience: you had to write Flink SQL in a browser in a notebook-like environment. This is not what most engineers would want to use. This is not a Flink problem; this is a product problem. Thankfully, they’re addressing it now with IDE extensions (and the API is getting more mature).
The “I Don’t Know Java” Argument Is Outdated
Don’t want to write Java? Just use Flink SQL.
The majority of new data processing technologies favour SQL. Many of them also implement Incremental View Maintenance (IVM) semantics: reusing previously computed results to avoid full re-processing.
And it’s quite common to see this comparison: “look at the complex Flink Java code on one side and this simple SQL query on the other side”. Of course, the SQL query typically looks more concise. But Flink SQL has been around for years; you should really compare with that!
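For a fair comparison, here is what a continuously maintained aggregation looks like in Flink SQL itself, assuming a hypothetical `orders` table:

```sql
-- A continuous query: results update incrementally as new orders arrive,
-- with no custom keyed operators or manual state handling.
SELECT
  product_id,
  COUNT(*)    AS order_count,
  SUM(amount) AS revenue
FROM orders
GROUP BY product_id;
```

Put this next to the equivalent SQL from any IVM tool, and the "complex Java vs simple SQL" framing mostly disappears.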
Flink SQL has its own problems, of course. State evolution is still a big one. But it’s a tough problem for any system: very few IVM tools can actually reuse internal state when the query is modified.
Other than handling state evolution, Flink SQL (and the Table API) is mature and stable. Netflix has 10000+ Flink SQL pipelines. Shopify heavily invests in Flink SQL. LinkedIn uses Flink SQL to unify streaming and batch queries.
I believe Flink SQL can cover over 70% of common Flink use cases. If you’re writing a new Flink pipeline, look at Flink SQL first (then the Table API, and only then consider using the DataStream API).
I agree that trying to bridge databases and stream-processing systems is the right direction. Adopting the SQL-first approach makes sense. You get declarative pipelines, you leverage query optimizations, you get fully schematized datasets, etc. But you cannot, and should not, discard Flink just because it’s not a database-native tool: Flink SQL gets you really close.
Flink Is Ubiquitous
No matter what critics say, Flink adoption is really strong. Pretty much all Big Tech companies use Flink (it’s a long list to type, but just to name a few: Apple, Airbnb, Alibaba, Booking, DoorDash, eBay, Instacart, LinkedIn, Lyft, Netflix, Pinterest, Reddit, Salesforce, Shopify, Stripe, Uber). Name the top 10 companies in pretty much any vertical, and most of them would use Flink.
You could say that it happens for historical reasons or due to inertia. Look at OpenAI then[3]: they could’ve chosen any bleeding-edge tech for their data processing needs, but they ended up using Flink. There are plenty of startups that choose Flink too. My favourite example is Grepr: you won’t find Flink mentioned anywhere except the job postings, yet it’s at the core of the product.
I also don’t see many examples of companies switching from Flink to other technologies. Of course it happens, but it’s quite rare (or I just don’t hear about this, so let me know).
Flink Future: Process Table Functions (PTFs) and Disaggregated Storage
I believe that Process Table Functions (PTFs) will take Flink to another level.
PTFs give access to state and timer primitives inside the Table API. So it’s possible to implement low-level, DataStream-style operators and seamlessly integrate them into your SQL logic.
Just look at this example. No, seriously, look at it and come back.
It’s amazing, isn’t it? If you scroll a bit further, you’ll find the example with timers.
I’m not aware of any other data processing technology that exposes such powerful primitives as state and timers to user-defined functions (UDFs). And not just exposes them, but does it so elegantly. Most UDFs out there are simple, stateless bits of logic. But with PTFs, it’s possible to implement any kind of stateful operator (join, aggregation, etc.) optimized for your specific use case.
I can even imagine a PTF marketplace in the future!
Disaggregated state backend storage, introduced in Flink 2.x, is another major initiative that I’m excited about. Large stateful Flink pipelines can be hard to manage, especially when it comes to recovery. Flink follows the popular trend of shifting from local storage to object storage, which makes it possible to run relatively cheap pipelines with a lot of state without thinking about local storage.
Disaggregated storage is still rough around the edges, but the upcoming Flink 2.2 release should address some of the issues. Even though it’s still marked as experimental, I’d consider it for some production workloads after 2.2 is out.
Conclusion
I believe Flink has a very bright future. I made a big bet on that.
You could say I am incentivized to present Flink in the best light possible. But even if that’s the case, there is nothing wrong with it - I’m sure my readers are capable of making their own decisions. I’ve been writing repeatedly about the RAD stack (Rust, Arrow, DataFusion) as the next iteration of data processing tooling. I truly believe that Rust, Arrow, DataFusion and related projects will slowly replace Flink, Spark, and a bunch of other “classic” Big Data tools. Not in the next few years, though.
Flink has a lot to offer. That’s why I’m working on an advanced Flink bootcamp and training - I’d like to spread my knowledge as much as I can. This bootcamp will be great for you if you already have some Flink knowledge and would like to become an expert.
Finally, some recommendations to the critics:
Do better than forming echo chambers. We can learn from each other.
If you criticize Flink, try to be specific and offer an alternative solution. It’s easy to criticize without offering anything.
Vague statements like “Flink is complex” and “nobody needs realtime” are not helpful.
[1] Kudos to those few who got the HOMM3 reference.
[2] Which is another way to say: it depends.
[3] Sure, it was founded in 2015, but it really started to grow only in 2023.

