Brilliant defense of Flink! Your distinction between essential and accidental complexity is exactly what this debate needs. Too many people conflate the inherent challenges of stateful stream processing with implementation quirks that can be abstracted away.
The PTF example you linked is genuinely impressive. Exposing state and timer primitives through SQL UDFs while maintaining declarative query optimization feels like the perfect middle ground between DataStream flexibility and SQL accessibility.
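For anyone who hasn't looked yet, the shape is roughly the following. This is a minimal counting sketch modeled on the Flink 2.1 PTF documentation; the class, table, and column names are illustrative, and the exact annotations (`@StateHint`, `ArgumentTrait.TABLE_AS_SET`) are worth double-checking against the docs for your version.

```java
import org.apache.flink.table.annotation.ArgumentHint;
import org.apache.flink.table.annotation.ArgumentTrait;
import org.apache.flink.table.annotation.StateHint;
import org.apache.flink.table.functions.ProcessTableFunction;
import org.apache.flink.types.Row;

// A per-key counter written as a PTF: Flink manages CountState like keyed
// operator state, but the function is invoked from plain SQL.
public class CountingPtf extends ProcessTableFunction<String> {

    // POJO holding the state, persisted and restored per PARTITION BY key.
    public static class CountState {
        public long count = 0L;
    }

    // Called once per input row of the table argument.
    public void eval(
            @StateHint CountState state,
            @ArgumentHint(ArgumentTrait.TABLE_AS_SET) Row input) {
        state.count++;
        collect("seen " + state.count + " rows for this key");
    }
}
```

Invoked from SQL as something like `SELECT * FROM CountingPtf(input => TABLE events PARTITION BY user_id)`, so the query stays declarative end to end.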
One nuance: the disaggregated storage shift mirrors what we've seen in the OLAP space with Snowflake and others. Curious how checkpoint recovery latency will compare at scale once Flink 2.2 stabilizes disaggregated state.
Totally agree! I've personally faced those critics in my workplace, the naysayers!
Really nice points!
I'm a bit confused about the purpose of PTFs. Flink already allows us to register a DataStream as a view and use that view in SQL. Given this, why do we still need PTFs?
If you think about this from the DataStream API perspective: sure, you create a DataStream and register "a view" for it (although, technically, you're converting the DataStream into a Table).
But if you're coming from the SQL perspective, you'd like to stay in the SQL world as much as possible (so you get declarative pipelines and optimizations). For example, consider this simple pipeline: Source -> Transformation -> Sink. I'd like to represent everything in SQL. If I have a non-trivial transformation that needs to be expressed programmatically, then to leverage the DataStream API for it I'd need to convert the Table to a DataStream first, apply the transformation, and convert it back to a Table. So it becomes Source (Table) -> DataStream -> Transformation -> DataStream -> Sink (Table). This is very inefficient, especially in more complex pipelines. PTFs let me stay fully in the SQL world, even though writing them requires Java programming.
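To make the round trip concrete, here's a minimal Java sketch of both variants. The catalog tables (`source_table`, `sink_table`), the column `user_id`, and the `MyPtf` function are hypothetical, and the `map` call is a placeholder for the non-trivial transformation.

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.types.Row;

public class RoundTripPipeline {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // Variant 1: Source (Table) -> DataStream -> Transformation -> DataStream -> Sink (Table).
        // Each boundary crossing hides part of the pipeline from the SQL planner.
        Table source = tEnv.from("source_table");
        DataStream<Row> stream = tEnv.toDataStream(source);

        // Placeholder for a non-trivial, programmatic transformation.
        DataStream<Row> transformed = stream.map(row -> row, stream.getType());

        tEnv.createTemporaryView("transformed", tEnv.fromDataStream(transformed));
        tEnv.executeSql("INSERT INTO sink_table SELECT * FROM transformed");

        // Variant 2: package the same logic as a PTF and stay in SQL end to end:
        //   INSERT INTO sink_table
        //   SELECT * FROM MyPtf(input => TABLE source_table PARTITION BY user_id);
    }
}
```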
I'll be writing about this more, so stay tuned.
To be honest, if users need to write code to solve the problem, I think it's better to use the DataStream/Table API; from my perspective, it's much easier to test and maintain if we only use the Table API here.
Maybe I'm too old to embrace new concepts, but the PTF examples in the documentation are too simplistic to convince me to use PTFs over the DataStream API.
Looking forward to your new blog.
Many Flink jobs today are basically "joiners/assemblers" without much logic. In my opinion, ideally that should be handled at the storage layer directly (Fluss!), but the next best thing would be something as declarative as possible. I agree with Yaro that PTFs are a game changer for this type of job, absorbing the logic that is hard to express in SQL terms. Plus you gain all the control you typically have in the DataStream API (e.g., custom state, timers), without the back and forth between APIs. You could presumably also benefit automatically from multi-way joins as implemented in FLIP-516, which is a clear advantage; as far as I know, there is no such thing built into the DataStream API. I'm personally running an experiment these days that consists of rewriting a long series of co-joins (implemented via CoProcessFunctions) as a single multi-input operator, and I'm trying several approaches, one of them being SQL/Table API + PTFs.
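For context, each link in such a chain looks roughly like the following. This is a minimal sketch assuming two keyed streams of strings and an inner-join-style buffer on each side; field names and the output format are illustrative, and a production version would also register cleanup timers.

```java
import org.apache.flink.api.common.functions.OpenContext;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.streaming.api.functions.co.KeyedCoProcessFunction;
import org.apache.flink.util.Collector;

// One link in a chain of co-joins: buffer each side until its match arrives.
// A chain of N such operators means N hops of state, serialization, and latency,
// which is what a single multi-input operator (or a PTF) can collapse.
public class PairJoin extends KeyedCoProcessFunction<String, String, String, String> {

    private transient ValueState<String> left;
    private transient ValueState<String> right;

    @Override
    public void open(OpenContext openContext) {
        left = getRuntimeContext().getState(new ValueStateDescriptor<>("left", String.class));
        right = getRuntimeContext().getState(new ValueStateDescriptor<>("right", String.class));
    }

    @Override
    public void processElement1(String value, Context ctx, Collector<String> out) throws Exception {
        if (right.value() != null) {
            out.collect(value + "|" + right.value());
        } else {
            left.update(value); // buffer until the other side shows up
        }
    }

    @Override
    public void processElement2(String value, Context ctx, Collector<String> out) throws Exception {
        if (left.value() != null) {
            out.collect(left.value() + "|" + value);
        } else {
            right.update(value);
        }
    }
}
```

(`open(OpenContext)` is the Flink 2.x signature; on 1.x it's `open(Configuration)`.)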