Brilliant defense of Flink! Your distinction between essential and accidental complexity is exactly what this debate needs. Too many people conflate the inherent challenges of stateful stream processing with implementation quirks that can be abstracted away.
The PTF example you linked is genuinely impressive. Exposing state and timer primitives through SQL UDFs while maintaining declarative query optimization feels like the perfect middle ground between DataStream flexibility and SQL accessibility.
One nuance: the disaggregated storage shift mirrors what we've seen in the OLAP space with Snowflake and others. Curious how checkpoint recovery latency will compare at scale once 2.2 stabilizes this.
I'm a bit confused about the purpose of PTFs. Flink already lets us register a DataStream as a view and use that view in SQL. Given that, why do we still need PTFs?
If you think about this from the perspective of the DataStream API: sure, you create a DataStream and register "a view" for that (although, technically, you're converting a DataStream into a Table).
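To make "registering a view" concrete, here is a minimal sketch of that conversion using the Flink Table API bridge. This is an illustrative assumption, not code from the thread: the names (`events`, `events_view`) are made up, and it assumes a Flink 1.19+/2.x classpath.

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class ViewRegistration {
  public static void main(String[] args) {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

    DataStream<String> events = env.fromData("a", "b", "c");

    // "Registering a view" under the hood: the DataStream is first
    // converted into a Table...
    Table table = tEnv.fromDataStream(events);

    // ...and the Table is then registered under a name SQL can reference.
    tEnv.createTemporaryView("events_view", table);
    tEnv.sqlQuery("SELECT * FROM events_view");
  }
}
```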
But if you're coming from the SQL perspective, you'd like to stay in the SQL world as much as possible (so you keep declarative pipelines and the optimizer's help). For example, consider this simple pipeline: Source -> Transformation -> Sink. I'd like to represent everything in SQL. If I have a non-trivial transformation that needs to be expressed programmatically, then to leverage the DataStream API for it I'd need to convert the Table to a DataStream first, apply the transformation, and convert it back to a Table. So it becomes Source (Table) -> DataStream -> Transformation -> DataStream -> Sink (Table). This is very inefficient, especially in more complex pipelines. PTFs let me stay fully in the SQL world, even though writing them requires Java programming.
To be honest, if users need to write code to solve the problem anyway, then from my perspective it's better to use the DataStream/Table API directly, because the pipeline is much easier to test and maintain that way.
Maybe I'm too old to embrace new concepts, but the PTF examples in the documentation are too simplistic to convince me to use PTFs over the DataStream API.
Really nice points!
I'll be writing about this more, so stay tuned.
Looking forward to your new blog.