Brilliant defense of Flink! Your distinction between essential and accidental complexity is exactly what this debate needs. Too many people conflate the inherent challenges of stateful stream processing with implementation quirks that can be abstracted away.
The PTF example you linked is genuinely impressive. Exposing state and timer primitives through SQL UDFs while maintaining declarative query optimization feels like the perfect middle ground between DataStream flexibility and SQL accessibility.
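For anyone who hasn't looked yet, the shape is roughly the following. This is a minimal counting sketch modeled on the Flink 2.1 PTF documentation; the class, table, and column names are illustrative, and the exact annotations (`@StateHint`, `ArgumentTrait.TABLE_AS_SET`) are worth double-checking against the docs for your version.

```java
import org.apache.flink.table.annotation.ArgumentHint;
import org.apache.flink.table.annotation.ArgumentTrait;
import org.apache.flink.table.annotation.StateHint;
import org.apache.flink.table.functions.ProcessTableFunction;
import org.apache.flink.types.Row;

// A per-key counter written as a PTF: Flink manages CountState like keyed
// operator state, but the function is invoked from plain SQL.
public class CountingPtf extends ProcessTableFunction<String> {

    // POJO holding the state, persisted and restored per PARTITION BY key.
    public static class CountState {
        public long count = 0L;
    }

    // Called once per input row of the table argument.
    public void eval(
            @StateHint CountState state,
            @ArgumentHint(ArgumentTrait.TABLE_AS_SET) Row input) {
        state.count++;
        collect("seen " + state.count + " rows for this key");
    }
}
```

Invoked from SQL as something like `SELECT * FROM CountingPtf(input => TABLE events PARTITION BY user_id)`, so the query stays declarative end to end.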
One nuance: the disaggregated storage shift mirrors what we've seen in the OLAP space with Snowflake and others. Curious how checkpoint recovery latency will compare at scale once Flink 2.2 stabilizes disaggregated state.
Totally agree! I've personally faced those critics in my workplace, the naysayers!
Really nice points!
I'm a bit confused about the purpose of PTFs. Flink already allows us to register a DataStream as a view and use that view in SQL. Given this, why do we still need PTFs?
If you think about this from the DataStream API perspective: sure, you create a DataStream and register "a view" for it (although, technically, you're converting the DataStream into a Table).
But if you're coming from the SQL perspective, you'd like to stay in the SQL world as much as possible (so you get declarative pipelines and optimizations). For example, consider this simple pipeline: Source -> Transformation -> Sink. I'd like to represent everything in SQL. If I have a non-trivial transformation that needs to be expressed programmatically, then to leverage the DataStream API for it I'd need to convert the Table to a DataStream first, apply the transformation, and convert it back to a Table. So it becomes Source (Table) -> DataStream -> Transformation -> DataStream -> Sink (Table). This is very inefficient, especially in more complex pipelines. PTFs let me stay fully in the SQL world, even though writing them requires Java programming.
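To make the round trip concrete, here's a minimal Java sketch of both variants. The catalog tables (`source_table`, `sink_table`), the column `user_id`, and the `MyPtf` function are hypothetical, and the `map` call is a placeholder for the non-trivial transformation.

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.types.Row;

public class RoundTripPipeline {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // Variant 1: Source (Table) -> DataStream -> Transformation -> DataStream -> Sink (Table).
        // Each boundary crossing hides part of the pipeline from the SQL planner.
        Table source = tEnv.from("source_table");
        DataStream<Row> stream = tEnv.toDataStream(source);

        // Placeholder for a non-trivial, programmatic transformation.
        DataStream<Row> transformed = stream.map(row -> row, stream.getType());

        tEnv.createTemporaryView("transformed", tEnv.fromDataStream(transformed));
        tEnv.executeSql("INSERT INTO sink_table SELECT * FROM transformed");

        // Variant 2: package the same logic as a PTF and stay in SQL end to end:
        //   INSERT INTO sink_table
        //   SELECT * FROM MyPtf(input => TABLE source_table PARTITION BY user_id);
    }
}
```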
I'll be writing about this more, so stay tuned.
To be honest, if users need to write code to solve the problem, I think it's better to use the DataStream/Table API; from my perspective, it's much easier to test and maintain if we only use the Table API here.
Maybe I'm too old to embrace new concepts, but the PTF examples in the documentation are too simplistic to convince me to use PTFs over the DataStream API.
Looking forward to your new blog.
Many Flink jobs today are basically "joiners/assemblers" without much logic. In my opinion, ideally that should be handled at the storage layer directly (Fluss!), but the next best thing would be something as declarative as possible. I agree with Yaro that PTFs are a game changer for this type of job, absorbing the logic that is hard to express in SQL terms. Plus you gain all the control you typically have in the DataStream API (e.g., custom state, timers), without the back and forth between APIs. You could presumably also benefit automatically from multi-way joins as implemented in FLIP-516, which is a clear advantage; as far as I know, there is no such thing built into the DataStream API. I'm personally running an experiment these days that consists of rewriting a long series of co-joins (implemented via CoProcessFunctions) as a single multi-input operator, and I'm trying several approaches, one of them being SQL/Table API + PTFs.
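For context, each link in such a chain looks roughly like the following. This is a minimal sketch assuming two keyed streams of strings and an inner-join-style buffer on each side; field names and the output format are illustrative, and a production version would also register cleanup timers.

```java
import org.apache.flink.api.common.functions.OpenContext;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.streaming.api.functions.co.KeyedCoProcessFunction;
import org.apache.flink.util.Collector;

// One link in a chain of co-joins: buffer each side until its match arrives.
// A chain of N such operators means N hops of state, serialization, and latency,
// which is what a single multi-input operator (or a PTF) can collapse.
public class PairJoin extends KeyedCoProcessFunction<String, String, String, String> {

    private transient ValueState<String> left;
    private transient ValueState<String> right;

    @Override
    public void open(OpenContext openContext) {
        left = getRuntimeContext().getState(new ValueStateDescriptor<>("left", String.class));
        right = getRuntimeContext().getState(new ValueStateDescriptor<>("right", String.class));
    }

    @Override
    public void processElement1(String value, Context ctx, Collector<String> out) throws Exception {
        if (right.value() != null) {
            out.collect(value + "|" + right.value());
        } else {
            left.update(value); // buffer until the other side shows up
        }
    }

    @Override
    public void processElement2(String value, Context ctx, Collector<String> out) throws Exception {
        if (left.value() != null) {
            out.collect(left.value() + "|" + value);
        } else {
            right.update(value);
        }
    }
}
```

(`open(OpenContext)` is the Flink 2.x signature; on 1.x it's `open(Configuration)`.)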