
Conversation


@flaming-archer flaming-archer commented Aug 13, 2025

What changes were proposed in this pull request?

Support dynamic partition pruning for the DSv2 OrcScan, which greatly improves performance.
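Dynamic partition pruning (DPP) derives partition filters at runtime from the dimension side of a join, so the fact-table scan can skip whole partitions. A minimal, self-contained Scala sketch of the idea (the names and data here are illustrative, not Spark's actual internals):

```scala
// Illustrative sketch of dynamic partition pruning (DPP): the scan of a
// partitioned fact table drops partitions at runtime, using join keys
// produced by the filtered dimension side. All names and data here are
// hypothetical; this is not Spark's implementation.
object DppSketch {
  // Fact table laid out as partitionKey -> rows
  val factPartitions: Map[Int, Seq[String]] = Map(
    1 -> Seq("a", "b"),
    2 -> Seq("c"),
    3 -> Seq("d", "e"))

  // Keep only partitions whose key survived the dimension-side filter.
  // These keys are known only at runtime, after the dimension is scanned.
  def prune(runtimeKeys: Set[Int]): Map[Int, Seq[String]] =
    factPartitions.filter { case (key, _) => runtimeKeys.contains(key) }
}
```

With a runtime filter of `Set(1, 3)`, partition 2 is never read, which is where the scan-time savings come from.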

Why are the changes needed?

Improve DSv2 and enhance the query performance of the commonly used ORC file format.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

UT.

Was this patch authored or co-authored using generative AI tooling?

NO.

@github-actions github-actions bot added the SQL label Aug 13, 2025
@dongjoon-hyun dongjoon-hyun changed the title [SPARK-52969][SQL] Support DSv2 OrcScan dpp [SPARK-52969][SQL] Support DSv2 OrcScan Dynamic Partition Prune Aug 13, 2025

@dongjoon-hyun dongjoon-hyun left a comment


Thank you for making a contribution, @flaming-archer .


@dongjoon-hyun dongjoon-hyun left a comment


In addition to the above comments, I have one question. I guess you did compare this with the corresponding ParquetScan.scala logic. Could you confirm if there is any difference, @flaming-archer ?

@flaming-archer

In addition to the above comments, I have one question. I guess you did compare this with the corresponding ParquetScan.scala logic. Could you confirm if there is any difference, @flaming-archer ?

In fact, I also rewrote ParquetScan, and the same code works there. But I haven't figured out how to change these two classes together yet, or whether to extract a common class that both inherit from.

I used to write more Java and less Scala.
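One common Scala way to share the logic across the two scans is to pull the runtime-filtering behavior into a trait that both mix in. A hedged sketch (the names here are illustrative, not the real Spark classes or the DSv2 `SupportsRuntimeV2Filtering` API):

```scala
// Sketch of factoring shared runtime partition-filtering logic into a
// common trait that both an ORC scan and a Parquet scan could mix in.
// All names are hypothetical; this is not Spark's actual class hierarchy.
trait RuntimePartitionFiltering {
  // Partition keys this scan would read before any runtime filtering
  def allPartitionKeys: Seq[Int]

  // Runtime filter keys pushed down from the join's dimension side
  protected var runtimeKeys: Option[Set[Int]] = None

  def filter(keys: Set[Int]): Unit = runtimeKeys = Some(keys)

  // Fall back to reading everything when no runtime filter arrived
  def selectedPartitionKeys: Seq[Int] =
    runtimeKeys.fold(allPartitionKeys)(keys => allPartitionKeys.filter(keys))
}

final class OrcScanSketch(val allPartitionKeys: Seq[Int])
  extends RuntimePartitionFiltering

final class ParquetScanSketch(val allPartitionKeys: Seq[Int])
  extends RuntimePartitionFiltering
```

With this shape, the DPP handling lives in one place and each format-specific scan only supplies its own partition listing.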

@flaming-archer

@dongjoon-hyun Please take a look at it again. I have already formatted the Scala files with ./dev/scalafmt.

@flaming-archer flaming-archer changed the title [SPARK-52969][SQL] Support DSv2 OrcScan Dynamic Partition Prune [SPARK-52969][SQL] Support DSv2 OrcScan Dynamic Partition Pruning Aug 14, 2025

@dongjoon-hyun dongjoon-hyun left a comment


Sorry for being late, @flaming-archer. I was on vacation from 8/15 until this Monday.

While revisiting your contribution, I'd like to recommend that you make a Parquet PR first to get more community attention and to enlarge the impact of your contributions.

In fact, I also rewrote ParquetScan with the same code can work. But I haven't figured out how to change these two classes together yet. Or rewrite a class so that both inherit it.

As you know, Parquet is Apache Spark's default file-based source format. Most test coverage is based on Parquet features. So, your contribution will be tested fully if you make a Parquet PR first. After that, we can come back to the other data sources like ORC (this PR).

val DEFAULT_DATA_SOURCE_NAME = buildConf("spark.sql.sources.default")
  .doc("The default data source to use in input/output.")
  .version("1.3.0")
  .stringConf
  .createWithDefault("parquet")

@flaming-archer


No problem, I will submit a Parquet version in the next day or two.

@dongjoon-hyun

Thank you so much, @flaming-archer .
