
Conversation


@flaming-archer flaming-archer commented Aug 13, 2025

What changes were proposed in this pull request?

Support dynamic partition pruning for the DSv2 OrcScan, which greatly improves performance.
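Dynamic partition pruning (DPP) derives partition filters at runtime from the dimension side of a join, so the fact-table scan can skip whole partitions. A minimal, self-contained Scala sketch of the idea (the names and data here are illustrative, not Spark's actual internals):

```scala
// Illustrative sketch of dynamic partition pruning (DPP): the scan of a
// partitioned fact table drops partitions at runtime, using join keys
// produced by the filtered dimension side. All names and data here are
// hypothetical; this is not Spark's implementation.
object DppSketch {
  // Fact table laid out as partitionKey -> rows
  val factPartitions: Map[Int, Seq[String]] = Map(
    1 -> Seq("a", "b"),
    2 -> Seq("c"),
    3 -> Seq("d", "e"))

  // Keep only partitions whose key survived the dimension-side filter.
  // These keys are known only at runtime, after the dimension is scanned.
  def prune(runtimeKeys: Set[Int]): Map[Int, Seq[String]] =
    factPartitions.filter { case (key, _) => runtimeKeys.contains(key) }
}
```

With a runtime filter of `Set(1, 3)`, partition 2 is never read, which is where the scan-time savings come from.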

Why are the changes needed?

Improve DSv2 and enhance the query performance of the commonly used ORC file format.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

UT.

Was this patch authored or co-authored using generative AI tooling?

NO.

@github-actions github-actions bot added the SQL label Aug 13, 2025
@dongjoon-hyun dongjoon-hyun changed the title [SPARK-52969][SQL] Support DSv2 OrcScan dpp [SPARK-52969][SQL] Support DSv2 OrcScan Dynamic Partition Prune Aug 13, 2025

@dongjoon-hyun dongjoon-hyun left a comment


Thank you for making a contribution, @flaming-archer .


@dongjoon-hyun dongjoon-hyun left a comment


In addition to the above comments, I have one question. I guess you did compare this with the corresponding ParquetScan.scala logic. Could you confirm if there is any difference, @flaming-archer ?

@flaming-archer

In addition to the above comments, I have one question. I guess you did compare this with the corresponding ParquetScan.scala logic. Could you confirm if there is any difference, @flaming-archer ?

In fact, I also rewrote ParquetScan, and the same code works there. But I haven't figured out how to change these two classes together yet, or whether to extract a common class that both inherit from.

I used to write more Java and less Scala.
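One common Scala way to share the logic across the two scans is to pull the runtime-filtering behavior into a trait that both mix in. A hedged sketch (the names here are illustrative, not the real Spark classes or the DSv2 `SupportsRuntimeV2Filtering` API):

```scala
// Sketch of factoring shared runtime partition-filtering logic into a
// common trait that both an ORC scan and a Parquet scan could mix in.
// All names are hypothetical; this is not Spark's actual class hierarchy.
trait RuntimePartitionFiltering {
  // Partition keys this scan would read before any runtime filtering
  def allPartitionKeys: Seq[Int]

  // Runtime filter keys pushed down from the join's dimension side
  protected var runtimeKeys: Option[Set[Int]] = None

  def filter(keys: Set[Int]): Unit = runtimeKeys = Some(keys)

  // Fall back to reading everything when no runtime filter arrived
  def selectedPartitionKeys: Seq[Int] =
    runtimeKeys.fold(allPartitionKeys)(keys => allPartitionKeys.filter(keys))
}

final class OrcScanSketch(val allPartitionKeys: Seq[Int])
  extends RuntimePartitionFiltering

final class ParquetScanSketch(val allPartitionKeys: Seq[Int])
  extends RuntimePartitionFiltering
```

With this shape, the DPP handling lives in one place and each format-specific scan only supplies its own partition listing.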

@flaming-archer

@dongjoon-hyun Please take a look at it again. I have already formatted the Scala files with ./dev/scalafmt.

@flaming-archer flaming-archer changed the title [SPARK-52969][SQL] Support DSv2 OrcScan Dynamic Partition Prune [SPARK-52969][SQL] Support DSv2 OrcScan Dynamic Partition Pruning Aug 14, 2025

@dongjoon-hyun dongjoon-hyun left a comment


Sorry for being late, @flaming-archer. I was on vacation from 8/15 until this Monday.

While revisiting your contribution, I'd like to recommend that you make a Parquet PR first to get more community attention and to enlarge the impact of your contributions.

In fact, I also rewrote ParquetScan with the same code can work. But I haven't figured out how to change these two classes together yet. Or rewrite a class so that both inherit it.

As you know, Parquet is Apache Spark's default file-based source format. Most test coverage is based on Parquet features. So, your contribution will be tested fully if you make a Parquet PR first. After that, we can come back to the other data sources like ORC (this PR).

val DEFAULT_DATA_SOURCE_NAME = buildConf("spark.sql.sources.default")
  .doc("The default data source to use in input/output.")
  .version("1.3.0")
  .stringConf
  .createWithDefault("parquet")

@flaming-archer


No problem, I will submit a Parquet version in the next day or two.

@dongjoon-hyun

Thank you so much, @flaming-archer .
