-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-52969][SQL] Support DSv2 OrcScan Dynamic Partition Pruning #52009
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
dongjoon-hyun
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for making a contribution, @flaming-archer .
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/orc/OrcScan.scala
Outdated
Show resolved
Hide resolved
...core/src/test/scala/org/apache/spark/sql/execution/datasources/FileBasedDataSourceTest.scala
Outdated
Show resolved
Hide resolved
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcQuerySuite.scala
Outdated
Show resolved
Hide resolved
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcQuerySuite.scala
Outdated
Show resolved
Hide resolved
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcQuerySuite.scala
Outdated
Show resolved
Hide resolved
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcQuerySuite.scala
Outdated
Show resolved
Hide resolved
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcQuerySuite.scala
Show resolved
Hide resolved
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcTest.scala
Outdated
Show resolved
Hide resolved
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcTest.scala
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In addition to the above comments, I have one question. I guess you did compare this with the corresponding ParquetScan.scala logic. Could you confirm if there is any difference, @flaming-archer ?
In fact, I also rewrote ParquetScan with the same code can work. But I haven't figured out how to change these two classes together yet. Or rewrite a class so that both inherit it. I used to write more Java and less Scala before。。 |
29356cc to
1151c13
Compare
|
@dongjoon-hyun Pls take a look at it again. I have used ./dev/scalafmt format scala files already. |
dongjoon-hyun
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for being late, @flaming-archer . I was on vacation since 8/15 until this Monday.
While revisiting your contribution, I'd like to recommend you to make a Parquet PR first to get more community attentions and to enlarge the impact of your contributions.
In fact, I also rewrote ParquetScan with the same code can work. But I haven't figured out how to change these two classes together yet. Or rewrite a class so that both inherit it.
As you know, Parquet is the Apache Spark's default file-based source format. Most test coverages are based on Parquet features. So, your contribution will be tested fully if you make a Parquet PR first. After than, we can come back to the other data sources like ORC (this PR).
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
Lines 1838 to 1842 in 0c9af99
| val DEFAULT_DATA_SOURCE_NAME = buildConf("spark.sql.sources.default") | |
| .doc("The default data source to use in input/output.") | |
| .version("1.3.0") | |
| .stringConf | |
| .createWithDefault("parquet") |
No problem, I will submit a Parquet version later in the next day or two. |
|
Thank you so much, @flaming-archer . |
What changes were proposed in this pull request?
Support ORCScan Dynamic Partition Prune to greatly improves performance.
Why are the changes needed?
Improve dsv2 and enhance the query performance of commonly used ORC files
Does this PR introduce any user-facing change?
No.
How was this patch tested?
UT.
Was this patch authored or co-authored using generative AI tooling?
NO.